0$, and
$E|Z|^p=\|f\|_p^p$, $E|Z|^q=\|f\|_q^q$. For a general parameter
$p>1$ let us apply inequality (11.5) in the already
proven case $p=2$ for the function $f^{p/2}$. We get that
$\frac p2\int f^p(x)\ln f(x)\mu(\,dx)\le \int
f^{p/2}(x)Bf^{p/2}(x)\mu(\,dx)+\frac p2\|f\|_p^p\ln\|f\|_p$.
Hence to prove Proposition~11.4 in the general case it is enough
to show that
$$
\int f^{p/2}(x)Bf^{p/2}(x)\mu(\,dx)\le\frac{p^2}{4(p-1)}
\int f^{p-1}(x) Bf(x)\mu(\,dx)
$$
for a function $f(x)=a+br_1(x)$ such that $a\ge|b|$.
The expressions in the last inequality can be calculated explicitly. As
$$
f^{p/2}(x)=\frac{(a+b)^{p/2}+(a-b)^{p/2}}2
+\frac{(a+b)^{p/2}-(a-b)^{p/2}}2\,r_1(x),
$$
and
$$
f^{p-1}(x)=\frac{(a+b)^{p-1}+(a-b)^{p-1}}2
+\frac{(a+b)^{p-1}-(a-b)^{p-1}}2\,r_1(x),
$$
this inequality can be rewritten as
$$
\align
&\[\frac{(a+b)^{p/2}-(a-b)^{p/2}}2\]^2 \\
&\quad \le \frac{p^2}{4(p-1)}\cdot
\frac{(a+b)^{p-1}-(a-b)^{p-1}}2\cdot
\frac{(a+b)-(a-b)}2
\endalign
$$
or
$$
\(\int_u^v t^{(p-2)/2}\,dt\)^2\le \int_u^v t^{p-2}\,dt\cdot
\int_u^v 1 \,dt
$$
with $u=a-|b|$ and $v=a+|b|$. But the last formula is
a simple consequence of the Schwarz inequality. Proposition 11.4 is
proved.
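This two-point logarithmic Sobolev inequality can also be checked numerically. The short Python sketch below uses the conventions of this section ($X=\{1,-1\}$, $\mu$ the uniform distribution, $r_1(x)=x$, and $B(a+br_1)=br_1$); the parameter grid is an arbitrary illustrative choice, not taken from the text:

```python
import math

def lsi_sides(a, b, p):
    """Both sides of inequality (11.5) for f = a + b*r_1 on X = {1, -1}, mu uniform.

    Here B(a + b*r_1) = b*r_1, so int f^{p-1} Bf dmu = b*(f(1)^{p-1} - f(-1)^{p-1})/2.
    """
    f1, f2 = a + b, a - b                       # f(1) and f(-1), positive when a > |b|
    entropy = 0.5 * (f1**p * math.log(f1) + f2**p * math.log(f2))
    norm_p = (0.5 * (f1**p + f2**p)) ** (1.0 / p)
    dirichlet = 0.5 * b * (f1**(p - 1) - f2**(p - 1))
    return entropy, p / (2 * (p - 1)) * dirichlet + norm_p**p * math.log(norm_p)

for p in (1.5, 2.0, 3.0, 7.0):
    for b in (-0.9, -0.3, 0.1, 0.5, 0.9):
        lhs, rhs = lsi_sides(1.0, b, p)
        assert lhs <= rhs + 1e-12               # (11.5) holds at every grid point
```

The inequality is verified on the grid exactly in the form used in the proof above.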
\medskip\noindent
{\it Remark:} Theorem 11.3 is sharp in the following sense. The
transformation $T_\gamma$, $T_\gamma(a+br_1(x))=a+\gamma br_1(x)$, as
a transformation from the space $L_q(X,\Cal X,\mu)$ to the space
$L_p(X,\Cal X,\mu)$ with $1<q<p<\infty$ is not a contraction if
$\gamma>\sqrt{\frac{q-1}{p-1}}$. This will be justified after the
proof of Theorem~11.3.
\medskip\noindent
{\it Proof of Theorem 11.3.}\/ Define the function
$p(t,q)=1+(q-1)e^{2t}$ for $q>1$ and
$t\ge0$. We want to prove that
$$
\[\int |U_tf(x)|^{p(t,q)}\mu(\,dx)\]^{1/p(t,q)}\le \[\int
|f(x)|^q\mu(\,dx)\]^{1/q} \quad \text {for all }t\ge0\tag11.6
$$
and functions $f$ on $X$. (The general theory helps to find the
`right' definition of the function $p(t,q)$. It is defined as the
solution of the differential equation $\frac p{2(p-1)}
\frac{dp(t)}{dt}=p$, $p(0)=q$. The coefficient $\frac p{2(p-1)}$ in
this equation agrees with the coefficient appearing in the
logarithmic Sobolev inequality (11.5).) Let us prove inequality
(11.6) first for such functions $f(x)=a+br_1(x)$ for which $a$ and $b$
are real numbers and $a\ge|b|$.
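The differential equation in the parenthetical remark above can be solved explicitly by separating the variables; the following is only a routine verification, recorded for later reference:
$$
\frac{dp(t)}{dt}=2(p(t)-1),\quad p(0)=q,
\qquad\text{hence}\qquad p(t,q)=1+(q-1)e^{2t},
$$
so that $p(0,q)=q$, $p'(t,q)=2(p(t,q)-1)$, and $p(t,q)$ increases from $q$ to infinity as $t$ grows.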
Given a function $f(x)=a+br_1(x)$ with $a\ge|b|$ define the function
$F(t)=\[\int (U_tf(x))^{p(t,q)}\mu(\,dx)\]^{1/p(t,q)}$. Observe
that $U_tf(x)=a+be^{-t}r_1(x)$, and $a\ge|b|e^{-t}$. Hence to prove
(11.6) it is enough to show that
$$
\frac{d\|U_t(f)\|_{p(t,q)}}{dt}=\frac{d F(t)}{dt}\le0 \quad
\text{for all }t>0 \tag11.7
$$
which means that the function $F(t)$ is monotone decreasing, and in
the proof we can apply the logarithmic Sobolev inequality for the
functions $f_t(x)=U_tf(x)$. We have
$$
\align
\frac{dF(t)}{dt}&=F(t)\biggl[-\frac{p'(t,q)}{p(t,q)}\ln F(t)
+\frac{p'(t,q)}{p(t,q)}\frac
{\int U_tf(x)^{p(t,q)} \ln U_t f(x)\mu(\,dx)}
{\int U_tf(x)^{p(t,q)}\mu(\,dx)} \\
&\qquad +\frac{\int U_tf(x)^{p(t,q)-1}(U_tf(x))'\mu(\,dx)}
{\int U_tf(x)^{p(t,q)}\mu(\,dx)}\biggr],
\endalign
$$
where the prime denotes partial differentiation with respect to the
variable~$t$. Since $F(t)=\|U_t(f)\|_{p(t,q)}$,
$\int U_tf(x)^{p(t,q)}\mu(\,dx)=\| U_t(f)\|_{p(t,q)}^{p(t,q)}$,
$(U_tf(x))'=-BU_tf(x)$ by the definition of the operator $B$,
$$
\int U_tf(x)^{p(t,q)-1}(U_tf(x))'\mu(\,dx)=
-\int U_tf(x)^{p(t,q)-1}B(U_tf)(x)\mu(\,dx),
$$
and $\frac{p(t,q)}{p'(t,q)}=\frac{p(t,q)}{2(p(t,q)-1)}$ with our
choice of functions, the last formula implies that the inequality
$\frac{dF(t)}{dt}\le0$ is equivalent to the relation
$$
\align
&-\|U_t(f)\|_{p(t,q)}^{p(t,q)}\ln\|U_t(f)\|_{p(t,q)}+\int U_tf^{p(t,q)}(x)\ln
U_tf(x)\mu(\,dx)\\
&\qquad-\frac {p(t,q)}{2(p(t,q)-1)}\int (U_tf)^{p(t,q)-1}(x)BU_tf(x)\mu(\,dx)\le0.
\endalign
$$
But this inequality follows from the logarithmic Sobolev inequality
if it is applied for the function $U_t(f)$ with $\bar p=p(t,q)$.
To prove relation (11.6) for a general function $f$ it is enough to
check that $|U_t(f)|\le U_t(|f|)$, i.e. $|U_t(f)(1)|\le U_t(|f|)(1)$
and $|U_t(f)(-1)|\le U_t(|f|)(-1)$ for arbitrary function $f$ and
$t\ge0$, since relation (11.6) has already been proved for the function
$|f|$. But this relation simply follows from the following calculation.
If $f(1)=A$, $f(-1)=B$, then $f(x)=\frac{A+B}2+\frac{A-B}2r_1(x)$,
$U_tf(x)=\frac{A+B}2+e^{-t}\frac{A-B}2r_1(x)$, i.e.
$U_tf(1)=\frac{1+e^{-t}}2A+\frac{1-e^{-t}}2B$, and
$U_tf(-1)=\frac{1-e^{-t}}2A+\frac{1+e^{-t}}2B$, while
$(U_t|f|)(\pm1)=\frac{1\pm e^{-t}}2|A|+\frac{1\mp e^{-t}}2|B|$.
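Inequality (11.6) can be illustrated numerically on the two-point space. In the Python sketch below, $U_tf=a+be^{-t}r_1$ and $p(t,q)=1+(q-1)e^{2t}$ (the solution of the differential equation defining $p(t,q)$ in the text); the concrete values of $q$, $t$, $a$, $b$ are arbitrary illustrative choices:

```python
import math

def l_r_norm(c0, c1, r):
    """L_r norm of the function c0 + c1*r_1 on X = {1, -1}, mu uniform."""
    return (0.5 * (abs(c0 + c1)**r + abs(c0 - c1)**r)) ** (1.0 / r)

q = 1.7                                     # any q > 1
for t in (0.0, 0.3, 1.0, 3.0):
    p = 1 + (q - 1) * math.exp(2 * t)       # p(t,q): solves p' = 2(p-1), p(0) = q
    for a, b in ((1.0, 0.5), (0.2, 1.0), (-0.3, 0.8)):
        # U_t f = a + b e^{-t} r_1; (11.6) bounds its L_{p(t,q)} norm by ||f||_q
        assert l_r_norm(a, b * math.exp(-t), p) <= l_r_norm(a, b, q) + 1e-12
```

Note that the check also covers functions with $a<|b|$, which are handled by the reduction $|U_t(f)|\le U_t(|f|)$ above.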
Let us fix some numbers $1<q<p<\infty$, and show that the norm of the
transformation $T_\gamma$ from $L_q(X,\Cal X,\mu)$ to
$L_p(X,\Cal X,\mu)$ is larger than~1 if $\gamma>
\sqrt{\frac{q-1}{p-1}}$. To see this let us compare the
$L_q$-norm of $1+\delta r_1(x)$ with the $L_p$-norm of
$T_\gamma(1+\delta r_1(x))=1+\gamma\delta r_1(x)$ for a small parameter
$\delta>0$. We have $\|1+\delta
r_1(x)\|_q=\[\frac12\((1+\delta)^q+(1-\delta)^q\)\]^{1/q}
=\[1+\frac{q(q-1)}2\delta^2+O(\delta^3)\]^{1/q}=1+\frac{q-1}2\delta^2
+O(\delta^3)$. Similarly, $\|1+\gamma\delta r_1(x)\|_p=1+\frac{p-1}2
\gamma^2\delta^2+O(\delta^3)$. If $(p-1)\gamma^2>q-1$, then the second
norm is larger than the first one for sufficiently small $\delta>0$,
and these relations imply the above remark.
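The second-order expansions above can be confirmed numerically; in the following sketch the concrete values $q=2$, $p=4$, $\delta=0.01$ are arbitrary illustrative choices:

```python
import math

def l_r_norm(c0, c1, r):
    # L_r norm of c0 + c1*r_1 on the two-point space {1, -1} with uniform measure
    return (0.5 * (abs(c0 + c1)**r + abs(c0 - c1)**r)) ** (1.0 / r)

q, p, delta = 2.0, 4.0, 0.01
gamma_crit = math.sqrt((q - 1) / (p - 1))
# above the critical value T_gamma increases the norm of 1 + delta*r_1 ...
assert l_r_norm(1, 1.1 * gamma_crit * delta, p) > l_r_norm(1, delta, q)
# ... below it this test function gives no counterexample to contractivity
assert l_r_norm(1, 0.9 * gamma_crit * delta, p) < l_r_norm(1, delta, q)
```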
\medskip\noindent {\script 11 b.) The proof of
Proposition 10.3.} \medskip\noindent
{\it Proof of Proposition 10.3.} Let us use the notation introduced
in the formulation of Proposition~10.3, and take another $k$
independent copies $\bar\xi_1^{(j)}$,\dots, $\bar\xi_n^{(j)}$,
$1\le j\le k$, of the random sequences
$\xi_1,\dots,\xi_n$ which are also independent of the sequence
$\e_1,\dots,\e_n$ appearing in the formulation of Proposition 10.3.
Let $\Cal F$ denote the $\sigma$-algebra generated by the random
variables $\xi^{(j)}_1,\dots,\xi^{(j)}_n$, $1\le j\le k$, and let
us introduce the notation $\xi^{(j,1)}_l=\xi^{(j)}_l$,
$\xi^{(j,-1)}_l=\bar\xi^{(j)}_l$, $1\le l\le n$ and $1\le j\le k$.
Let $\Cal V_k$ denote the set of $\pm1$ sequences of length $k$, and
for a $v\in\Cal V_k$ let $m(v)$ denote the number of the digits $-1$
in the sequence $v=(v(1),\dots,v(k))$. Observe that
$E\(f\left.\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\)\right|
\Cal F\)=0$ if the $\pm1$ sequence $(v(1),\dots,v(k))$ contains at
least one coordinate $-1$ (this is the point of the proof where we
exploit the canonical property of the function $f$), and
$$
E\(f\left.\(\xi_{l_1}^{(1,1)},\dots,\xi_{l_k}^{(k,1)}\)\right|\Cal F\)=
f\(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)}\)\quad\text{for all indices
}1\le l_j\le n,\; 1\le j\le k.
$$
These relations together with the Jensen inequality for conditional
expectations imply that
$$
\align
|\bar I_{n,k}(f)|^p&=\left|E\left.\(\frac1{k!}\sum_{v\in \Cal V_k}
(-1)^{m(v)} \!\! \summ\Sb 1\le l_r\le n,\;
r=1,\dots, k\\ l_r\neq l_{r'} \text{ if } r\neq r'\endSb \!\!
f\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\)\right|\Cal
F\)\right|^p\\
&\le E\(\left|\left.\frac1{k!}\sum_{v\in \Cal V_k}
(-1)^{m(v)} \!\!\!\! \summ\Sb 1\le l_r\le n,\;
r=1,\dots, k\\ l_r\neq l_{r'} \text{ if } r\neq r'\endSb \!\!\!\!
f\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\)\right|^p
\right|\Cal F\).
\endalign
$$
Hence
$$
E|\bar I_{n,k}(f)|^p\le
E\left|\frac1{k!}\sum_{v\in \Cal V_k}
(-1)^{m(v)} \summ\Sb 1\le l_r\le n,\; r=1,\dots, k\\
l_r\neq l_{r'} \text{ if } r\neq r'\endSb
f\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\)\right|^p.
\tag11.8
$$
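The step leading to (11.8) only uses the Jensen inequality $|E(\eta\mid\Cal F)|^p\le E(|\eta|^p\mid\Cal F)$ for $p\ge1$. A minimal illustration of this convexity inequality, with a random variable taking two equally likely values given the condition (the values themselves are arbitrary):

```python
import random

random.seed(0)
p = 2.7
for _ in range(100):
    # conditionally on the sigma-algebra, eta takes the values x1, x2 with probability 1/2
    x1, x2 = random.uniform(-2, 2), random.uniform(-2, 2)
    cond_mean = 0.5 * (x1 + x2)
    assert abs(cond_mean)**p <= 0.5 * (abs(x1)**p + abs(x2)**p) + 1e-12
```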
Let us introduce the random variables
$$
\tilde I_{n,k}(f)=\frac1{k!}\sum_{v\in \Cal V_k}
(-1)^{m(v)} \summ\Sb 1\le l_r\le n,\; r=1,\dots, k\\
l_r\neq l_{r'} \text{ if } r\neq r'\endSb
f\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\) \tag11.9
$$
and
$$
\tilde I_{n,k}(f,\e)=\frac1{k!}\sum_{v\in \Cal V_k}
(-1)^{m(v)} \summ\Sb 1\le l_r\le n,\; r=1,\dots, k\\
l_r\neq l_{r'} \text{ if } r\neq r'\endSb \e_{l_1}\cdots\e_{l_k}
f\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\). \tag$11.9'$
$$
Let us recall that the number $m(v)$ in these formulas denotes the
number of the digits $-1$ in the $\pm1$ sequence $v$ of length $k$, i.e.
it counts how many random variables $\xi_{l_j}^{(j,1)}$, $1\le j\le k$,
were replaced by the `secondary copy' $\xi_{l_j}^{(j,-1)}$ in the
corresponding terms of the sum in (11.9) or $(11.9')$.
I claim that the two random variables $\tilde I_{n,k}(f)$ and
$\tilde I_{n,k}(f,\e)$ defined above have the same distribution.
This statement will be formulated in a slightly more general form which
will be useful later in this work.
\medskip\noindent
{\bf Lemma 11.5.} {\it Let us consider a (non-empty) class of
functions $\Cal F$ of $k$ variables $f(x_1,\dots,x_k)$ on the space
$(X^k,\Cal X^k)$ together with the random variables $\tilde I_{n,k}(f)$
and $\tilde I_{n,k}(f,\e)$ defined in formulas (11.9) and $(11.9')$
for all $f\in \Cal F$. The joint distributions of the set of random
variables $\{\tilde I_{n,k}(f);\, f\in\Cal F\}$ and
$\{\tilde I_{n,k}(f,\e);\, f\in \Cal F\}$ agree.}
\medskip\noindent
{\it The proof of Lemma 11.5.}\/ We even claim that fixing an
arbitrary sequence $u=(u(1),\dots,u(n))$, $u(l)=\pm1$, $1\le l\le n$,
of length~$n$, the conditional distribution of the field
$\{\tilde I_{n,k}(f,\e);\,f\in \Cal F\}$ under the condition that
$(\e_1,\dots,\e_n)=u=(u(1),\dots,u(n))$ agrees with the
distribution of the field of $\{\tilde I_{n,k}(f);\, f\in\Cal F\}$.
Indeed, the random variables $\tilde I_{n,k}(f)$, $f\in\Cal F$, defined
in (11.9) are functions of a random vector consisting of coordinates
$(\xi_l^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$,
$1\le l\le n$, $1\le j\le k$, and the distribution of this random
vector does not change if we replace the coordinates
$(\xi_{l}^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$,
by $(\bar\xi_l^{(j)},\xi_l^{(j)})=(\xi^{(j,-1)}_l,\xi^{(j,1)}_l)$,
for those indices $(j,l)$ for which $u(l)=-1$ (independently of the
value of the parameter $j$) and do not modify these random vectors
for those coordinates
$(l,j)$ for which $u(l)=1$. Replacing the original vector in the
definition of the expression $\tilde I_{n,k}(f)$ in (11.9) for all
$f\in \Cal F$ by this modified vector we carry out a measure
preserving transformation. On the other hand, the random field we
get in such a way has the same distribution as the conditional
distribution of the random field $\tilde I_{n,k}(f,\e)$, $f\in\Cal F$,
with the elements defined in $(11.9')$ under the condition that
$(\e_1,\dots,\e_n)=u$ with $u=(u(1),\dots,u(n))$.
To prove the last statement let us observe that the conditional
distribution of the random field $\tilde I_{n,k}(f,\e)$,
$f\in\Cal F$, under the condition $(\e_1,\dots,\e_n)=u$ is the same
as that of the random field we obtain by putting $\e_l=u(l)$, $1\le
l\le n$, in all coordinates $\e_l$ of the random variables $\tilde
I_{n,k}(f,\e)$. On the other hand, the random variables we get in
such a way agree with the random variables we get by carrying out
the above described transformation for the random variables
$\tilde I_{n,k}(f)$, only the terms in the sums defining these
random variables are listed in a different order.
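Lemma 11.5 can be verified by exhaustive enumeration in a small case. The sketch below takes $n=3$, $k=2$, the uniform distribution on $X=\{-1,1\}$ and the kernel $f(x,y)=xy$ (all illustrative choices, not from the text), and computes the exact distributions of the variables defined in (11.9) and $(11.9')$:

```python
from itertools import product
from fractions import Fraction
from collections import Counter

n, k = 3, 2
f = lambda x, y: x * y                       # an illustrative kernel; mu uniform on {-1, 1}

def distribution(with_eps):
    counts, total = Counter(), 0
    m = n * k * 2 + (n if with_eps else 0)
    for w in product((-1, 1), repeat=m):
        # xi[j][s] is copy s (0 = primary, 1 = secondary) of the j-th sequence
        xi = [[w[(2 * j + s) * n:(2 * j + s + 1) * n] for s in (0, 1)] for j in range(k)]
        eps = w[-n:] if with_eps else (1,) * n
        val = 0
        for v in product((0, 1), repeat=k):  # v(j) = 1 marks the secondary copy
            sign = (-1) ** sum(v)
            for l1 in range(n):
                for l2 in range(n):
                    if l1 != l2:
                        val += sign * eps[l1] * eps[l2] * f(xi[0][v[0]][l1], xi[1][v[1]][l2])
        counts[val] += 1                     # the common factor 1/k! does not affect equality
        total += 1
    return {key: Fraction(c, total) for key, c in counts.items()}

assert distribution(False) == distribution(True)
```

The two exact distributions agree, as the lemma asserts.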
\medskip
Relation (11.8) and the agreement of the distributions of the random
variables $\tilde I_{n,k}(f)$ in (11.9) and $\tilde I_{n,k}(f,\e)$ in
$(11.9')$ imply that
$$
E|\bar I_{n,k}(f)|^p
\le E\left|\frac1{k!}\sum_{v\in \Cal V_k}
(-1)^{m(v)} \!\!\!\! \summ\Sb 1\le l_j\le n,\;
j=1,\dots,k\\ l_j\neq l_{j'} \text{ if } j\neq j'\endSb \!\!\!\!
\e_{l_1}\cdots\e_{l_k}
f\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\)\right|^p.
\tag11.10
$$
Let us define for all $v=(v(1),\dots,v(k))\in\Cal V_k$ the
random variable
$$
\bar I_{n,k,v}(f,\e)=\frac1{k!}\summ\Sb 1\le l_j\le n,\;j=1,\dots,
k\\ l_j\neq l_{j'} \text{ if } j\neq j'\endSb
\e_{l_1}\cdots\e_{l_k}
f\(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\), \quad v\in\Cal V_k.
$$
The distribution of the random variable $\bar I_{n,k,v}(f,\e)$
agrees with that of $\bar I_{n,k}(f,\e)$ introduced in (10.6) for all
$v\in \Cal V_k$. Hence relation (11.10) implies that
$$
\align
E|\bar I_{n,k}(f)|^p &\le E\left|\sum_{v\in \Cal V_k}
(-1)^{m(v)}\bar I_{n,k,v}(f,\e) \right|^p\\
&\le 2^{k(p-1)}\sum_{v\in\Cal V_k} E|\bar I_{n,k,v}(f,\e)|^p
=2^{kp}E|\bar I_{n,k}(f,\e)|^p.
\endalign
$$
Proposition 10.3 is proved.
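The last step of the proof uses the elementary bound $\left|\sum_{i=1}^N x_i\right|^p\le N^{p-1}\sum_{i=1}^N|x_i|^p$ (a consequence of H\"older's inequality) with $N=2^k$ terms. A quick numerical confirmation with arbitrary data:

```python
import random

random.seed(1)
N, p = 8, 2.5                               # N = 2^k with k = 3; any p >= 1 works
for _ in range(200):
    xs = [random.uniform(-1, 1) for _ in range(N)]
    assert abs(sum(xs))**p <= N**(p - 1) * sum(abs(x)**p for x in xs) + 1e-9
```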
\beginsection 12. Reduction of the main result in this work
The main result of this paper is Theorem 8.4 or its multiple integral
version Theorem~8.2. It can be considered as the multivariate version
of Theorem 4.1, and its proof is also based on a similar argument.
Following the method of the proof of Theorem~4.1 first we prove a
multivariate version of Proposition~6.1 in Proposition~12.1 and
reduce Theorem~8.4 to a simpler statement formulated in
Proposition~12.2.
The hard part of the problem is the proof of Proposition~12.2. In the
first step of its proof we reduce it with the help of Theorem 10.4
(proved by de la Pe\~na and Montgomery--Smith) to an analogous result
formulated in Proposition~$12.2'$, where the $U$-statistics to be
investigated are replaced by their decoupled $U$-statistics
counterpart introduced in Section~10. The proof of this result is
simpler, because here we have more independence. It is based on a
symmetrization argument, similar to the proof of Proposition~6.2.
The details of this symmetrization argument will be explained in the
next section. This section contains only an important preliminary
result needed in this argument, a multi-dimensional variant of
Hoeff\-ding's inequality (Theorem~3.4) formulated in Theorem~12.3.
It yields an estimate about the distribution of homogeneous
polynomials of Rademacher functions.
The first result of this Section, Proposition~12.1, can be proved in
almost the same way as its simplified version Proposition~6.1. The
only essential difference between their proofs is that Bernstein's
inequality, applied in the proof of Proposition~6.1, is replaced
now by its multivariate version Theorem~8.3. Proposition~12.1 can be
considered as the result we get by means of Theorem~8.3 and
the chaining argument. Its main content, formulated in relation~(12.1),
states that a nice class of functions $\Cal F$ has a
subclass $\Cal F_{\bar\sigma}$ of relatively small cardinality which
is also a relatively dense subclass of $\Cal F$ in the $L_2$-norm,
and that the supremum of the $U$-statistics with kernel functions
from $\Cal F_{\bar\sigma}$ can be bounded well. To get an applicable
result we also need some estimates on the number $\bar\sigma$ which
measures how dense the subclass $\Cal F_{\bar\sigma}$ is in $\Cal F$.
Such estimates are given at the end of the proposition.
In the formulation of Proposition~12.1 we introduce, similarly to
Proposition~6.1, two parameters $\bar A\ge2^k$ and $M=M(\bar A,k)$,
and this may seem unnatural at first sight. But the introduction of
these parameters turned out to be useful: similarly to the analogous
problem in Section~6, they help to fit the parameters in
Propositions 12.1 and~12.2 together when we apply these results
simultaneously.
\medskip\noindent
{\bf Proposition 12.1.} {\it Let us have the $k$-fold power
$(X^k,\Cal X^k)$ of a measurable space $(X,\Cal X)$ with some
probability measure $\mu$ on $(X,\Cal X)$ and a countable $L_2$-dense
class $\Cal F$ of functions $f(x_1,\dots,x_k)$ of $k$ variables on
$(X^k,\Cal X^k)$ with parameter $D$ and exponent~$L$, $L\ge1$, such
that all functions $f\in\Cal F$ are canonical with respect to the
measure $\mu$, and they satisfy conditions (8.4) and (8.5) with
some real number $0<\sigma\le1$. Take a sequence of independent
$\mu$-distributed random variables $\xi_1,\dots,\xi_n$,
$n\ge\max(k,2)$, and consider the (degenerate) $U$-statistics
$I_{n,k}(f)$, $f\in \Cal F$, defined in formula (8.7). Let us fix
some number $\bar A\ge2^k$.
For all numbers $M=M(k,\bar A)$ which are chosen sufficiently large
in dependence of $\bar A$ and~$k$ the following relation depending
on the numbers $\bar A$ and $M$ holds: For all numbers $u>0$ for which
$n\sigma^2\ge\(\frac u\sigma\)^{2/k}\ge ML\log\frac2\sigma$ a number
$\bar\sigma=\bar\sigma(u)$, $0\le\bar\sigma\le \sigma\le1$, and a
collection of functions $\Cal F_{\bar\sigma}=\{f_1,\dots,f_m\}
\subset\Cal F$ with $m\le D\bar\sigma^{-L}$ elements can be chosen
in such a way that the sets $\Cal D_j=\{f\:f\in \Cal F,
\int|f-f_j|^2\,d\mu \le\bar\sigma^2\}$, $1\le j\le m$, satisfy the
relation $\bigcupp_{j=1}^m\Cal D_j=\Cal F$, and the (degenerate)
$U$-statistics $I_{n,k}(f)$, $f\in\Cal F_{\bar\sigma(u)}$, satisfy
the inequality
$$
\aligned
P&\(\sup_{f\in\Cal F_{\bar\sigma(u)}}n^{-k/2}|I_{n,k}(f)|\ge \frac
u{\bar A}\)\le 2CD\exp\left\{-\alpha\(\frac u{10\bar
A\sigma}\)^{2/k}\right\} \\
&\qquad \qquad \text{if}\quad n\sigma^2\ge \(\frac u\sigma\)^{2/k}
\ge ML\log\frac2\sigma
\endaligned \tag12.1
$$
with the constants $\alpha=\alpha(k)$, $C=C(k)$ appearing in
formula~(8.9) of Theorem~8.3 and the exponent $L$ and parameter $D$
of the $L_2$-dense class $\Cal F$.
The inequalities $4\(\frac u{\bar A\bar\sigma}\)^{2/k}\ge
n\bar\sigma^2\ge\frac1{64}\(\frac u{\bar A\sigma}\)^{2/k}$ and
$n\bar\sigma^2\ge\frac{M^{2/3}(L+\beta)\log n}{1000\bar A^{4/3}}$
also hold, provided that $n\sigma^2\ge \(\frac u\sigma\)^{2/k}\ge
M(L+\beta)^{3/2}\log\frac2\sigma$ with $\beta=\max\(\frac{\log
D}{\log n},0\)$.}
\medskip\noindent
{\it Proof of Proposition 12.1.} Let us list the elements of the
countable set $\Cal F$ as $f_1,f_2,\dots$. For all $p=0,1,2,\dots$
let us choose, by exploiting the $L_2$-density property of the class
$\Cal F$, a set $\Cal F_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset
\Cal F$ with $m_p\le D\,2^{2pL}\sigma^{-L}$ elements in such a way
that $\inff_{1\le j\le m_p}\int (f-f_{a(j,p)})^2\,d\mu\le
2^{-4p}\sigma^2$ for all $f\in\Cal F$.
For all indices $a(j,p)$, $p=1,2,\dots$, $1\le j\le m_p$, choose a
predecessor $a(j',p-1)$, $j'=j'(j,p)$, $1\le j'\le m_{p-1}$, in such a
way that the functions $f_{a(j,p)}$ and $f_{a(j',p-1)}$ satisfy the
relation $\int|f_{a(j,p)}-f_{a(j',p-1)}|^2\,d\mu\le \sigma^2
2^{-4(p-1)}$. Then we have
$\int\(\frac{f_{a(j,p)}-f_{a(j',p-1)}}2\)^2\,d\mu\le4\sigma^2 2^{-4p}$
and $\supp_{x_j\in X,\,1\le j\le k}\left|
\frac{f_{a(j,p)}(x_1,\dots,x_k)-f_{a(j',p-1)}(x_1,\dots,x_k)}2\right|
\le 1$. Theorem~8.3 yields that
$$
\aligned
P(A(j,p))&=P\(n^{-k/2}|I_{n,k}(f_{a(j,p)}-f_{a(j',p-1)})|\ge
\frac{2^{-(1+p)}u}{\bar A}\)\\
&\le C \exp\left\{-\alpha\(\frac{2^{p}u}{8\bar A
\sigma}\)^{2/k} \right\}
\quad \text {if}\quad 4n\sigma^2 2^{-4p}\ge\(\frac{2^{p}u}
{8\bar A\sigma}\)^{2/k}, \\
&\qquad\qquad 1\le j\le m_p,\; p=1,2,\dots,
\endaligned \tag12.2
$$
and
$$
\align
P(B(s))&=P\(n^{-k/2}|I_{n,k}(f_{a(s,0)})|\ge \frac u{2\bar A}\)\le
C\exp\left\{-\alpha\(\frac u{2\bar A\sigma}\)^{2/k}\right\},
\quad 1\le s\le m_0, \\
&\qquad\qquad\qquad\text{if} \quad n\sigma^2\ge \(\frac u{2\bar
A\sigma}\)^{2/k}. \tag12.3
\endalign
$$
Introduce an integer $R=R(u)$, $R>0$, which satisfies the relations
$$
2^{(4+{2/k})(R+1)}\(\frac{u}{\bar A\sigma}\)^{2/k} \ge
2^{2+6/k}n\sigma^2\ge 2^{(4+2/k)R}\(\frac{u}{\bar A\sigma}\)^{2/k},
$$
and define $\bar\sigma^2=2^{-4R}\sigma^2$ and $\Cal
F_{\bar\sigma}=\Cal F_R$ (i.e.\ the class of functions $\Cal F_p$
introduced before with $p=R$). (As $n\sigma^2\ge\(\frac u\sigma\)^{2/k}$
and $\bar A\ge2^k$ by our conditions, there exists such a
positive integer $R$.) Then the cardinality~$m$ of the set $\Cal
F_{\bar\sigma}$ is clearly not greater than $D\bar\sigma^{-L}$,
and $\bigcupp_{j=1}^m \Cal D_j=\Cal F$. Beside this, the number
$R$ was chosen in such a way that the inequalities
(12.2) and (12.3) hold for $1\le p\le R$. Hence the
definition of the predecessor of an index $a(j,p)$ implies that
$$
\align
&P\(\sup_{f\in\Cal F_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|\ge \frac u{\bar
A}\) \le P\(\bigcup_{p=1}^R\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\) \\
&\qquad \le \sum_{p=1}^R\sum_{j=1}^{m_p}P(A(j,p))+\sum_{s=1}^{m_0}P(B(s))
\le \sum_{p=1}^{\infty} CD\,2^{2pL}\sigma^{-L}
\exp\left\{-\alpha\(\frac{2^{p}u}{8\bar A\sigma}\)^{2/k}
\right\}\\
&\qquad\qquad +CD\sigma^{-L}\exp\left\{-\alpha\(\frac
u{2\bar A\sigma}\)^{2/k}\right\}.
\endalign
$$
If the condition $\(\frac u\sigma\)^{2/k}\ge ML
\log\frac2\sigma$ holds with a sufficiently large constant
$M$ (depending on $\bar A$ and $k$), then the inequalities
$$
2^{2pL}\sigma^{-L}\exp\left\{-\alpha\(\frac{2^{p}u}{8\bar
A\sigma}\)^{2/k} \right\}\le 2^{-p} \exp\left\{-\alpha\(\frac{2^{p}u}
{10\bar A \sigma}\)^{2/k} \right\}
$$
hold for all $p=1,2,\dots$, and
$$
\sigma^{-L}\exp\left\{-\alpha\(\frac u{2\bar A\sigma}\)^{2/k}\right\}
\le\exp\left\{-\alpha\(\frac u{10\bar A\sigma}\)^{2/k}\right\}.
$$
Hence the previous estimate implies that
$$
\align
&P\(\sup_{f\in\Cal F_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|\ge
\frac u{\bar A}\) \le\sum_{p=1}^{\infty}CD 2^{-p}
\exp\left\{-\alpha\(\frac{2^{p}u}{10\bar A \sigma}\)^{2/k}
\right\}\\
&\qquad +CD\exp\left\{-\alpha\(\frac u{10\bar A
\sigma}\)^{2/k}\right\} \le 2CD \exp\left\{-\alpha
\(\frac u{10 \bar A\sigma}\)^{2/k}\right\},
\endalign
$$
and relation (12.1) holds. We have
$$
\align
n\bar\sigma^2&=2^{-4R} n\sigma^2\le
2^{-4R}\cdot2^{(4+2/k)(R+1)-2-6/k}\(\frac{u}{\bar A\sigma}\)^{2/k}=
2^{2-4/k}\cdot 2^{2R/k}\(\frac{u}{\bar A \sigma}\)^{2/k}\\
&=2^{2-4/k}\cdot \(\frac\sigma{\bar\sigma}\)^{1/k}\(\frac{u}{\bar A
\sigma}\)^{2/k}=2^{2-4/k}\cdot \(\frac{\bar\sigma}\sigma\)^{1/k}
\(\frac{u}{\bar A \bar\sigma}\)^{2/k},
\endalign
$$
hence $n\bar\sigma^2\le4\(\frac{u}{\bar A\bar\sigma}\)^{2/k}$.
Beside this, as $n\sigma^2\ge2^{(4+2/k)R-2-6/k}\(\frac{u}
{\bar A\sigma}\)^{2/k}$, $R\ge1$,
$$
n\bar\sigma^2=2^{-4R}n\sigma^2\ge
2^{-2-6/k}\cdot 2^{2R/k}\(\frac u{\bar
A\sigma}\)^{2/k}\ge\frac1{64}\(\frac u{\bar A\sigma}\)^{2/k}.
$$
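The choice of $R$ and the two bounds on $n\bar\sigma^2$ derived above can be checked mechanically. In the Python sketch below the sample values of $n$, $\sigma$, $u$ and $\bar A$ are arbitrary, subject only to the conditions $n\sigma^2\ge\(\frac u\sigma\)^{2/k}$ and $\bar A\ge2^k$ imposed in the text:

```python
import math

def check_R(n, sigma, u, Abar, k):
    X = (u / (Abar * sigma)) ** (2 / k)
    # largest integer R with 2^{(4+2/k)R} X <= 2^{2+6/k} n sigma^2
    R = math.floor(math.log2(2 ** (2 + 6 / k) * n * sigma**2 / X) / (4 + 2 / k))
    assert R >= 1
    sigma_bar = 2 ** (-2 * R) * sigma                       # bar-sigma^2 = 2^{-4R} sigma^2
    n_sb2 = n * sigma_bar**2
    assert n_sb2 <= 4 * (u / (Abar * sigma_bar)) ** (2 / k) + 1e-9
    assert n_sb2 >= X / 64 - 1e-9

for k in (1, 2, 3):
    for u_frac in (0.1, 0.5, 1.0):
        n, sigma = 10**6, 0.01
        u = u_frac * n ** (k / 2) * sigma ** (k + 1)        # ensures n sigma^2 >= (u/sigma)^{2/k}
        check_R(n, sigma, u, Abar=2**k, k=k)
```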
It remains to show that $n\bar\sigma^2\ge
\frac{M^{2/3}(L+\beta)\log n}{1000\bar A^{4/3}}$.
This inequality clearly holds under the conditions of Proposition~12.1
if $\sigma\le n^{-1/3}$, since in this case $\log\frac2\sigma\ge
\frac{\log n}3$, and $n\bar\sigma^2\ge\frac1{64}\(\frac u {\bar
A\sigma}\)^{2/k} \ge\frac1{64}\bar A^{-2/k}
M(L+\beta)^{3/2}\log\frac2\sigma\ge
\frac1{192}\bar A^{-2/k} M(L+\beta)\log n\ge
\frac{M^{2/3}(L+\beta)\log n}{1000\bar A^{4/3}}$ if $M= M(\bar A,k)$
is chosen sufficiently large.
If $\sigma\ge n^{-1/3}$, then the inequality $2^{(4+2/k)R}\(\frac
u{\bar A\sigma}\)^{2/k} \le2^{2+6/k} n\sigma^2$ holds.
Hence $2^{-4R}\ge 2^{-4(2+6/k)/(4+2/k)} \[\dfrac{\(\frac
u{\bar A\sigma}\)^{2/k}}{n\sigma^2}\]^{4/(4+2/k)}$, and
$$
n\bar\sigma^2=2^{-4R}n\sigma^2\ge\frac{2^{-16/3}}{\bar A^{4/3}}
(n\sigma^2)^{1-\gamma}\[\(\frac u\sigma\)^{2/k}\]^\gamma
\text{ with } \gamma=\frac4{4+\frac2k}\ge\frac23.
$$
Since $n\sigma^2\ge(\frac u\sigma)^{2/k}\ge\frac M3(L+\beta)^{3/2}$,
and $n\sigma^2\ge n^{1/3}$, the above estimates yield that
$(n\sigma^2)^{1-\gamma}\[\(\frac u\sigma\)^{2/k}\]^\gamma\ge
(n\sigma^2)^{1/3}\[\(\frac u\sigma\)^{2/k}\]^{2/3}$, and
$n\bar\sigma^2\ge \frac{\bar A^{-4/3}}{50} (n\sigma^2)^{1/3}\[\(\frac
u\sigma\)^{2/k}\]^{2/3}\ge \frac{\bar A^{-4/3}}{50}n^{1/9}\(\frac
M3\)^{2/3} (L+\beta) \ge\frac{M^{2/3}(L+\beta)\log n}{1000 \bar
A^{4/3}}$.
\medskip
Now we formulate a multivariate analog of Proposition~6.2 in
Proposition~12.2 and show that Propositions~12.1 and~12.2 imply
Theorem~8.4.
\medskip\noindent
{\bf Proposition 12.2.} {\it Let us have a probability measure $\mu$
on a measurable space $(X,\Cal X)$ together with a sequence of
independent and $\mu$ distributed random variables $\xi_1,\dots,\xi_n$
and a countable $L_2$-dense class $\Cal F$ of canonical kernel
functions $f=f(x_1,\dots,x_k)$ (with respect to the measure~$\mu$)
with some parameter $D$ and exponent $L$ on the product space
$(X^k,\Cal X^k)$ such that all functions $f\in\Cal F$ satisfy
conditions (8.4) and (8.5) with some $0<\sigma\le1$, and consider the
(degenerate) $U$-statistics $I_{n,k}(f)$ with the random sequence
$\xi_1,\dots,\xi_n$ and kernel functions $f\in\Cal F$. There
exists a sufficiently large constant $K=K(k)$ together with some
numbers $\bar C=\bar C(k)>0$, $\gamma=\gamma(k)>0$ and threshold
index $A_0=A_0(k)>0$ depending only on the order $k$ of the
$U$-statistics such that if $n\sigma^2>K(L+\beta)\log n$ with
$\beta=\max\(\frac{\log D}{\log n},0\)$,
then the degenerate $U$-statistics $I_{n,k}(f)$, $f\in\Cal F$,
satisfy the inequality
$$
P\(\sup_{f\in\Cal F}|n^{-k/2}I_{n,k}(f)|\ge A n^{k/2}\sigma^{k+1}\)
\le \bar C e^{-\gamma A^{1/2k}n\sigma^2}\quad \text{if } A\ge A_0.
\tag12.4
$$
} \medskip
We shall prove formula (8.10) by applying Proposition~12.2 with the
choice $\sigma=\bar\sigma=\bar\sigma(u)$ defined in Proposition~12.1
and the classes $\Cal F=\Cal D_j$, more precisely the classes
$\Cal F=\left\{\frac{g-f_j}2\: g\in\Cal D_j\right\}$ of functions
introduced also in Proposition~12.1, where $f_j$ is the function
appearing in the definition of the class of functions $\Cal D_j$.
Clearly,
$$
\aligned
&P\(\supp_{f\in\Cal F}n^{-k/2}|I_{n,k}(f)|\ge u\)\le
P\(\sup_{f\in\Cal F_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|\ge \frac u{\bar
A}\) \\
&\qquad\qquad +\sum_{j=1}^m P\(\sup_{g\in\Cal D_j} n^{-k/2}
\left|I_{n,k}\(\frac{f_j-g}2\)\right| \ge\(\frac12-\frac1{2\bar A}\)u\),
\endaligned \tag12.5
$$
where $m$ is the cardinality of the set of functions $\Cal
F_{\bar\sigma}$ appearing in Proposition~12.1. We want to show that
if first $\bar A$ and then $M\ge M_0(\bar A)$ are chosen sufficiently
large in Proposition~12.1, then the second term at the right-hand side
of formula~(12.5) can be well bounded by means of Proposition 12.2,
and Theorem~8.4 can be proved by means of this estimate.
To carry out this program let us choose a number $\bar A_0$ in such
a way that $\bar A_0\ge A_0$ and $\gamma\bar A_0^{1/2k}\ge\frac1K$
with the numbers $A_0$, $K$ and $\gamma$ in Proposition~12.2, put
$\bar A=\max(2^{k+2}\bar A_0,2^k)$, and apply Proposition 12.1 with
this number~$\bar A$. Then by Proposition~12.1 and the choice of the
numbers $\bar A$ and $\bar A_0$ also the inequality $\(\frac
u{\bar\sigma}\) ^{2/k}\ge\frac{\bar A^{2/k}}4n
\bar\sigma^2\ge(4\bar A_0)^{2/k}n\bar\sigma^2$ holds, hence $u\ge
4\bar A_0 n^{k/2}\bar\sigma^{k+1}$ with the number $\bar\sigma$ in
Proposition~12.1. This implies that
$\(\frac12-\frac1{2\bar A}\)u\ge\frac u4\ge\bar
A_0 n^{k/2}\bar\sigma^{k+1}$, $\bar A_0\ge A_0$,
and by replacing the expression $\(\frac12-\frac1{2\bar A}\)u$
by $\bar A_0 n^{k/2}\bar\sigma^{k+1}$ in the probabilities of the sum
in the second term at the right-hand side of (12.5) we enlarge them.
The numbers $u$ considered in these estimations
satisfy the condition $n\sigma^2\ge \(\frac u\sigma\)^{2/k}
\ge M(L+\beta)^{3/2}\log\frac2\sigma$ imposed in Proposition~12.1 with
some appropriately chosen constant $M$.
Choose the number $M\ge M(\bar A,k)$ in Proposition~12.1 (which
also can be chosen as the number~$M$ in formula~(8.10) of Theorem~8.4)
in such a way that it also satisfies the inequality
$\frac{M^{2/3}(L+\beta)\log n}{1000\bar A^{4/3}}\ge K(L+\beta)\log n$
with the number $K$ appearing in the conditions of Proposition~12.2.
With such a choice the inequality $n\bar\sigma^2
\ge\frac{M^{2/3}(L+\beta)\log n}{1000\bar A^{4/3}}
\ge K(L+\beta)\log n$ holds, and Proposition 12.2 can be applied
to bound the terms in the sum at the right-hand side of (12.5). It
yields the estimate
$$
\align
&P\(\sup_{g\in\Cal D_j} n^{-k/2}
\left|I_{n,k}\(\frac{f_j-g}2\)\right| \ge\(\frac12-\frac1{2\bar
A}\)u\)\\
&\qquad\le P\(\sup_{g\in\Cal D_j}n^{-k/2}
\left|I_{n,k}\(\frac{f_j-g}2\)\right|
\ge\bar A_0n^{k/2}\bar\sigma^{k+1}\)
\le \bar Ce^{-\gamma\bar A_0^{1/2k}n\bar\sigma^2}
\endalign
$$
for all $1\le j\le m$.
(Observe that the set of functions $\frac{f_j-g}2,\;g\in\Cal D_j$ is
an $L_2$-dense class with parameter $D$ and exponent $L$.) Hence
Proposition~12.1 (relation (12.1) together with the inequality $m\le
D\bar \sigma^{-L}$) and formula (12.4) with $A=\bar A_0$ imply that
$$
P\(\supp_{f\in\Cal F} n^{-k/2}|I_{n,k}(f)|\ge u\)
\le 2CD\exp\left\{-\alpha\(\frac u{10\bar A\sigma}\)^{2/k}\right\}
+\bar CD\bar\sigma^{-L} e^{-\gamma\bar A_0^{1/2k}n\bar\sigma^2}.
\tag12.6
$$
To get the result of Theorem~8.4 from inequality (12.6) we have to
replace its second term at the right-hand side with a more appropriate
expression where, in particular, we get rid of the coefficient
$\bar\sigma^{-L}$. The condition
$n\bar\sigma^2\ge K(L+\beta)\log n$ implies that $\bar\sigma\ge
n^{-1/2}$, and by our choice of $\bar A_0$ we have $\gamma \bar
A_0^{1/2k}n\bar\sigma^2\ge \frac1Kn\bar\sigma^2 \ge L\log n\ge
2L\log\frac1{\bar \sigma}$, i.e. $\bar\sigma^{-L}\le e^{\gamma\bar
A_0^{1/2k}n\bar\sigma^2/2}$. By the estimates of Proposition~12.1
$n\bar\sigma^2 \ge\frac1{64}\(\frac u{\bar A\sigma}\)^{2/k}$. The
above relations imply that $\bar\sigma^{-L} e^{-\gamma\bar
A_0^{1/2k}n \bar\sigma^2}\le e^{-\gamma\bar
A_0^{1/2k}n\bar\sigma^2/2}\le
\exp\left\{-\frac\gamma{128} \bar A_0^{1/2k} \bar A^{-2/k}\(\frac
u\sigma\)^{2/k}\right\}$. Hence relation (12.6) yields that
$$
\align
&P\(\supp_{f\in\Cal F}n^{-k/2}|I_{n,k}(f)|\ge u\)\\
&\qquad\le 2CD\exp \left\{-\frac\alpha{(10\bar A)^{2/k}}\(\frac
u\sigma\)^{2/k}\right\} +\bar CD\exp\left\{-\frac\gamma{128}
\bar A_0^{1/2k} \bar A^{-2/k} \(\frac u\sigma\)^{2/k}\right\},
\endalign
$$
and this estimate implies Theorem~8.4.
\medskip
Thus to complete the proof of Theorem~8.4 it is enough to prove
Proposition~12.2. It turned out to be useful to apply an approach
similar to the proof of Theorem~8.3. In the proof of Theorem~8.3
first an appropriate counterpart of this result was proved, where the
$U$-statistics
were replaced by their decoupled $U$-statistics analogs defined in
formula (10.5), and then the desired result was deduced from this
estimate and Theorem~10.4. Similarly, Proposition 12.2 will be deduced
from the following result.
\medskip\noindent
{\bf Proposition $12.2'$.} {\it Let a class of functions $f\in\Cal F$
on the $k$-fold product $(X^k,\Cal X^k)$ of a measurable space
$(X,\Cal X)$, a probability measure $\mu$ on $(X,\Cal X)$ together
with a sequence of independent and $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ satisfy the conditions of Proposition~12.2. Let
us take $k$ independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le
j\le k$, of the random sequence $\xi_1,\dots,\xi_n$, and consider the
decoupled $U$-statistics $\bar I_{n,k}(f)$, $f\in \Cal F$, defined
with their help by formula (10.5). Then there exists a sufficiently
large constant $K=K(k)$ together with some number
$\gamma=\gamma(k)>0$ and threshold index $A_0=A_0(k)>0$
depending only on the order $k$ of the decoupled $U$-statistics
we consider such that if $n\sigma^2>K(L+\beta)\log n$ with
$\beta=\max\(\frac{\log D}{\log n},0\)$,
then the (degenerate) decoupled $U$-statistics $\bar I_{n,k}(f)$,
$f\in\Cal F$, satisfy the following version of inequality (12.4):
$$
P\(\sup_{f\in\Cal F}|n^{-k/2}\bar I_{n,k}(f)|\ge A n^{k/2}\sigma^{k+1}\)
\le e^{-\gamma A^{1/2k}n\sigma^2}\quad \text{if } A\ge A_0 .\tag12.7
$$
}\medskip
It is clear that Proposition~$12.2'$ and Theorem~10.4, more explicitly
formula~$(10.8')$ in it, imply Proposition 12.2. The proof of
Proposition~$12.2'$ is based on a symmetrization argument which will be
explained in the next section. Here an important ingredient of it will
be proved, the multivariate version of Hoeff\-ding's inequality
formulated in Theorem~3.4.
\medskip\noindent
{\bf Theorem 12.3. The multivariate version of Hoeffding's inequality.}
{\it Let $\e_1$,\dots, $\e_n$ be independent random variables,
$P(\e_j=1)=P(\e_j=-1)=\frac12$, $1\le j\le n$. Fix a positive
integer~$k$, and define the random variable
$$
Z=\sum\Sb (j_1,\dots, j_k)\: 1\le j_l\le n \text{ for all } 1\le l\le
k\\ j_l\neq j_{l'} \text{ if }l\neq l' \endSb a(j_1,\dots, j_k)
\e_{j_1}\cdots \e_{j_k} \tag12.8
$$
with the help of some real numbers $a(j_1,\dots,j_k)$ which are given
for all sets of indices such that $1\le j_l\le n$, $1\le l\le k$, and
$j_l\neq j_{l'}$ if $l\neq l'$. Put
$$
S^2=\sum\Sb (j_1,\dots, j_k)\: 1\le j_l\le n \text{ for all } 1\le l\le
k\\ j_l\neq j_{l'} \text{ if }l\neq l' \endSb a^2(j_1,\dots, j_k).
\tag12.9
$$
Then
$$
P(|Z|>u)\le C \exp\left\{-B\(\frac uS\)^{2/k}\right\} \quad\text{for
all }u\ge 0 \tag12.10
$$
with some constants $B>0$ and $C>0$ depending only on the parameter
$k$. Relation (12.10) holds for instance with the choice
$B=\frac k{2e(k!)^{1/k}}$ and $C=e^k$.}
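For a small number of variables the statement can be confirmed by exact enumeration. In the following Python sketch $n=4$, $k=2$ and Gaussian coefficients are arbitrary illustrative choices; the tail probability $P(|Z|>u)$ is computed exactly over all $2^n$ sign sequences and compared with the bound (12.10), with the constants $B=\frac k{2e(k!)^{1/k}}$ and $C=e^k$ named in the theorem:

```python
import math
import random
from itertools import product

n, k = 4, 2
random.seed(1)
a = {(i, j): random.gauss(0, 1) for i in range(n) for j in range(n) if i != j}
S = math.sqrt(sum(c * c for c in a.values()))               # S^2 from (12.9)

signs = list(product((-1, 1), repeat=n))
zs = [abs(sum(c * e[i] * e[j] for (i, j), c in a.items())) for e in signs]

B = k / (2 * math.e * math.factorial(k) ** (1 / k))
C = math.e**k
for u in (0.5 * S, S, 2 * S, 4 * S):
    tail = sum(z > u for z in zs) / len(zs)                 # exact P(|Z| > u)
    assert tail <= C * math.exp(-B * (u / S) ** (2 / k))
```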
\medskip\noindent
{\it Proof of Theorem 12.3.}\/ We get with the help of formula (10.4)
(which is a consequence of Borell's inequality) that
$$
E|Z|^q\le(q-1)^{kq/2} \(EZ^2\)^{q/2} \le q^{kq/2} \(EZ^2\)^{q/2}=
q^{kq/2} \bar S^q
$$
with
$$
\bar S^2=\sum_{1\le j_1<\dots<j_k\le n}\bar a^2(j_1,\dots,j_k)=EZ^2,
\qquad \bar a(j_1,\dots,j_k)=\sum_{\pi\in\Pi_k}a(j_{\pi(1)},\dots,j_{\pi(k)}).
$$
The Cauchy--Schwarz inequality yields that $\bar a^2(j_1,\dots,j_k)\le
k!\sum_{\pi\in\Pi_k}a^2(j_{\pi(1)},\dots,j_{\pi(k)})$, hence
$\bar S^2\le k!S^2$, and $E|Z|^q\le\(q^{k/2}(k!)^{1/2}S\)^q$ for all
$q\ge2$. Then the Markov inequality $P(|Z|>u)\le u^{-q}E|Z|^q$ with
the choice $q=\frac1{e(k!)^{1/k}}\(\frac uS\)^{2/k}$ supplies the
bound $P(|Z|>u)\le e^{-kq/2}=\exp\left\{-\frac
k{2e(k!)^{1/k}}\(\frac uS\)^{2/k}\right\}$ if $q\ge2$. If $q<2$,
then $\frac k{2e(k!)^{1/k}}\(\frac uS\)^{2/k}<k$, and the right-hand
side of (12.10) with $C=e^k$ is greater than~1. Hence relation
(12.10) holds with $B=\frac k{2e(k!)^{1/k}}$ and $C=e^k$.
Theorem 12.3 is proved.
Theorem~16.1 yields the following estimate: if $0\le u\le
\e n^{k/2}\sigma^{k+1}$ with some number $\e>0$, then
$P(k!n^{-k/2}|I_{n,k}(f)|>u)\le A\exp\left\{-\frac{1-C\e^{1/k}}2
\(\frac u\sigma\)^{2/k}\right\}$ with some constants $A>0$
and $C>0$ depending only on the order~$k$ of the $U$-statistic
$I_{n,k}(f)$. This result is very similar to Theorem~8.3. Both
theorems yield an estimate on the probability
$P(k!n^{-k/2}|I_{n,k}(f)|>u)$ for $0\le u\le n^{k/2}\sigma^{k+1}$,
but in the present result we also get a good estimate on the constant
$\alpha$ in formula (8.9) for $0\le u\le \e n^{k/2}\sigma^{k+1}$.
At first sight this additional result does not seem to be an essential
improvement, but actually it expresses an important property of the
estimate (16.1). To understand this it is worthwhile to compare
Theorem~16.1 with Bernstein's inequality formulated in Theorem~3.1.
Theorem~3.1 implies the estimate
$$
P(n^{-1/2}|I_{n,1}(f)|>u)\le2e^{-Cu^2/\sigma^2}
\quad\text{if } 0\le u\le \sqrt n\sigma^2 \tag16.2
$$
for the degenerate $U$-statistic $I_{n,1}(f)$ of order 1 with a
kernel function $f$ (i.e.\ for a sum of iid.\ random variables
$f(\xi_j)$) if the relations $\sup |f(x)|\le 1$, $Ef(\xi_j)=0$
and $Ef^2(\xi_j)\le\sigma^2$ hold. Beside this, relation~(16.2)
also holds with a constant of the form $C=\frac{1-O(\e)}2$ if
$0\le u\le\e \sqrt n\sigma^2$. On the other hand, Example~3.2 provides
(formulated with a different normalization) a function
$f$ and a sequence of iid.\ random variables $\xi_1,\xi_2,\dots$
satisfying the above conditions such that
$$
P(n^{-1/2}I_{n,1}(f)>u)\ge A\exp\left\{-B\(\frac
u\sigma\)^2\cdot\frac{\sqrt n\sigma^2}u \log\frac u{\sqrt
n\sigma^2}\right\}
$$
if $u\gg \sqrt n\sigma^2$. This means that in the
special case $k=1$ the probability $P(n^{-1/2}|I_{n,1}(f)|>u)$ has
a Gaussian type estimate for $0\le u\le\const \sqrt n\sigma^2$, and
such an estimate does not hold for $u\gg \sqrt n\sigma^2$. Beside
this, in the smaller interval $0\le u\le\e \sqrt n\sigma^2$ we can
say more. In this case relation (16.2) holds with a constant $C$
which almost agrees with the number $\frac12$, i.e.\ the upper
bound we get for $k=1$ almost agrees with the quantity suggested
by a formal application of the central limit theorem.
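Let us recall the calculation behind this last statement. By the
central limit theorem $n^{-1/2}I_{n,1}(f)$ converges in distribution
to $\sigma\eta$ with a standard normal random variable $\eta$, and
$$
P(\sigma|\eta|>u)\le 2e^{-u^2/2\sigma^2}\quad\text{for all }u\ge0,
$$
while also $P(\sigma|\eta|>u)\ge\frac{\const}{\frac u\sigma+1}
e^{-u^2/2\sigma^2}$. Hence the constant $\frac12$ is the best
possible value of $C$ in the exponent of formula (16.2).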
I want to explain that Theorem~16.1 states a similar result for
degenerate $U$-statistics of any order~$k\ge1$. To understand this
let us first recall that a sequence of normalized degenerate
$U$-statistics $n^{-k/2}I_{n,k}(f)$, $n=1,2,\dots$, defined with the
help of a sequence of iid.\ random variables $\xi_1,\xi_2,\dots$
taking values on some measurable space $(X,\Cal X)$ with
distribution $\mu$ and a function $f(x_1,\dots,x_k)$ of $k$ variables
canonical with respect to $\mu$ and such that
$$
\sigma^2=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty
$$
has a limit distribution as $n\to\infty$. Moreover, this limit
can be expressed explicitly as the distribution of the
Wiener--It\^o integral
$$
Z_{\mu,k}(f)=\frac1{k!}\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots
\mu_W(\,dx_k), \tag16.3
$$
where $\mu_W$ is the white noise with counting measure $\mu$, i.e.\
$\mu_W(A)$, $A\in \Cal X$, is a Gaussian field indexed by the
measurable subsets of the space $X$ such that $E\mu_W(A)=0$
and $E\mu_W(A)\mu_W(B)=\mu(A\cap B)$ for all $A,B\in\Cal X$. (The
definition of Wiener--It\^o integrals can be found e.g. in~[17].)
Hence it is natural to expect that in the estimates about the
distribution of degenerate $U$-statistics
the distributions of Wiener--It\^o integrals play a role similar to
the Gaussian distributions in the case $k=1$. Therefore we are
interested in good estimates on the distribution of Wiener--It\^o
integrals. The next result supplies such an estimate. As Theorem~16.1
was an improvement of Theorem~8.3, the next result is an improvement
of the first estimate in~Theorem~8.5 presented in formula~(8.11).
\medskip\noindent
{\bf Theorem 16.2.} {\it Let us consider a $\sigma$-finite measure
$\mu$ on a measurable space $(X,\Cal X)$ together with a white
noise $\mu_W$
with counting measure $\mu$. Let us have a real-valued function
$f(x_1,\dots,x_k)$ on the space $(X^k,\Cal X^k)$ which satisfies
relation (8.2). Take the random integral $Z_{\mu,k}(f)$
introduced in formula (16.3). This random integral satisfies the
inequality
$$
P(k!|Z_{\mu,k}(f)|>u)\le C \exp\left\{-\frac12\(\frac
u\sigma\)^{2/k}\right\}\quad \text{for all } u>0 \tag16.4
$$
with an appropriate constant $C=C(k)>0$ depending only on the
multiplicity $k$ of the integral.}
\medskip
In Theorem 16.2 we gave only an upper bound for the distribution of
Wiener--It\^o integrals. The following example shows that there are
cases when this estimate is essentially sharp.
\medskip\noindent
{\bf Example 16.3.} {\it Let us have a $\sigma$-finite measure $\mu$
on some measure space $(X,\Cal X)$ together with a white noise $\mu_W$
on $(X,\Cal X)$ with counting measure~$\mu$. Let $f_0(x)$ be a real
valued function on $(X,\Cal X)$ such that $\int f_0(x)^2\mu(\,dx)=1$,
and take the function $f(x_1,\dots,x_k)=\sigma f_0(x_1)\cdots f_0(x_k)$
with some number $\sigma>0$ and the Wiener--It\^o integral
$Z_{\mu,k}(f)$ introduced in formula (16.3).
Then the relation
$\int f(x_1,\dots,x_k)^2\,\mu(\,dx_1)\dots\,\mu(\,dx_k)=\sigma^2$
holds, and the random integral $Z_{\mu,k}(f)$ satisfies the inequality
$$
P(k!|Z_{\mu,k}(f)|>u)\ge \frac{\bar C}{\(\frac u\sigma\)^{1/k}+1}
\exp\left\{-\frac12\(\frac u\sigma\)^{2/k}\right\}\quad
\text{for all } u>0 \tag16.5
$$
with some constant $\bar C>0$.}
\medskip\noindent
{\it Proof of the statement of Example 16.3.}\/ We may restrict our
attention to the case $k\ge2$. It\^o's formula (see e.g.~[17]) states
that the random variable $k!Z_{\mu,k}(f)$ can be expressed as
$k!Z_{\mu,k}(f)=\sigma H_k\(\int f_0(x)\mu_W(\,dx)\)=\sigma
H_k(\eta)$, where $H_k(x)$ is the $k$-th Hermite polynomial with leading
coefficient~1, and $\eta=\int f_0(x)\mu_W(\,dx)$ is a standard normal
random variable. Hence we get by exploiting that the coefficient of
$x^{k-1}$ in the polynomial $H_k(x)$ is zero that
$P(k!|Z_{\mu,k}(f)|>u)=P(|H_k(\eta)| \ge\frac u\sigma)\ge
P\(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\)$ with a sufficiently large
constant $D>0$ if $\frac u\sigma>1$. There exist positive
constants $A$ and $B$ such that $P\(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\)
\ge P\(|\eta^k|>\frac u\sigma+A\(\frac u\sigma\)^{(k-2)/k}\)$ if
$\frac u\sigma>B$.
Hence
$$
P(k!|Z_{\mu,k}(f)|>u)\ge
P\(|\eta|>\(\frac u\sigma\)^{1/k}\(1+A\(\frac u\sigma\)^{-2/k}\)\)
\ge \frac{\bar C \exp\left\{-\frac12\(\frac u\sigma\)^{2/k}\right\}}
{\(\frac u\sigma\)^{1/k}+1}
$$
with an appropriate $\bar C>0$ if $\frac u\sigma>B$. Since
$P(k!|Z_{\mu,k}(f)|>0)>0$, the above inequality also holds
for $0\le \frac u\sigma\le B$ if the constant $\bar C>0$ is chosen
sufficiently small. This means that relation (16.5) holds.
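In the case $k=1$ no calculation with Hermite polynomials is needed.
In this case $H_1(x)=x$, hence
$$
Z_{\mu,1}(f)=\sigma\int f_0(x)\mu_W(\,dx)=\sigma\eta
$$
with a standard normal random variable $\eta$, and relation (16.5)
is the standard lower bound $P(|\eta|>t)\ge\frac{\bar C}{t+1}
e^{-t^2/2}$ for the Gaussian tail probabilities with $t=\frac u\sigma$.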
Let us remark that if $f(x_1,\dots,x_k)=\sigma f_0(x_1)\dots
f_0(x_k)$ is a function on the space $(X^k,\Cal X^k)$ such that
$\int f_0(x)\mu(\,dx)=0$, $\int f_0^2(x)\mu(\,dx)=1$,
$\sup |f_0(x)|\le 1$, $0<\sigma\le1$, and we have a sequence of iid.
random variables, $\xi_1,\xi_2,\dots$ with distribution $\mu$, then
the $U$-statistics $I_{n,k}(f)$, $n=1,2,\dots$, are degenerate, and
they satisfy the conditions of Theorem~16.1. Beside this, they
converge in distribution to the Wiener--It\^o integral
$Z_{\mu,k}(f)$ as $n\to\infty$ which satisfies the conditions of
Example~16.3. Hence the $U$-statistics $I_{n,k}(f)$ satisfy
relation (16.1), and also the inequality
$$
\lim_{n\to\infty} P(k!n^{-k/2}|I_{n,k}(f)|>u)\ge
\frac{\bar C\exp\left\{-\frac12\(\frac u\sigma\)^{2/k}\right\}}
{\(\frac u\sigma\)^{1/k}+1}
$$
holds with an appropriate $\bar C>0$ if $\frac u\sigma>B$. This
means that for not too large values of $u$, more explicitly if
$u\le\e n^{k/2}\sigma^{k+1}$ with a small number $\e>0$, the estimate
given in Theorem~16.1 is essentially sharp. Let me also remark that
Example~8.6 shows such a degenerate $U$-statistic of order $k=2$
for which an estimate similar to that of Theorem~8.3 cannot hold
for $u\gg n^{k/2}\sigma^{k+1}$. We have presented such an example
only for $k=2$, but similar examples can be given for all~$k\ge1$.
This means that Theorem~16.1 shows a similar picture about the
distribution of degenerate $U$-statistics of order~$k$ for all
$k\ge1$ as Bernstein's inequality shows in the case $k=1$. We have
a good estimate on the distribution $P(n^{-k/2}I_{n,k}(f)>u)$ of a
degenerate $U$-statistic with a kernel function $f$ satisfying
relations (8.1) and (8.2) in the domain $0\le u\le
n^{k/2}\sigma^{k+1}$. Such an estimate is already proved in
Theorem~8.3, but Theorem~16.1 says more in an interval of the
form $0\le u\le \e n^{k/2}\sigma^{k+1}$ with a small $\e>0$. The
limit theorems about degenerate $U$-statistics give an upper bound
for the coefficient $\alpha$ in the exponent of formula (8.9) in
Theorem~8.3, and Theorem~16.1 states that the estimate (8.9) holds
with an almost as good coefficient $\alpha$ in the interval
$0\le u\le\e n^{k/2}\sigma^{k+1}$ as this upper bound suggests.
The proofs of the above results are based, similarly to the proofs of
Theorems~8.3 and~8.5, on some good estimates on high moments of
degenerate $U$-statistics $I_{n,k}(f)$ and of Wiener--It\^o integrals
$Z_{\mu,k}(f)$. The result of Theorem~16.2 follows from the following
\medskip\noindent
{\bf Proposition 16.4.} {\it Let the conditions of Theorem~16.2 be
satisfied for a multiple Wiener--It\^o integral $Z_{\mu,k}(f)$
of order~$k$. Then, with the notations of Theorem~16.2, the inequality
$$
E\(k!|Z_{\mu,k}(f)|\)^{2M}\le 1\cdot3\cdot5\cdots
(2kM-1)\sigma^{2M}\quad\text {for all }M=1,2,\dots \tag16.6
$$
holds.}\medskip
By the Stirling formula Proposition~16.4 implies that
$$
E(k!|Z_{\mu,k}(f)|)^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!}\sigma^{2M}\le
A\(\frac2e\)^{kM}(kM)^{kM}\sigma^{2M} \tag16.7
$$
for all numbers $A>\sqrt2$ if $M\ge M_0=M_0(A)$. The following
Proposition~16.5 states a similar, but weaker inequality for the
moments of normalized degenerate $U$-statistics.
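Let us indicate the calculation leading to formula (16.7). By the
Stirling formula
$$
\frac{(2kM)!}{2^{kM}(kM)!}\sim\frac{\sqrt{4\pi kM}\,
\(\frac{2kM}e\)^{2kM}}{2^{kM}\sqrt{2\pi kM}\,\(\frac{kM}e\)^{kM}}
=\sqrt2\(\frac2e\)^{kM}(kM)^{kM}\quad\text{as } M\to\infty,
$$
hence the second inequality in (16.7) holds for any $A>\sqrt2$ if
$M\ge M_0(A)$ with a sufficiently large threshold $M_0(A)$.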
\medskip\noindent
{\bf Proposition 16.5.} {\it Let us consider a degenerate
$U$-statistic $I_{n,k}(f)$ of order $k$ with sample size $n$ and
with a kernel function $f$ satisfying relations (8.1) and (8.2) with
some $0<\sigma^2\le1$. Fix a positive number $\eta>0$. There exist
constants $A=A(k)>\sqrt2$, $C=C(k)>0$ and
$M_0=M_0(k)\ge1$ depending only on the order of the $U$-statistic
$I_{n,k}(f)$ such that
$$
\aligned
E\(n^{-k/2}k!I_{n,k}(f)\)^{2M}&\le A\(1+C\sqrt\eta\)^{2kM}
\(\frac2e\)^{kM}\(kM\)^{kM}\sigma^{2M} \\
&\qquad \text{for all integers } M \text{ such that } kM_0\le kM\le
\eta n\sigma^2.
\endaligned \tag16.8
$$
The constant $C=C(k)$ in formula (16.8) can be chosen e.g. as
$C=2\sqrt2$ which does not depend on the order $k$ of the
$U$-statistic $I_{n,k}(f)$.}\medskip
Let us remark that formula (16.6) can be reformulated as
$E(k!|Z_{\mu,k}(f)|)^{2M}\le E(\sigma\eta^k)^{2M}$, where $\eta$
is a standard normal random variable. Theorem~16.2 states that
the tail distribution of $k!|Z_{\mu,k}(f)|$ satisfies an
estimate similar to that of $\sigma|\eta|^k$. This follows simply
from Proposition~16.4 and the Markov inequality
$P(k!|Z_{\mu,k}(f)|>u)\le \frac{E(k!|Z_{\mu,k}(f)|)^{2M}}{u^{2M}}$
with an appropriate choice of the parameter~$M$.
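Indeed, by formula (16.7) the Markov inequality yields the bound
$$
P(k!|Z_{\mu,k}(f)|>u)\le A\(\frac{2kM}e\)^{kM}\(\frac\sigma u\)^{2M}
=A\exp\left\{kM\(\log\frac{2kM}e-\log\(\frac u\sigma\)^{2/k}\)\right\},
$$
and the choice of the integer $M$ for which $2kM$ is closest to
$\(\frac u\sigma\)^{2/k}$ supplies the estimate (16.4) with the
exponent $-kM\sim-\frac12\(\frac u\sigma\)^{2/k}$. (For small values
of $\frac u\sigma$ the constant $C$ in (16.4) makes the inequality
trivial.)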
Proposition 16.5 states that in the case $M_0\le M\le\e n\sigma^2$
the inequality
$$
E\(n^{-k/2}k!I_{n,k}(f)\)^{2M}\le E((1+\beta(\e))\sigma\eta^k)^{2M}
$$
holds with a standard normal random variable $\eta$ and a function
$\beta(\e)$, $0\le\e\le1$, such that $\beta(\e)\to0$ if $\e\to0$,
and $\beta(\e)\le C$ with some universal constant $C=C(k)$ if
$0\le\e\le1$. This means that certain high but not too high moments
of $n^{-k/2}k!I_{n,k}(f)$ behave similarly to the moments of
$k!Z_{\mu,k}(f)$. As a consequence, we can prove a similar, but
slightly weaker estimate for the distribution of
$n^{-k/2}k!I_{n,k}(f)$ as for the distribution of
$k!Z_{\mu,k}(f)$. Actually this is done in the proof of
Theorem~16.1.
Estimate (16.8) is very similar to the bound (10.1) formulated in
Proposition~10.1. The main difference is that here we get the
estimate
$$
E\(n^{-k/2}k!I_{n,k}(f)\)^{2M}\le C^M (kM)^{kM}\sigma^{2M} \tag16.9
$$
with a good constant $C$, at least if $M\le\e n\sigma^2$ with a
small number $\e>0$. The method of proof of Theorem~8.3 presented
in this paper cannot yield such a good estimate. The main problem
with this method is that it applies a symmetrization argument
(this is done in the proof of the Marcinkiewicz--Zygmund inequality),
in which we bound the moments of the random variable we are
investigating by the moments of a random variable with a constant
times larger variance. Such a step in the proof does not allow us to
get the estimate~(16.9) with a good constant~$C>0$.
On the other hand, the estimation of the moments of a degenerate
$U$-statistic by means of the diagram formula yields a better
estimate of the moments. The idea behind this approach is that in
calculating the even moments $E\(I_{n,k}(f)\)^{2M}$ of a degenerate
$U$-statistic by means of the diagram formula we have to work with
some terms which also appear in the calculation of the moments
$E(Z_{\mu,k}(f))^{2M}$ of the Wiener--It\^o integral~$Z_{\mu,k}(f)$,
but we also have to handle some additional terms. It must be checked
that the contribution of these additional terms is not too large.
This is the case if $M\le n\sigma^2$ with $\sigma^2=\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)$, and an even better
estimate can be given about the contribution of these terms if
$M\le\e n\sigma^2$ with a small~$\e>0$.
Let me finally remark that the above method can also give an
improvement of the multivariate version of the Hoeffding inequality
(Theorem 12.3). The proof of the following inequality can be found
in~[22].
\medskip\noindent
{\bf Theorem 16.6. The multivariate version of Hoeffding's
inequality.} {\it Let $\e_1$,\dots, $\e_n$ be independent random
variables, $P(\e_j=1)=P(\e_j=-1)=\frac12$, $1\le j\le n$. Fix a
positive integer~$k$, and define the random variable
$$
Z=\sum\Sb (j_1,\dots, j_k)\: 1\le j_l\le n \text{ for all } 1\le l\le
k\\ j_l\neq j_{l'} \text{ if }l\neq l' \endSb a(j_1,\dots, j_k)
\e_{j_1}\cdots \e_{j_k} \tag16.10
$$
with the help of some real numbers $a(j_1,\dots,j_k)$ which are given
for all sets of indices such that $1\le j_l\le n$, $1\le l\le k$, and
$j_l\neq j_{l'}$ if $l\neq l'$. Put
$$
S^2=\sum\Sb (j_1,\dots, j_k)\: 1\le j_l\le n \text{ for all } 1\le l\le
k\\ j_l\neq j_{l'} \text{ if }l\neq l' \endSb a^2(j_1,\dots, j_k)
\tag16.11
$$
Then
$$
P(k!|Z|>u)\le C \exp\left\{-\frac12\(\frac uS\)^{2/k}\right\}
\quad\text{for all }u\ge 0 \tag16.12
$$
with some constant $C>0$ depending only on the parameter
$k$.}
\medskip\noindent
We may assume that the coefficients $a(j_1,\dots,j_k)$ in formulas
(16.10) and (16.11) are symmetric functions of their arguments,
i.e.\ $a(j_1,\dots,j_k)=a(j_{\pi(1)},\dots,j_{\pi(k)})$ for all
permutations $\pi\in\Pi_k$ of the set $\{1,\dots,k\}$. If these
coefficients $a(j_1,\dots,j_k)$ do not have this symmetry
property, then we can replace them with their symmetrizations
$a_{\text{Sym}}(j_1,\dots,j_k)=\frac1{k!}\summ_{\pi\in \Pi_k}
a(j_{\pi(1)},\dots,j_{\pi(k)})$. In such a way we do not modify the
value of the random variable~$Z$, and we do not increase the value
of the number $S^2$. With such a choice of the coefficients we have
$EZ=0$ and $\Var Z=k!S^2$.
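The identity $\Var Z=k!S^2$ can be checked directly. The expectation
$E\e_{j_1}\cdots\e_{j_k}\e_{j_1'}\cdots\e_{j_k'}$ equals 1 if the
sets $\{j_1,\dots,j_k\}$ and $\{j_1',\dots,j_k'\}$ coincide, and it
equals zero otherwise, hence for symmetric coefficients
$$
EZ^2=\sum_{(j_1,\dots,j_k)}\sum_{\pi\in\Pi_k}a(j_1,\dots,j_k)
a(j_{\pi(1)},\dots,j_{\pi(k)})
=k!\sum_{(j_1,\dots,j_k)}a^2(j_1,\dots,j_k)=k!S^2,
$$
where the outer sums are taken over all sequences of different
indices $1\le j_l\le n$, $1\le l\le k$.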
The main advantage of this result with respect to Theorem~12.3
is that formula (16.12) holds with the right constant in the
exponent at the right-hand side. The proof is based on good
moment estimates of the random variable $Z$ defined in (16.10). I
formulate this result, which may be interesting in itself.
\medskip\noindent
{\bf Theorem 16.7.} {\it The random variable $Z$ defined in formula
(16.10) satisfies the inequality
$$
EZ^{2M}\le 1\cdot3\cdot5\cdots(2kM-1)S^{2M}\quad\text{for all }
M=1,2,\dots \tag16.13
$$
with the constant $S$ defined in formula (16.11).}
\medskip
It is worthwhile to compare formula (16.13) with the estimate that
Borell's inequality yields for this problem. By applying Borell's
inequality with the choice $q=2$ and $p=2M$ we get that $EZ^{2M}\le
(2M-1)^{kM}\(EZ^2\)^M=(2M-1)^{kM}\(k!S^2\)^M$. Since $(2M-1)^{kM}=
(2M)^{kM}\(1-\frac1{2M}\)^{kM}\sim e^{-k/2} (2M)^{kM}$ for large
values of~$M$, Borell's inequality yields the inequality
$EZ^{2M}\le\const(2M)^{kM}S^{2M}\cdot (k!)^M$ for large
exponents~$M$. On the other hand, Theorem~16.7 together with the
Stirling formula yield the estimate $EZ^{2M}\le\const
(2M)^{kM}S^{2M}\cdot\(\frac ke\)^{kM}$. It can be seen that
$k!>\(\frac ke\)^k$ for all $k\ge1$. This means that Theorem~16.7
yields an improvement of Borell's inequality in the special
case discussed above. This improvement concerns only a special case
of Borell's inequality, but this is its most important special case.
\beginsection 17. An overview of the results in this work
I discuss briefly the problems investigated in this work and
recall some basic results related to them. I also give a list
of works where they can be found. Beside this, I discuss
some background problems and results which may explain the
motivation for the study presented here.
I met the main problem considered in this work when I tried to
adapt the method of proof of the central limit theorem for
maximum-likelihood estimates to some more difficult questions about
so-called non-parametric maximum likelihood estimate problems.
The Kaplan--Meier estimate for the empirical distribution function
with the help of censored data investigated in the second section
is such a problem. It is not a maximum-likelihood estimate in the
classical sense, but it can be considered as a non-parametric
maximum likelihood estimate. Indeed, since in the estimation of a
distribution function with the help of censored data the class of
possible candidates for being the distribution function we are
looking for is too large, there is no dominating measure with
respect to which all of them have a density function. As a
consequence, the classical principle of the maximum-likelihood
estimate cannot be applied in this case. A natural way to overcome
this difficulty is to choose a smaller class of distribution
functions, to compare the probability of the appearance of the sample
we observe with respect to all distribution functions of this class
and to choose that distribution function as our estimate for which
this probability takes its maximum. The Kaplan--Meier estimate can
be found on the basis of this principle in the following way: Let
us estimate the distribution function $F(x)$ of the censored
data simultaneously with the distribution function $G(x)$ of the
censoring data. (We have a sample of size $n$ and know which
sample elements are censored and which are censoring data.) Let us
consider the class of such pairs of estimates $(F_n(x),G_n(x))$ of
the pair $(F(x),G(x))$ for which the distribution function $F_n(x)$
is concentrated in the censored sample points and the distribution
function $G_n(x)$ is concentrated in the censoring sample points;
more precisely, let us also assume that if the largest sample point
is a censored point, then the distribution function $G_n(x)$ of the
censoring data takes still another value which is larger than any
sample point, and if it is a censoring point then the distribution
function $F_n(x)$ of the censored data takes still another value
larger than any sample point. (This modification at the end of the
definition is needed, since if the largest sample point is from the
class of censored data, then the distribution $G(x)$ of the
censoring data in this point must be strictly less than~1, and if
it is from the class of censoring data, then the value of the
distribution function $F(x)$ of the censored data must be strictly
less than~1 in this point.) Let us take this class of pairs of
distribution functions $(F_n(x),G_n(x))$, and let us choose that
pair of distribution functions of this class as the (non-parametric
maximum likelihood) estimate with respect to which our observation
has the greatest probability.
The above extremal problem for the pairs of distribution functions
$(F_n(x),G_n(x))$ can be solved explicitly, and it yields the
estimate of $F_n(x)$ written down in formula~(2.3). (The function
$G_n(x)$ satisfies a similar relation, only the random variables $X_j$
and $Y_j$ and the events $\delta_j=1$ and $\delta_j=0$ have to be
interchanged in it.) Then, as I have indicated, a natural analog of the
linearization procedure in the maximum likelihood estimate also works
in this case, and there is only one really hard part of the proof.
We need a good estimate on the distribution of the integral of a
function of two variables with respect to the product of a normalized
empirical measure with itself. Moreover, we also need a good
estimate on the distribution of the supremum of a class of integrals,
when the elements of an appropriate class of functions are integrated
with respect to the above product measure. The main subject of this
work is to solve the above problems in a more general setting, when
not only two-fold, but also $k$-fold integrals are considered with
arbitrary number~$k\ge1$.
The proof given in this work for the limit behaviour of the
Kaplan--Meier estimate applied the explicit form of this estimate.
It would be
interesting to find such a modification of this proof which exploits
that the Kaplan--Meyer estimate is the solution of an appropriate
extremal problem. We may expect that such a proof can be generalized
to a general result about the limit behaviour for a wide class of
non-parametric maximum likelihood estimates. Such a consideration is
behind the remark of Richard Gill I quoted at the end of Section~2.
I hope that such a program can be realized, but at the present time
I cannot do this.
A detailed proof together with a sharp estimate on the speed of
convergence for the limit behaviour of the Kaplan--Meier
estimate based on the ideas presented in Section~2 is given
in paper [24]. Paper [25] explains more about its background, and it
also discusses the solution of some other non-parametric maximum
likelihood problems. The results about multiple integrals with
respect to a normalized empirical distribution function needed in
these works were proved in~[17]. The results of~[18] are completely
satisfactory for the study in~[24], but they also have some drawbacks.
They do not show that if the random integrals we are considering have
small variances, then they satisfy better estimates. Beside this, if
we consider the supremum of random integrals of an appropriate class
of functions, then these results can be applied only in very
special cases. Moreover, the method of proof of~[18] did not allow a
real generalization of its results, hence I had to find a
different approach when I tried to generalize them.
I do not know of other works where the distribution of multiple
random integrals with respect to a normalized empirical distribution
is studied. On the other hand, there are some works where the
distribution of (degenerate) $U$-statistics is investigated. The
most important results obtained in this field are contained in the
book of de la Pe\~na and Gin\'e {\it Decoupling, From Dependence to
Independence}\/~[6]. The problems about the behaviour of degenerate
$U$-statistics and multiple integrals with respect to a normalized
empirical distribution function are closely related, but the
explanation of their relation is far from trivial. I return to
this question later.
Even the study of the one-dimensional version of the problems studied
here, i.e. the description of the behaviour of one-fold integrals or
classes of one-fold integrals contains several hard problems which
have to be investigated closely to have a good understanding of the
subject. In the one-dimensional case it is fairly simple to prove
that the problems about the behaviour of one-fold integrals with
respect to a normalized empirical measure and about the behaviour of
normalized sums of independent random variables are equivalent. I
start this work with the description of the case of (classes of)
one-fold integrals or of sums of independent random variables. This
question has a fairly large literature. I would mention first of all the
books {\it A course on empirical processes}\/~[9], {\it Real Analysis
and Probability}\/~[10] and {\it Uniform Central Limit Theorems}\/~[11]
written by R.~M.~Dudley. These books contain a much more detailed
description of empirical processes than the present work, together
with many interesting results.
The first problem studied here deals with the tail behaviour of sums
of independent and bounded random variables with expectation zero.
This question is considered in Section~3 where the proof of two
already classical results, Bernstein's and Bennett's
inequalities, is explained. (These results are proved e.g.\ in~[4].)
We are also interested in the question when these results
give an estimate suggested by the central limit theorem. Bernstein's
inequality provides such an estimate if the variance of the sum is
not too small. (The results in Section~3 tell explicitly when this
variance should be considered too small.) If the variance of the
sum is too small, then Bennett's inequality provides a slight
improvement of Bernstein's inequality. On the other hand, Example~3.2
shows that in the unpleasant case when this variance is too small
Bennett's inequality is essentially sharp. I inserted this example
into the text, because it may help to understand better the content
of Bernstein's and Bennett's inequalities. I have not found similar
examples in the literature.
The estimate on the distribution of a sum of independent random
variables if this sum has a small variance is weak for the
following reason. In this case the probability that the sum will be
larger than a given value may be much larger than the (rather small)
value suggested by the central limit theorem because of the appearance
of some irregularities with relatively large probability. The hardest
problems we have to cope with in the solution of the problems of this
work are closely related to the weak estimates for sums of independent
random variables if the variance of the sums is small and to the
weak estimates in some similar problems. The weakness of these
estimates implies that in the study of the problems we are interested
in the method of proof for their Gaussian counterpart cannot be
adapted completely; some new ideas are needed. We have overcome
this difficulty by applying a symmetrization argument. The last
result of Section~3, Hoeff\-ding's inequality presented in
Theorem~3.4, is an important ingredient of this symmetrization
argument. It is also a classical result whose proof can be found for
instance in~[15].
In Section~4 I formulated the univariate version of our main result
about the supremum of the integrals of a class $\Cal F$ of functions
with respect to a normalized empirical measure together with an
equivalent statement about the distribution of the supremum of a class
of random sums $\summ_{j=1}^nf(\xi_j)$ defined with the help of a
sequence of i.i.d. random variables $\xi_1,\dots,\xi_n$ and a class
of functions $f\in\Cal F$ satisfying some appropriate conditions.
These results are given in Theorems~4.1 and~$4.1'$. Also a Gaussian
version of them is presented in Theorem~4.2 about the distribution of
the supremum of a Gaussian field with some appropriate properties.
In the above mentioned results we have imposed the condition that
the class of functions~$\Cal F$ or, equivalently, the set of
random variables whose supremum we estimate is countable. In the
proofs this condition is really exploited. On the other hand, in
some important applications we also need results about the
supremum of a possibly non-countable set of random variables.
Hence I introduced the notion of countably approximable
classes of random variables and proved that in the results of this
work the condition about countability can be replaced by the weaker
condition that the class of random variables whose supremum is taken
is countably approximable. R.~M.~Dudley worked out a different method
to handle the supremum of possibly non-countably many random variables,
and generally his method is applied in the literature. The relation
between these two methods deserves some discussion.
Let us first recall that if a class of random variables $S_t$,
$t\in T$, indexed by some index set $T$ is given, then a set $A$ can
be measurable with respect to the $\sigma$-algebra generated by the
random variables $S_t$, $t\in T$, only if there exists a countable
subset $T'=T'(A)\subset T$ such that the set $A$ is also measurable
with respect to the smaller $\sigma$-algebra generated by the random
variables $S_t$, $t\in T'$. Beside this, if the finite dimensional
distributions of the random variables $S_t$, $t\in T$, are given,
then by the results of classical measure theory the probability of
the events measurable with respect to the $\sigma$-algebra generated
by these random variables $S_t$, $t\in T$, is also determined.
But there are rather few other events whose probabilities are
determined by the finite dimensional distributions of the random
variables~$S_t$, $t\in T$. On the other hand, if $T$ is a
non-countable set, then the events $\left\{\supp_{t\in T}S_t>u\right\}$
are not measurable with respect to the above $\sigma$-algebra, hence
generally we cannot speak of their probabilities. To overcome this
difficulty Dudley worked out a theory which enabled him to work also
with outer measures. His theory is based on some rather deep results
of analysis. It can be found for instance in his book~[11].
I restricted my attention to the case when after the completion of
the probability measure $P$ we can also speak of the real (and not
only outer) probabilities $P\(\supp_{t\in T}S_t>u\)$. I tried to
find appropriate conditions under which these probabilities really
exist. More explicitly, we are interested in the case when for all
$u>0$ there exists some set $A=A_u$ measurable with respect to the
$\sigma$-algebra generated by the random variables $S_t$, $t\in T$,
such that the symmetric difference of the sets $A_u$ and
$\left\{\supp_{t\in T}S_t>u\right\}$ is contained in a set
measurable with respect to the $\sigma$-algebra generated by the
random variables $S_t$, $t\in T$, and has probability zero. In
such a case we can define also the probability $P\(\supp_{t\in
T}S_t>u\)$ as $P(A_u)$. This approach led me to the definition of
countably approximable classes of random variables. Its validity
enables us to speak about the probability of the event that
the supremum of the random variables we are interested in is
larger than some fixed value. I also proved a simple but
useful result in Lemma~4.3, which provides a condition for the
validity of this property.
The problem we met here is not an abstract, technical difficulty.
Indeed, the distribution of such a supremum can become different if
we modify each random variable on a set of probability zero,
although the joint distribution of the random variables we consider
remains the same after such an operation. Hence, if we are interested
in the supremum of a non-countable set of random variables with
prescribed joint distribution, we have to specify more explicitly
which version of this set of random variables we consider. It
is natural to look for such an appropriate version of the random
field $S_t$, $t\in T$, whose `trajectories' $S_t(\oo)$, $t\in T$,
have nice properties for all elementary events $\oo\in\Omega$.
Lemma~4.3 can be interpreted as a result in this spirit. The
condition given for the countable approximability of a class of
random variables at the end of this lemma can be considered as a
smoothness condition on the `trajectories' of the random field we
consider. This approach shows some analogy to an important problem in
the theory of stochastic processes, where a regular version of a
stochastic process is constructed and the smoothness properties of its
trajectories are investigated.

In our problems the version of the set of random variables $S_t$,
$t\in T$, we shall work with appears in a simple and natural
way. In these problems we have finitely many random variables
$\xi_1,\dots,\xi_n$ at the start, and all random variables
$S_t(\oo)$, $t\in T$, we are considering can be defined individually
for each $\oo$ as a functional of these random variables
$\xi_1(\oo),\dots,\xi_n(\oo)$. We take the version of the random
field $S_t(\oo)$, $t\in T$, we get in such a way and want to show
that it is countably approximable. In Section~4 we have proved this
property in an important model, probably the most important model for
the applications we are interested in. In more complicated situations,
when our random variables are not defined as functionals of finitely
many sample points, for instance when we define our set of random
variables by means of integrals with respect to a Gaussian field, it
is harder to find the right regular version of our sets of random
variables. In this case the integrals we consider are defined only
with probability~1, and some extra work is needed to find their right
version. At any rate, in the problems
we are interested in our approach is satisfactory for
our purposes, and it is simpler than that of Dudley; we do not have to
follow his rather difficult technique. On the other hand, I must
admit that I do not know the precise relation between the approach of
this work and that of Dudley.

In Section~4 the notion of $L_p$-dense classes, $1\le p<\infty$, is
also introduced. The notion of $L_2$-dense classes plays an important
role in the formulation of Theorems~4.1 and~$4.1'$. The notion of
$L_2$-dense classes can be considered as a version of the
$\e$-entropy discussed at many places in the literature. On the other
hand, there seems to be no unique definition of $\e$-entropy in the
literature. I introduced the term $L_2$-dense classes, because this
seems to be the appropriate notion in the study of this work. To
apply the results related to $L_2$-dense classes we also need some
knowledge about how to check this property in concrete models. For
this goal I discussed here Vapnik--\v{C}ervonenkis classes, a popular
and important notion of modern probability theory. Several books and
papers (see e.g.\ the books~[11], [28],~[30] and the references in
them) deal with this subject. An important result in this field is
Sauer's lemma (Lemma~5.2), which together with some other results,
like Lemma~5.3, implies that the classes of sets or functions
appearing in many interesting models are Vapnik--\v{C}ervonenkis
classes.
I put these results in the Appendix, partly because they can be
found in the literature, partly because in our investigation
Vapnik--\v{C}ervonenkis classes play a different and less important
role than at other places. In our discussion Vapnik--\v{C}ervonenkis
classes are applied to show that certain classes of functions are
$L_2$-dense. A result of Dudley formulated in Lemma~5.2
implies that a Vapnik--\v{C}ervonenkis class of functions with
absolute value bounded by a fixed constant is an $L_1$, hence also an
$L_2$-dense class of functions. The proof of this important result,
which seems to be less known even among experts of this subject than
it should be, is contained in the main text. Dudley's original result
was formulated in the special case when the functions we consider are
indicator functions of some sets, but its proof contained all
important ideas needed in the proof of Lemma~5.2.
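The combinatorial content of Sauer's lemma can be illustrated by a small computation. The following sketch is not part of the original argument; the class of half-lines $\{(-\infty,t]\}$ on the real line is a hypothetical example chosen for simplicity. It counts the subsets of an $n$-point set picked out by this class (whose VC dimension is~1) and compares this number with the bound $\sum_{i=0}^{d}\binom ni$ of Sauer's lemma.

```python
from math import comb

# Hypothetical example: the class of half-lines {(-inf, t] : t real}.
# Its VC dimension is 1: it shatters every one-point set, but no
# two-point set {x1 < x2}, since the subset {x2} alone cannot be
# picked out by a half-line.

def subsets_picked_out(points):
    """Distinct subsets of `points` of the form points ∩ (-inf, t]."""
    pts = sorted(points)
    picked = {frozenset()}
    for i in range(len(pts)):
        picked.add(frozenset(pts[:i + 1]))   # all initial segments
    return picked

def sauer_bound(n, d):
    """Sauer's lemma bound: sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 12, 1
points = list(range(n))
# Half-lines pick out exactly the n+1 initial segments, which matches
# the Sauer bound 1 + n for VC dimension d = 1.
assert len(subsets_picked_out(points)) == n + 1
assert len(subsets_picked_out(points)) <= sauer_bound(n, d)
```

The polynomial growth $\sum_{i\le d}\binom ni$ of the number of picked-out subsets, in contrast to the $2^n$ of a shattered class, is what makes such classes $L_2$-dense in the applications discussed above.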

Theorem 4.2, which is the Gaussian counterpart of Theorems~4.1
and~$4.1'$, is proved in Section~6 by means of a natural and
important technique, called the chaining argument. We apply an
inductive procedure, during which an appropriate sequence of finite
subsets of our set of random variables is defined, and try to give a
good estimate on the supremum of these subsets of our random
variables. The subsets we consider are denser and denser subsets of
the original set of random variables, and if they are constructed in
a clever way, then we get the result we want to prove by means of a
limiting procedure. In such a way we get a relatively simple proof of
Theorem 4.2, but this method is not strong enough to supply a complete
proof of Theorem~4.1. The cause of the weakness of the method in this
case is that we cannot give a good estimate on the probability that a
sum of independent random variables is greater than a prescribed value
if these random variables have too small variances. The chaining
argument supplies a result much weaker than what we want to
prove under the conditions of Theorem~4.1. Lemma~6.1 contains the
result the chaining argument yields under the conditions of
Theorem~4.1. In Section~6 still another result, Lemma~6.2, is
formulated, and it is also shown that Lemmas~6.1 and~6.2 together
imply Theorem~4.1. The proof is not difficult, despite some
non-attractive details. We have to check that the parameters in
Lemmas~6.1 and~6.2 can be fitted to each other.
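The pathwise inequality behind the chaining argument can be illustrated numerically. In the sketch below (an illustration only; a simulated random-walk path stands in for the random field, and the dyadic grids play the role of the denser and denser subsets mentioned above) the supremum over the finest grid is bounded by the supremum over the coarsest grid plus the sum over levels of the largest link increment.

```python
import random

random.seed(0)

# A simulated path on the dyadic grid of level J stands in for the
# random field; the inequality below holds for every path.
J = 10
N = 2 ** J
S = [0.0]
for _ in range(N):
    S.append(S[-1] + random.gauss(0.0, 1.0))   # S[i]: value at point i/N

# Chaining: every point of the finest grid is linked through coarser
# and coarser dyadic grids down to the two endpoints, hence the
# supremum over the finest grid is at most the supremum over the
# coarsest grid plus, level by level, the largest link increment.
chain_bound = max(S[0], S[N])
for j in range(1, J + 1):
    step = N // 2 ** j                         # spacing at level j
    links = [abs(S[i * 2 * step + step] - S[i * 2 * step])
             for i in range(2 ** (j - 1))]
    chain_bound += max(links)

assert max(S) <= chain_bound + 1e-12           # pathwise chaining bound
```

In the probabilistic argument each `max(links)` term is estimated by a tail bound for the increments, which is where Gaussian (or Bernstein-type) estimates enter.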

Lemma~6.2 is proved in Section~7. It is based on a symmetrization
argument. This proof applies the ideas of a paper of Kenneth
Alexander~[1], and although its presentation is essentially
different from Alexander's approach, it can be considered as a
version of his proof.

A similar problem should also be mentioned at this place.
M.~Talagrand wrote a series of papers about concentration
inequalities, and this research was also continued by some other
authors. I would mention the works of M.~Ledoux~[16] and
P.~Massart~[26]. Concentration inequalities give a bound on the
difference between the supremum of a set of appropriately
defined random variables and its expected value; they express how
strongly this supremum is concentrated around its expected value.
Such results are closely related to Theorem~4.1, and the discussion
of their relation deserves some attention. A typical concentration
inequality is the following result of Talagrand~[29].
\medskip\noindent
{\bf Theorem 17.1. (Theorem of Talagrand.)} {\it Consider $n$
independent and identically distributed random variables
$\xi_1,\dots,\xi_n$ with values in some measurable space $(X,\Cal X)$.
Let $\Cal F$ be some countable family of real-valued measurable
functions on $(X,\Cal X)$ such that $\|f\|_\infty\le b<\infty$ for
every $f\in\Cal F$. Let $Z=\supp_{f\in\Cal F}\summ_{i=1}^n f(\xi_i)$
and $v=E(\supp_{f\in\Cal F}\summ_{i=1}^n f^2(\xi_i))$. Then for
every positive number~$x$,
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac 1{K'}\frac
xb\log\(1+\frac{xb}v\)\right\}
$$
and
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac{x^2}{2(c_1v+c_2bx)}\right\},
$$
where $K$, $K'$, $c_1$ and $c_2$ are universal positive constants.
Moreover, the same inequalities hold when replacing $Z$ by $-Z$.}
\medskip
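The content of Theorem~17.1 can be illustrated by a small Monte Carlo experiment. The sketch below is only an illustration: the finite family of centered indicator functions and all numerical parameters are arbitrary choices, and only a crude consequence of concentration is checked, not the inequality with its universal constants.

```python
import random

random.seed(1)

# Illustrative setup: xi_i uniform on {0,...,9}, and F the finite
# family f_a(x) = 1{x == a} - 1/10, a = 0,...,9, so every f_a is
# centered and bounded by b = 1.  Z = sup_a sum_i f_a(xi_i).
n, reps = 1000, 300

def Z():
    counts = [0] * 10
    for _ in range(n):
        counts[random.randrange(10)] += 1
    return max(c - n / 10 for c in counts)

samples = [Z() for _ in range(reps)]
EZ = sum(samples) / reps                 # empirical stand-in for E Z

# Concentration: large deviations of Z above its expected value are
# rare; we only check this crude qualitative consequence.
tail = sum(z >= EZ + 30 for z in samples) / reps
assert tail < 0.1
```

The point made in the text remains visible here: any quantitative use of the theorem requires knowledge of $EZ$, which the simulation can only estimate.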
Theorem~17.1 yields, similarly to Theorem~4.1, an estimate about
the distribution of the supremum for a class of sums of independent
random variables. It can be considered as a generalization of
Bernstein's and Bennett's inequalities when the distribution of the
supremum of partial sums is estimated. A remarkable feature of this
result is that it assumes no condition about the structure of the
class of functions $\Cal F$ (like the $L_2$-dense property of the
class $\Cal F$ imposed in Theorem~4.1). On the
other hand, the estimates in Theorem~17.1 contain the quantity
$EZ=E\(\supp_{f\in\Cal F}\summ_{i=1}^n f(\xi_i)\)$. Such an
expectation of some supremum appears in all concentration
inequalities. As a consequence, they are useful only if we can
bound the expected value of an appropriate supremum. This
is a hard question in the general case, and this is the reason why
I preferred a direct proof of Theorem~4.1 without the application
of concentration inequalities. Let me remark that the condition
$u\ge\const\sigma\log^{1/2}\frac2\sigma$ with some appropriate
constant, which cannot be dropped from Theorem~4.1, is related to
the fact that the expected value of the supremum of the normalized
random sums considered in Theorem~4.1 has such a magnitude.

The main results of this work are presented in Section~8.
Theorem~8.3, which contains an estimate about the distribution of a
degenerate $U$-statistic, was first proved in a paper of Arcones and
Gin\'e in~[2]; its equivalent version about the multiple integrals
with respect to a normalized empirical measure, formulated in
Theorem~8.1, was proved in my paper~[19]. The equivalence of these
two results is not self-evident. Later I proved an improved version
of Theorem~8.3 in
paper~[21]. This result is formulated in Theorem~16.1, and it is
also compared with Theorem~8.3. It is also explained that
Theorem~16.1 could be considered the multivariate version of
Bernstein's inequality with more justification than Theorem~8.3. Here I
omitted its proof which applies a technique (diagram formulas
for the calculation of products of multiple random integrals or
degenerate $U$-statistics) not discussed in this work. Here
Theorem~8.3 was proved by means of a symmetrization argument.
The explanation of such a proof was simpler in the present work,
because it applies methods which were worked out in the
investigation of other problems. On the other hand, the application
of symmetrization arguments in the proof of Theorem~8.3 also has
some drawbacks. In certain problems, like the problem of
Theorem~8.3, this method cannot supply a really sharp result. Some
mathematicians working in
this field seem not to be aware of this fact.

It may be interesting to mention that the problem of Theorem~8.3 has
a natural generalization worthy of a closer study. We can consider
such generalized $U$-statistics in which the underlying random
variables $\xi_1,\dots,\xi_n$ are independent, but they need not
be identically distributed, and the $U$-statistic also may have a
more general form. Namely, we can take a class of kernel functions
$\bold f=\{f_{l_1,\dots,l_k}(x_1,\dots,x_k)\}$ on the space
$(X^k,\Cal X^k)$ with such an indexation that $1\le l_j\le n$,
$1\le j\le k$, and $l_j\neq l_{j'}$ if $j\neq j'$, and define with
the help of these independent random variables and class of kernel
functions the generalized $U$-statistic
$$
I_{n,k}(\bold f)=\sum\Sb 1\le l_j\le n,\; 1\le j\le k\\ l_j\neq l_{j'}
\text{ if }j\neq j'\endSb
f_{l_1,\dots,l_k}(\xi_{l_1},\dots,\xi_{l_k}). \tag17.1
$$
One can also naturally define generalized degenerate $U$-statistics.
We call a generalized $U$-statistic degenerate if for all sets of
indices $(l_1,\dots,l_k)$ in the sum (17.1) and for all $1\le j\le k$
$$
E(f_{l_1,\dots,l_k}(\xi_{l_1},\dots,\xi_{l_k})|\xi_{l_s},\;
s\in \{1,\dots,k\}\setminus\{j\})\equiv0.
$$
Generalized degenerate $U$-statistics can be considered as the
natural multivariate generalizations of sums of independent random
variables, just as degenerate $U$-statistics are the natural
multivariate generalizations of sums of iid.\ random variables. One
could also try to generalize Theorem~8.3 to an estimate on the
distribution of generalized degenerate $U$-statistics. One may hope
that the method of proof of Theorem~8.3 can also be applied for the
study of generalized degenerate $U$-statistics, just as the
distribution of sums of independent random variables can be
investigated similarly to the sums of iid.\ random variables.
Probably, the methods
worked out for the study of the problems related to Theorem~8.3 are
helpful, but in the study of generalized degenerate $U$-statistics
first some special questions have to be clarified. We have to find
the right form of the estimation about the distribution of a
generalized degenerate $U$-statistic. In particular, it must be
clarified which are the natural quantities by which we should
express this estimate.

It is natural to expect that generalized degenerate $U$-statistics
$I_{n,k}(\bold f)$ of order~$k$ (without normalization) satisfy an
inequality of the form
$$
P(|I_{n,k}(\bold f)|>u)\le
C\exp\left\{-\alpha\(\frac u{V_n}\)^{2/k}\right\}
$$
with some constants $\alpha=\alpha(k)>0$ and $C=C(k)>0$ in a
relatively large interval for the parameter~$u$, where $V^2_n$
denotes the variance of $I_{n,k}(\bold f)$. An essential problem is
to find a relatively good constant $C$ and to determine the interval
of the parameter~$u$ where this estimate holds. An example shows
that the probability $P\(|I_{n,2}(f)|> u\)$ can be bounded for a
degenerate $U$-statistic $I_{n,2}(f)$ of order~2 by the estimate
given in Theorem~8.3 only if $u\le\const n\sigma^3$, i.e.\ this
condition of Theorem~8.3 (in the case $k=2$) cannot be dropped.
Similar examples could be constructed for all $k\ge1$. The paper of
Arcones and Gin\'e~[2] contains another example explained by
Talagrand to the authors which also has a similar consequence.
On the other hand, this example does not exclude the possibility to
prove such a multi-dimensional version of Hoeffding's inequality
Theorem~3.3 which provides a slight improvement of Theorems~8.1
and~8.3 similarly to the improvement of Bernstein's inequality
provided by Hoeffding's inequality. Moreover, we can also expect
such a strengthened form of Theorems~8.2 and~8.4 (or of Theorem~4.2
in the one-dimensional case) which takes into account the above
improvements if the supremum of a nice class of random integrals or
degenerate $U$-statistics is considered. There is a hope that some
refinement of the methods of the present work would supply such
results. However, here we did not study this problem.

Theorems~9.2 and~9.3 deal with the properties of degenerate
$U$-statistics. This subject deserves special attention.
Degenerate $U$-statistics can be considered as the multivariate
version of sums of independent and identically distributed
random variables with expectation zero. Similarly, if $f$ is a
canonical function with respect to a measure $\mu$ and we put
independent $\mu$-distributed random variables into its arguments,
then the random variables we get in such a way can be considered as
the multivariate version of random variables with expectation zero.
The background of several proofs about the behaviour of
$U$-statistics can be better understood with the help of the above
remark. I tried to explain for instance that the proof of the
Hoeff\-ding decomposition of $U$-statistics (Theorem~9.1) is
actually a natural adaptation of the decomposition of a random
variable into the sum of a random variable with expectation zero
plus the expected value of the random variable.
Hoeff\-ding's decomposition is a fairly well-known result which
can be found for instance in the Appendix of~[12]. Theorem~9.1
slightly differs from the formulation of Hoeff\-ding's decomposition
one usually meets in the literature. It can be exploited that a
$U$-statistic does not change if we replace its kernel function by
its symmetrized version. Beside this, the value of the $U$-statistic
$I_{n,|V|}(f_V)$ does not change if we replace the kernel function
$f_V(x_{j_1},\dots,x_{j_{|V|}})$, $V=\{j_1,\dots,j_{|V|}\}$, by
$f_V(x_1,\dots,x_{|V|})$ in the Hoeffding decomposition (9.3) of the
$U$-statistic $I_{n,k}(f)$, and $f_V(x_1,\dots,x_{|V|})$ is also a
canonical function. The above observations enable us to unify the
contribution of all terms $I_{n,|V|}(f_V)$ with $|V|=l$ for some
$0\le l\le k$ into one degenerate $U$-statistic of order~$l$.
Generally, the formula obtained in such a way is called the
Hoeff\-ding decomposition in the literature. Nevertheless, we
have applied Theorem~9.1 in this work, because this form of the
Hoeffding decomposition was more convenient for us.
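The mechanism behind the Hoeff\-ding decomposition can be made explicit in the simplest case $k=2$. The following sketch is only an illustration (the three-point distribution $\mu$ and the kernel $f$ are arbitrary choices): it computes the decomposition $f(x,y)=m+g(x)+g(y)+h(x,y)$ with canonical $g$ and $h$, and checks the corresponding decomposition of the $U$-statistic on a fixed sample.

```python
# Illustrative ingredients: a three-point distribution mu on
# X = {0, 1, 2} and a symmetric kernel f on X x X.
X = [0, 1, 2]
mu = {0: 0.5, 1: 0.3, 2: 0.2}
f = lambda x, y: x * y + x + y

# Hoeffding decomposition of f: constant term, canonical function of
# one variable, canonical function of two variables.
m = sum(mu[x] * mu[y] * f(x, y) for x in X for y in X)   # E f(X, Y)
g = lambda x: sum(mu[y] * f(x, y) for y in X) - m        # E f(x, Y) - m
h = lambda x, y: f(x, y) - g(x) - g(y) - m               # canonical part

# Canonicity: integrating out either variable against mu gives zero.
assert abs(sum(mu[x] * g(x) for x in X)) < 1e-12
for x in X:
    assert abs(sum(mu[y] * h(x, y) for y in X)) < 1e-12

# Decomposition of the U-statistic on a fixed sample:
# sum_{l1 != l2} f = sum_{l1 != l2} h + 2(n-1) sum_l g + n(n-1) m.
xi = [0, 1, 2, 2, 0, 1, 0]
n = len(xi)
lhs = sum(f(xi[i], xi[j]) for i in range(n) for j in range(n) if i != j)
rhs = (sum(h(xi[i], xi[j]) for i in range(n) for j in range(n) if i != j)
       + 2 * (n - 1) * sum(g(x) for x in xi) + n * (n - 1) * m)
assert abs(lhs - rhs) < 1e-9
```

The two canonicity checks are exactly the degeneracy conditions discussed above, and the final identity is the $k=2$ instance of the decomposition into degenerate $U$-statistics of orders 0, 1 and 2.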
In our investigations it is important to know that if a function
satisfies a good $L_2$-norm or $L_\infty$-norm estimate, then the
elements of its Hoeff\-ding decomposition also have this property,
and if a class of functions is $L_2$-dense, then the same
relation holds for the classes of functions in the Hoeff\-ding
decomposition of the functions in this class. This is the content of
Propositions~9.2 and~9.3. The estimates on the $L_2$-norm given in
formulas~(9.7) and~(9.8) are actually reformulations of some
well-known facts about the properties of conditional expectations.
Theorem~9.4 enables us to reduce the estimates about multiple random
integrals with respect to normalized empirical measures to estimates
about degenerate $U$-sta\-tis\-tics. Such random integrals are
actually sums of $U$-statistics, and we can apply for each of these
$U$-statistics the Hoeff\-ding decomposition. Beside this, as we
consider multiple integrals with respect to a {\it normalized}\/
empirical measure we can expect that a lot of cancellations appear
during the calculation by which we express our random integral in the
form of a linear combination of degenerate $U$-statistics. We get such
a representation which enables us to reduce the estimates we want to
prove about multiple random integrals to analogous estimates about
degenerate $U$-statistics. This is the main content of Theorem~9.4
which can be considered as an analog of the Hoeff\-ding decomposition
for multiple stochastic integrals with respect to normalized empirical
measures. This representation of a multiple stochastic integral as
a linear combination of degenerate $U$-statistics of different
order also contains degenerate $U$-statistics of low order. But as
a consequence of the cancellation effects these $U$-statistics
are multiplied by small coefficients. The proof of Theorem~9.4 is
based on a good ``book-keeping'' of the different
contributions to the integral $J_{n,k}(f)$. An essential, although
less spectacular step of this ``book-keeping'' procedure is to express
the terms we are working with by means of the (signed) measures $\mu$
and $\mu^{(l)}-\mu$, i.e. the measures $\mu^{(l)}$ have to be replaced
by their normalizations $\mu^{(l)}-\mu$. The calculations needed in the
proof are quite natural, but unfortunately they contain some unpleasant
and complicated technical details.
Theorem~9.4 also has the consequence that the second moment of the
multiple random integral of a function with respect to a normalized
empirical measure can be bounded by constant times the $L_2$-norm of
the kernel function we integrate. The representation of the stochastic
integrals given in Theorem~9.4 may also contain a non-zero constant
term. This has the unexpected consequence that the expected value of
a multiple random integral with respect to a normalized empirical
measure can be non-zero. Our random integrals may show such an unusual
behaviour because the numbers of sample points falling into disjoint
sets are not independent random variables. But the dependence between
such random variables is very weak, and the expected value of the
random integrals we consider is sufficiently small.

From the pair of Theorems~8.1 and~8.3 I have proved only
Theorem~8.3, since its proof is simpler, and by the results of
Section~9 Theorem~8.1 follows from it. The proof of Theorem~8.3 is
different from its original proof published in paper~[2]. A good
estimate about the moments of degenerate $U$-statistics is presented
in Proposition~10.1, and Theorem~8.3 can be deduced from this
estimate. Actually the proof takes a detour: first a version of
Theorem~8.3, Theorem~$8.3'$, is proved, where an analogous estimate
is given for degenerate decoupled $U$-statistics.
The adjective `decoupled' refers to the fact that we put
independent copies of a sequence of iid.\ random variables in
different coordinates of the kernel function of the $U$-statistic.
The study of decoupled $U$-statistics is a popular subject of some
authors. In particular, the main subject of the book~[6] is a comparison
of the properties of $U$-statistics and decoupled $U$-statistics.
The study of decoupled $U$-statistics is simpler than that of usual
$U$-statistics, because the arguments applied in the study of usual
$U$-statistics can be applied for them, and beside this they also
satisfy a multivariate version of the Marcinkiewicz--Zygmund
inequality. On the other hand, the Marcinkiewicz--Zygmund inequality
does not hold for usual $U$-statistics, at least the proofs I know of
do not work for them. With the help of the multivariate version of
the Marcinkiewicz--Zygmund inequality and of Borell's inequality we
can prove an estimate about the moments of degenerate decoupled
$U$-statistics formulated in Proposition~$10.1'$. Theorem~$8.3'$ can
be deduced from Proposition~$10.1'$, and by a result of de la Pe\~na
and Montgomery--Smith formulated in Theorem~10.4, Theorem~$8.3'$
implies Theorem~8.3. The results applied in the proof of Theorem~8.3
are
proved in Section~11. Let me also remark that Proposition~10.1 is
not proved in this text, since we chose such an approach where we
do not need it. On the other hand, it follows from the results of
this work and some other standard results about $U$-statistics not
discussed in the present work.

I have mentioned the possibility of another proof of Theorem~8.3
by adapting the methods of the theory of Wiener--It\^o integrals
to this problem. In~[19] I gave a proof of Theorem~8.1 by means
of the so-called diagram method. Let me also remark that the method
of paper~[21] which yields an improvement of Theorem~8.3 presented in
Theorem~16.1 is actually a refinement of the method in~[19]. Both in
paper~[19] and in the present work the main step of the proof
consists of finding a good estimate on the moments of the random
variables we are investigating. It is enough to estimate the moments
of order $M=2^m$, where $m$ is a positive integer. For $m=1$ such
an estimate is known, and we can get an estimate for $m>1$ by means
of a recursive procedure. A similar approach is applied in~[19]
and in the present work. The main difference between them is in the
form of the recursive inequality between the moments of the
random variables we work with and the way we prove them.

I found the result about the multivariate version of the
Marcinkiewicz--Zygmund inequality in the book~[6], but the proof of
the result given here is different. Only the proof about the upper
estimate of the $p$-th moment of decoupled $U$-statistics is written
down. There is also an estimate in the opposite direction, but such
a result would be interesting for us only for the sake of some
orientation. Theorem~10.4 was proved by de la Pe\~na and
Montgomery--Smith in their paper~[7]. I formulated their result for
separable Banach space valued random variables, just as they did.
Such a general formulation of the results is very popular in the
literature, but here the discussion of Banach space valued random
variables had a different cause. I also wanted to prove formula
$(10.8')$, a result which is actually not contained in paper~[7].
(Book~[6] contains this result, but the proof is left to the reader.)
The simplest way to get this statement was to prove the original result
in Banach spaces, and to apply it in appropriate $L_\infty$ spaces.
Paper~[7] also contains some kind of converse result of~Theorem~10.4,
but as we do not need it I omitted its discussion.
This work contains the proof of de la Pe\~na and Montgomery--Smith
for Theorem~10.4, but I have explained it in my own style. In
particular, I worked out some details where the authors gave only a
very short explanation. This proof is given in the Appendix.

The proof of Borell's inequality is closely related to that of
Nelson's inequality. Edward Nelson published the inequality named
after him in his paper~[27]. He also showed that the general
inequality presented in Appendix~C can be reduced to the inequality
given in formula~(C1) or in Proposition~C2 of this work. This
reduction follows actually from our Theorem~11.2. However, this
observation did not help him to find a proof, and finally he gave a
proof without its application. Borell's inequality can also be
reduced to a one-dimensional statement formulated in Theorem~11.3.
This seems to be a simple inequality, but its proof is surprisingly
hard. Actually, in this work it is enough to prove this inequality
in the special case $q=2$ and $p=2k$, $k=1,2,\dots$. Moreover, as I
mentioned in connection with Theorem~16.6, Borell's inequality can
be proved in this special case with better constants. (See
paper~[22].)
In the proof of Theorem~11.3 I have followed the paper of Leonhard
Gross {\it Logarithmic Sobolev inequalities}\/~[13]. Gross has
worked out a general theory and he could prove both Nelson's
and Borell's inequality (more precisely an estimate which simply
implies this result) with its help. Gross' method and results are
interesting, because they are very useful in several parts of
mathematics. (See e.g.\ [16] or~[14].) Let me also remark that
similar results and ideas also appeared in an earlier work of
A.~Bonami~[5].
Gross introduced a so-called logarithmic Sobolev inequality related
to Markov processes and showed that it implies another inequality,
which is in the case of a Wiener process Nelson's inequality, while
we can define such a simple Markov process for which the logarithmic
Sobolev inequality corresponding to it yields the proof of
Theorem~11.3. This Markov process is explicitly described in
Section~11, and the logarithmic Sobolev inequality corresponding to
it is also formulated and proved there. Actually Gross showed that
each logarithmic Sobolev inequality is equivalent to the inequality
he proved as its consequence. On the other hand, the proof of the
logarithmic Sobolev inequalities is less difficult than a direct
proof of the inequalities he has obtained as their consequence.
The name `logarithmic Sobolev inequality' has the following
explanation. Generally one calls `Sobolev inequality' such
inequalities where for some pairs of numbers $1\le q<p\le\infty$ the
$L_p$-norm of a function is bounded by means of the $L_q$-norms of
the function and of its derivatives. In a logarithmic Sobolev
inequality there is no improvement in the exponent; instead, a term
containing a logarithmic factor appears in the estimate.

Theorem~10.4 will be proved by showing the following comparison
between the distribution of a generalized $U$-statistic
$I_{n,k}(f(\ell))$ and that of its decoupled version
$\bar I_{n,k}(f(\ell))$:
$$
P\(\|I_{n,k}(f(\ell))\|>u\)\le AP\(\|\bar I_{n,k}(f(\ell))\|>
\gamma u\)
\tag10.8b
$$
with some constants $A=A(k)$ and $\gamma=\gamma(k)$ depending only on
the order of these $U$-statistics.
To prove relation (10.8b) first we verify the following statement.
Let us take two independent copies $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$
and $\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ of our original sequence of
random variables $\xi_1,\dots,\xi_n$ and introduce for all sets
$V\subset \{1,\dots,k\}$ the function $\alpha_V(j)$, $1\le j\le k$,
defined as $\alpha_V(j)=1$ if $j\in V$ and $\alpha_V(j)=2$ if
$j\notin V$. Let us define with the help of these quantities the
decoupled generalized $U$-statistics
$$
I_{n,k,V}(f(\ell))=\frac1{k!}\summ\Sb 1\le l_j\le n,\; j=1,\dots,
k\\ l_j\neq l_{j'} \text{ if } j\neq j'\endSb \!\!\!\!
f_{l_1,\dots,l_k}
\(\xi_{l_1}^{(\alpha_V(1))},\dots,\xi_{l_k}^{(\alpha_V(k))}\)
\quad \text{for all }V\subset \{1,\dots,k\}. \tag B1
$$
The following inequality will be proved: There are some constants
$C_k>0$ and $D_k>0$ depending only on the order $k$ of the generalized
$U$-statistic $I_{n,k}(f(\ell))$ such that for all numbers $u>0$
$$
P\(\|I_{n,k}(f(\ell))\|>u\)\le
\sum_{V\subset\{1,\dots,k\},\,1\le|V|\le k-1} C_kP\(D_k\|
I_{n,k,V}(f(\ell))\|>u\). \tag B2
$$
Here $|V|$ denotes the cardinality of the set $V$, and the condition
$1\le |V|\le k-1$ in the summation of formula (B2) means that we omit
the sets $V=\emptyset$ and $V=\{1,\dots,k\}$ from the summation, i.e.
the cases when either $\alpha_V(j)=1$ for all $1\le j\le k$ or
$\alpha_V(j)=2$ for all $1\le j\le k$ are not considered in this
sum. Formula (10.8b) can be deduced from formula~(B2) by means of a
relatively simple inductive argument. In the proof of
formula~(B2) we shall apply the following simple lemma.
\medskip\noindent
{\bf Lemma B1.} {\it Let $\xi$ and $\eta$ be two independent and
identically distributed random variables taking values on a separable
Banach space~$B$. Then
$$
3P\(|\xi+\eta|>\frac 23u\)\ge P(|\xi|>u)\quad \text{for all }u>0.
$$
}\medskip\noindent
{\it Proof of Lemma B1.}\/ {\it Let $\xi$, $\eta$ and $\zeta$ be
three independent, identically distributed random variables taking
values in~$B$. Then
$$
\align
3P\(|\xi+\eta|>\frac23 u\)&=P\(|\xi+\eta|>\frac23 u\)+
P\(|\xi+\zeta|>\frac23 u\)+P\(|-(\eta+\zeta)|>\frac23 u\)\\
&\ge P(|\xi+\eta+\xi+\zeta-\eta-\zeta|>2u)=P(|\xi|>u).
\endalign
$$
}\medskip
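Lemma~B1 can be checked exactly for a small example. The sketch below is only an illustration: a discrete real-valued distribution plays the role of the Banach space valued one, i.e.\ $B$ is the real line, and the probabilities are computed by exact enumeration.

```python
from itertools import product

# Arbitrary illustrative discrete distribution on the real line.
dist = {-2.0: 0.2, 0.5: 0.5, 3.0: 0.3}

def tail_single(u):
    """P(|xi| > u)."""
    return sum(p for x, p in dist.items() if abs(x) > u)

def tail_pair(u):
    """P(|xi + eta| > u) for independent copies xi, eta."""
    return sum(p * q
               for (x, p), (y, q) in product(dist.items(), dist.items())
               if abs(x + y) > u)

# Lemma B1: 3 P(|xi + eta| > 2u/3) >= P(|xi| > u) for all u > 0.
for u in [0.1 * i for i in range(1, 80)]:
    assert 3 * tail_pair(2 * u / 3) + 1e-12 >= tail_single(u)
```

The enumeration confirms the inequality on a grid of values of $u$; the lemma of course asserts it for every distribution and every $u>0$.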
To prove formula (B2) let us introduce the random variable
$$
T_{n,k}(f(\ell))=\frac1{k!}\summ\Sb 1\le l_j\le n, s_j=1 \text{ or
}s_j=2,\; j=1,\dots, k\\ l_j\neq l_{j'} \text{ if } j\neq j'\endSb
\!\!\!\!\!\!\!\!\!\!\!
f_{l_1,\dots,l_k}\(\xi_{l_1}^{(s_1)},\dots,\xi_{l_k}^{(s_k)}\)
= \!\!\! \sum_{V\subset\{1,\dots,k\}}\!\!\!\!\!
I_{n,k,V}(f(\ell)), \tag B3
$$
and observe that the random variables $I_{n,k}(f(\ell))$,
$I_{n,k,\emptyset}(f(\ell))$ and $I_{n,k,\{1,\dots,k\}}(f(\ell))$
are identically distributed and the last two random variables are
independent of each other. Hence Lemma~B1 yields that
$$
\align
P(\|I_{n,k}(f(\ell))\|>u)
&\le3P\(\|I_{n,k,\emptyset}(f(\ell))
+I_{n,k,\{1,\dots,k\}}(f(\ell))\|>\frac23u\)\\
&=3P\(\left\|T_{n,k}(f(\ell))-\!\!\!\!\!\!
\sum_{V\:V\subset\{1,\dots,k\},\,
1\le|V|\le k-1} I_{n,k,V}(f(\ell))\right\|>\frac23u\) \!\!\!\!\!\!
\\
&\le P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u) \tag B4 \\
&\qquad+
\!\!\!\!\!\!\!\!\!
\summ_{V\:V\subset\{1,\dots,k\},\, 1\le|V|\le k-1}
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
P(3\cdot2^{k-1}\|I_{n,k,V}(f(\ell))\|>u).
\endalign
$$
To deduce relation (B2) from relation (B4) we need a good estimate
on the probability $P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u)$. We shall
compare the distribution of $\|T_{n,k}(f(\ell))\|$ with that of
$\|I_{n,k,V}(f(\ell))\|$ for an arbitrary set $V\subset\{1,\dots,k\}$
and get an estimate which is sufficient to prove relation~(B2). To
carry out this program first we prove the following lemmas.
\medskip\noindent
{\bf Lemma B2.} {\it Let us consider a sequence of independent random
variables $\e_1,\dots,\e_n$, $P(\e_l=1)=P(\e_l=-1)=\frac12$, $1\le l\le
n$, which is also independent of the random variables
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and $\xi_{1}^{(2)},\dots,\xi_n^{(2)}$
appearing in the definition of the decoupled $U$-statistics
$I_{n,k,V}(f(\ell))$ defined in formula (B1). Let us define with their
help the sequences of random variables
$\eta_{1}^{(1)},\dots,\eta_n^{(1)}$
and $\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ whose elements
$(\eta_l^{(1)},\eta_l^{(2)})=(\eta_l^{(1)}(\e_l),\eta_l^{(2)}(\e_l))$,
$1\le l\le n$, are given as
$$
(\eta_l^{(1)}(\e_l),\eta_l^{(2)}(\e_l))=\(\frac{1+\e_l}2\xi_l^{(1)}+
\frac{1-\e_l}2\xi_l^{(2)},\frac{1-\e_l}2\xi_l^{(1)}+
\frac{1+\e_l}2\xi_l^{(2)}\),
$$
i.e. let $(\eta_l^{(1)}(\e_l),\eta_l^{(2)}(\e_l))=
(\xi_l^{(1)},\xi_l^{(2)})$ if $\e_l=1$, and
$(\eta_l^{(1)}(\e_l),\eta_l^{(2)}(\e_l))=
(\xi_l^{(2)},\xi_l^{(1)})$ if $\e_l=-1$, $1\le l\le n$.
Then the joint distribution of the pair of sequences of random
variables $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ agrees with that of the pair of
sequences $\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$.
Let us fix some $V\subset\{1,\dots,k\}$, and introduce the random
variable
$$
\bar I_{n,k,V}(f(\ell))=\frac1{k!}\summ\Sb 1\le l_j\le n,\; j=1,\dots,
k\\ l_j\neq l_{j'} \text{ if } j\neq j'\endSb
f_{l_1,\dots,l_k}\(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\), \tag B5
$$
where similarly to formula (B1) $\alpha_V(j)=1$ if $j\in V$, and
$\alpha_V(j)=2$ if $j\notin V$. Then the identity
$$
\align
&2^k\bar I_{n,k,V}(f(\ell)) \tag B6 \\
&\qquad=\frac1{k!}\summ\Sb 1\le l_j\le n, s_j=1 \text{ or }s_j=2,\;
j=1,\dots, k\\ l_j\neq l_{j'} \text{ if } j\neq j'\endSb
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
(1+\e^{(1)}_{l_1,s_1,V})\cdots (1+\e^{(k)}_{l_k,s_k,V})
f_{l_1,\dots,l_k}\(\xi_{l_1}^{(s_1)},\dots,\xi_{l_k}^{(s_k)}\)
\endalign
$$
holds, where $\e^{(j)}_{l,1,V}=\e_l$, $\e^{(j)}_{l,2,V}=-\e_l$ if
$j\in V$, and $\e^{(j)}_{l,1,V}=-\e_l$, $\e^{(j)}_{l,2,V}=\e_l$ if
$j\notin V$, $1\le l\le n$.} \medskip
In the proof of relation (B2) we need beside Lemma~B2 another result
given in Lemma~B4. Before the formulation of Lemma~B4 we present
Lemma~B3 whose result will be used in its proof.
\medskip\noindent
{\bf Lemma B3.} {\it Let $Z$ be a random variable in a separable
Banach space $B$ with expectation zero, i.e. let $E\kappa(Z)=0$ for all
$\kappa\in B'$. Then $P(\|v+Z\|\ge\|v\|)\ge \inff_{\kappa\in B'}
\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$ for all $v\in B$.
Here $B'$ denotes the (Banach) space of all bounded linear
functionals on $B$, i.e. of the bounded linear transformations from
$B$ to the real line.}
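The quantity appearing in Lemma B3 comes from the scalar inequality $P(\xi>0)\ge\frac{(E|\xi|)^2}{4E\xi^2}$ for a zero-mean random variable $\xi$, applied to $\xi=\kappa(Z)$. On (arbitrarily chosen) two-point zero-mean distributions the bound can be computed in closed form, which gives a quick numerical sketch:

```python
import random

random.seed(0)
for _ in range(1000):
    # a two-point distribution: P(xi = a) = p, P(xi = b) = 1 - p, with E xi = 0
    p = random.uniform(0.05, 0.95)
    a = random.uniform(0.1, 10.0)
    b = -p * a / (1 - p)                   # forces the expectation to vanish
    e_abs = p * a + (1 - p) * abs(b)       # E|xi| = 2pa
    e_sq = p * a * a + (1 - p) * b * b     # E xi^2 = p a^2 / (1 - p)
    p_pos = p                              # P(xi > 0) = P(xi = a)
    # the lower bound equals p(1 - p) here, so the inequality holds with slack
    assert p_pos >= e_abs ** 2 / (4 * e_sq) - 1e-12
```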
\medskip\noindent
{\bf Lemma B4.} {\it Let us consider a sequence of independent
random variables $\e_1,\dots,\e_n$, $P(\e_l=1)=P(\e_l=-1)=\frac12$,
$1\le l\le n$, and a polynomial of order $k$ of these random variables
with some coefficients $a(l_1,\dots,l_s)$, $1\le s\le k$,
$1\le l_j\le n$, $1\le j\le s$, taken from some separable Banach
space $B$. Let us assume that the coefficients of this polynomial
satisfy the relation $a(l_1,\dots,l_s)=0$ if $l_p=l_q$ with some
$1\le p<q\le s$. Then the inequality
$$
P\(\left\|v+\sum_{s=1}^k\sum \Sb 1\le l_j\le n,\; j=1,\dots, s\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
a(l_1,\dots,l_s)\e_{l_1}\cdots\e_{l_s}\right\|\ge
\|v\|\)\ge c_k \tag B7
$$
holds for all $v\in B$ with some constant $c_k>0$ depending only on
the order $k$ of this polynomial.} \medskip\noindent
{\it The proof of Lemma B2.}\/ Let us consider the conditional joint
distribution of the sequences of random variables
$\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ under the condition that the
random vector $(\e_1,\dots,\e_n)$ takes the value of some prescribed
$\pm1$ series of length~$n$. Observe that this conditional
distribution agrees with the joint distribution of the sequences
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ for all possible conditions. This
fact implies the statement about the joint distribution of the
sequences $\eta_l^{(1)},\eta_l^{(2)}$, $1\le l\le n$.

To prove identity~(B6) let us fix a set $M\subset\{1,\dots,n\}$ and
consider the case when $\e_l=1$ if $l\in M$ and $\e_l=-1$ if
$l\notin M$. Observe that for all fixed sequences
$1\le l_1,\dots,l_k\le n$, $l_j\neq l_{j'}$ if $j\neq j'$,
$$
f_{l_1,\dots,l_k} \(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\)=
f_{l_1,\dots,l_k} \(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\),
$$
where $\beta_{V,M}(j,l)=1$ if either $j\in V$ and $l\in M$ or
$j\notin V$ and $l\notin M$, and $\beta_{V,M}(j,l)=2$ otherwise. On
the other hand,
$$
\align
&\summ_{s_j=1 \text{ or }s_j=2,\;j=1,\dots, k}
(1+\e^{(1)}_{l_1,s_1,V})\cdots (1+\e^{(k)}_{l_k,s_k,V})
f_{l_1,\dots,l_k}\(\xi_{l_1}^{(s_1)},\dots,\xi_{l_k}^{(s_k)}\)\\
&\qquad\qquad \qquad=2^k f_{l_1,\dots,l_k}
\(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\),
\endalign
$$
since the product $(1+\e^{(1)}_{l_1,s_1,V})\cdots
(1+\e^{(k)}_{l_k,s_k,V})$ equals either zero or $2^k$, and
$\e^{(j)}_{l_j,s_j,V}=1$ if $\beta_{V,M}(j,l_j)=s_j$, while
$\e^{(j)}_{l_j,s_j,V}=-1$ if $\beta_{V,M}(j,l_j)\neq s_j$.
Summing up these identities for all $1\le l_1,\dots,l_k\le n$ such
that $l_j\neq l_{j'}$ if $j\neq j'$ we get identity~(B6).
\medskip\noindent
{\it The proof of Lemma B3.}\/ Let us first observe that if $\xi$ is
a real valued random variable with zero expectation, then
$P(\xi>0)\ge \frac{(E|\xi|)^2}{4E\xi^2}$, since
$(E|\xi|)^2 =4\(E(\xi I(\{\xi>0\}))\)^2\le 4P(\xi>0)E\xi^2$ by the
Schwarz inequality, where $I(A)$ denotes the indicator function of
the set $A$. (Here we exploited that $E\xi=0$ implies the identity
$E|\xi|=2E(\xi I(\{\xi>0\}))$.) Given some $v\in B$ let us choose a
linear functional $\kappa$ such that $\|\kappa\|=1$ and
$\kappa(v)=\|v\|$. Such a functional exists by the Hahn--Banach
theorem. Observe that $\{\oo\:\|v+Z(\oo)\|\ge \|v\|\}
\supset \{\oo\: \kappa(v+Z(\oo))\ge\kappa(v)\}
=\{\oo\:\kappa(Z(\oo))\ge0\}$. Besides, $E\kappa(Z)=0$. Hence we can
apply the inequality proved above for $\xi=\kappa(Z)$, and it yields
that $P(\|v+Z\|\ge\|v\|) \ge \frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$.
Lemma~B3 is proved. \medskip\noindent
{\it Proof of Lemma B4.}\/ Take the class of random polynomials
$$
Y=\sum_{s=1}^k\sum \Sb 1\le l_j\le n,\; j=1,\dots, s\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
b(l_1,\dots,l_s)\e_{l_1}\cdots\e_{l_s},
$$
where $\e_l$, $1\le l\le n$, are independent random variables with
$P(\e_l=1)=P(\e_l=-1)=\frac12$, and the coefficients
$b(l_1,\dots,l_s)$, $1\le s\le k$, are arbitrary real numbers. It is
enough to show that there exists a constant $c_k$ depending only on
the order~$k$ of these polynomials such that the inequality
$$
(E|Y|)^2\ge 4c_k EY^2 \tag B8
$$
holds for all of these polynomials~$Y$. Indeed, formula (B7) follows
from relation~(B8) and Lemma B3 with
$\inff_\kappa\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}\ge c_k$ if we
apply them for the vector $v\in B$ in formula (B7) and
$$
Z=\sum_{s=1}^k\sum \Sb 1\le l_j\le n,\; j=1,\dots, s\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
a(l_1,\dots,l_s)\e_{l_1}\cdots\e_{l_s},
$$
where the infimum is taken for all bounded linear functionals
$\kappa$ on the Banach space $B$.
But this inequality follows from relation (B8). To prove
relation (B8) first we compare the moments $EY^2$ and $EY^4$. Let us
introduce the random variables
$$
Y_s=\sum \Sb 1\le l_j\le n,\; j=1,\dots, s\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
b(l_1,\dots,l_s)\e_{l_1}\cdots\e_{l_s},
\quad 1\le s\le k,
$$
and observe that because of Borell's inequality (Theorem~10.2) and
the uncorrelatedness of the random variables $Y_s$, $1\le s\le k$,
$$
\align
EY^4&=E\(\sum_{s=1}^k Y_s\)^4\le k^3\sum_{s=1}^k EY_s^4\le
k^3 3^{3k/2} \sum_{s=1}^k (EY_s^2)^2\\
&\le k^3 3^{3k/2}\(\sum_{s=1}^k EY_s^2\)^2=k^3 3^{3k/2}(EY^2)^2.
\endalign
$$
This estimate together with the H\"older inequality yields that
$EY^2=E\[(Y^4)^{1/3}|Y|^{2/3}\]\le (EY^4)^{1/3}(E|Y|)^{2/3}\le
k3^{k/2}(EY^2)^{2/3}(E|Y|)^{2/3}$, i.e.
$EY^2\le k^3 3^{3k/2}(E|Y|)^2$, and relation (B8) holds with
$4c_k=k^{-3}3^{-3k/2}$. Lemma~B4 is proved. \medskip
Let us turn back to the estimation of the probability
$P(3\cdot2^{k-1}\|T_{n,k}(f)\|>u)$. Let us introduce the
$\sigma$-algebra
$\Cal F=\Cal B(\xi_l^{(1)},\xi_l^{(2)},\,1\le l\le n)$ generated by
the random variables $\xi_l^{(1)},\xi_l^{(2)}$, $1\le l\le n$, and
fix some set $V\subset\{1,\dots,k\}$. We claim that there exists
some constant $c_k>0$ such that the random variable
$\bar I_{n,k,V}(f(\ell))$ defined in formula~(B5) satisfies the
inequality
$$
P\(\|2^k\bar I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\||\Cal F\)
\ge c_k \quad \text{ with probability 1.} \tag B9
$$
Indeed, formula (B6) and the independence of the random sequences
$\e_l$, $\xi^{(1)}_l$ and $\xi^{(2)}_l$, $1\le l\le n$, yield that
$$
\align
&P\(\|2^k\bar I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\||\Cal F\)\\
&\qquad=P_{\e_V}\biggl(\biggl\|\frac1{k!} \!\!
\summ\Sb 1\le l_j\le n, s_j=1 \text{ or }s_j=2,\; j=1,\dots, k\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
(1+\e^{(1)}_{l_1,s_1,V})\cdots (1+\e^{(k)}_{l_k,s_k,V})
f_{l_1,\dots,l_k}\!
\(\xi_{l_1}^{(s_1)},\dots,\xi_{l_k}^{(s_k)}\)\biggr\| \\
&\qquad \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad \qquad
>\|T_{n,k}(f(\ell))\|\biggr), \tag B10
\endalign
$$
where $P_{\e_V}$ means that we fix the values of the random
variables $\xi_l^{(1)}$, $\xi_l^{(2)}$, $1\le l\le n$, and take the
probability with respect to the remaining random variables
$\e^{(j)}_{l,s,V}$, $1\le j\le k$, $1\le l\le n$, and $s=1$ or
$s=2$. Let us observe that the random variable whose norm is
considered on the right-hand side of (B10) is a polynomial of
order~$k$ of the random variables $\e_1,\dots,\e_n$. (The terms
$\e^{(j)}_{l_j,s_j,V}$ taking part in it equal either $\e_{l_j}$ or
$-\e_{l_j}$ depending on the parameters~$j$ and~$s_j$.) Besides,
the constant term of this polynomial equals~$T_{n,k}(f(\ell))$.
Hence the conditional probability on the right-hand side of (B10)
can be bounded from below by means of Lemma~B4, and this result
yields relation (B9). Relation (B9) implies that
$$
\align
&P(\|2^k\bar I_{n,k,V}(f(\ell))\|\ge3\cdot2^{k-1} u) \\
&\qquad \ge P(\|2^k\bar I_{n,k,V}(f(\ell))\|\ge\|T_{n,k}(f(\ell))\|,
\|T_{n,k}(f(\ell))\|\ge3\cdot2^{k-1} u)\\
&\qquad\ge\int_{\{\oo\: \|T_{n,k}(f(\ell))(\oo)\|\ge3\cdot2^{k-1} u\}}
P\(\|2^k\bar I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\||\Cal F\)\,dP\\
&\qquad \ge c_k P(\|T_{n,k}(f(\ell))\|\ge3\cdot2^{k-1} u).
\endalign
$$
The last inequality with the choice of any set
$V\subset\{1,\dots,k\}$, $1\le |V|\le k-1$, together with
relation~(B4) imply formula~(B2).

To formulate the inductive hypothesis needed in the proof of
formula (10.8b) with the help of relation~(B2) first we introduce
the following quantities. Let $\Cal W=\Cal W(k)$ denote the set of
all partitions of the set $\{1,\dots,k\}$. Let us fix $k$
independent copies $\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$,
of the sequence of random variables $\xi_{1},\dots,\xi_n$.
Given a partition $W=(V_1,\dots,V_s)\in\Cal W(k)$ let us introduce
the function $s_W(j)$, $1\le j\le k$, which tells for each argument
$j$ the index of that element of the partition~$W$ which contains
the point $j$, i.e. the function $s_W(j)$, $1\le j\le k$, is
defined by the relation $j\in V_{s_W(j)}$. Let us define (actually
generalizing the notion introduced in formula~(B1)) the notion of
generalized decoupled $U$-statistics corresponding to a partition
$W\in\Cal W(k)$ as
$$
I_{n,k,W}(f(\ell))=\frac1{k!}\summ\Sb 1\le l_j\le n,\;j=1,\dots, k\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
f_{l_1,\dots,l_k}\(\xi_{l_1}^{(s_W(1))},\dots,\xi_{l_k}^{(s_W(k))}\)
\quad \text{for all }W\in\Cal W(k).
$$
Given a partition $W=(V_1,\dots,V_s)$ let us call the number $s$ of
the elements of this partition the rank both of the partition $W$
and of the generalized decoupled $U$-statistic $I_{n,k,W}(f(\ell))$.

Relation (10.8b) will be proved by induction with respect to the
order $k$ of the $U$-statistics. This statement clearly holds for
$k=1$, so when we prove it for $k$ we may assume that it holds for
all $k'<k$. First we show that there exist some constants
$C(k,j)>0$ and $\delta(k,j)>0$ such that for all generalized
decoupled $U$-statistics $I_{n,k,W}(f(\ell))$ of order $k$
$$
\aligned
&P(\|I_{n,k,W}(f(\ell))\|>u)\le C(k,j)P\(\|\bar
I_{n,k}(f(\ell))\|>\delta(k,j) u\) \\
&\qquad\text{for all }2\le j\le k \text{ if the rank of } W
\text{ equals }j.
\endaligned \tag B11
$$
(In relation (B11) we compare the distribution of some generalized
decoupled $U$-sta\-tis\-tics with that of the decoupled
$U$-statistic $\bar I_{n,k}(f(\ell))$.) We shall prove this
statement by means of a backward induction with respect to the rank
$j$ of the generalized decoupled $U$-statistics. Relation (B11)
clearly holds for $j=k$ with $C(k,k)=1$ and $\delta(k,k)=1$.
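The objects compared in this backward induction can be sketched in code (a toy illustration with an arbitrary kernel and the factor $\frac1{k!}$ dropped): the one-element partition has rank 1 and reproduces the original $U$-statistic, while the partition into singletons has rank $k$ and gives the fully decoupled statistic $\bar I_{n,k}(f)$.

```python
import itertools
import random

random.seed(2)
n, k = 5, 3
# k independent copies xi^(1),...,xi^(k) of the sample xi_1,...,xi_n
xi = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

def f(l, x):                      # a hypothetical kernel f_{l_1,l_2,l_3}
    return x[0] * x[1] * x[2] + 0.5 * l[0] * x[2]

def s_W(W):                       # s_W(j): index of the partition element containing j
    return {j: i for i, V in enumerate(W) for j in V}

def I_nkW(W):                     # generalized decoupled U-statistic I_{n,k,W}(f)
    s = s_W(W)
    return sum(f(l, [xi[s[j + 1]][l[j]] for j in range(k)])
               for l in itertools.permutations(range(n), k))

# rank-1 partition: every coordinate uses the copy xi^(1) -> ordinary U-statistic
coupled = sum(f(l, [xi[0][l[j]] for j in range(k)])
              for l in itertools.permutations(range(n), k))
assert abs(I_nkW([{1, 2, 3}]) - coupled) < 1e-9

# the partition into singletons has rank k and gives the decoupled statistic
decoupled = sum(f(l, [xi[j][l[j]] for j in range(k)])
                for l in itertools.permutations(range(n), k))
assert abs(I_nkW([{1}, {2}, {3}]) - decoupled) < 1e-9
```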
To prove it for generalized decoupled $U$-statistics of rank
$2\le j\le k-1$ we show that for a partition $W$ of such a rank
there exists a partition $\bar W$ whose rank is strictly larger
than that of $W$, and
$$
P(\|I_{n,k,W}(f(\ell))\|>u)\le \bar A(k) P\(\|I_{n,k,\bar W}
(f(\ell))\|>\bar \gamma(k) u\) \tag B12
$$
with $\bar A(k)=\supp_{j\le k-1}A(j)$ and
$\bar\gamma(k)=\inff_{j\le k-1}\gamma(j)$. To prove relation~(B12)
let us define the $\sigma$-algebra $\Cal F$ generated by the random
variables appearing in the first $s-1$ coordinates of these
generalized $U$-statistics. We show that relation (10.8b) for
$U$-statistics of order $k-s+1\le k-1$ yields that
$P(\|I_{n,k,W}(f(\ell))\|>u|\Cal F)\le \bar A(k)
P\(\|I_{n,k,\bar W}(f(\ell))\|>\bar\gamma(k) u|\Cal F\)$ with
probability~1. This inequality follows from our inductive
hypothesis, since the conditional probabilities we compare here are
tail probabilities of the generalized $U$-statistics and
generalized decoupled $U$-statistics of order $k-s+1$ we get by
substituting the (known) first $s-1$ coordinates in the generalized
$U$-statistics $I_{n,k,W}(f(\ell))$ and $I_{n,k,\bar W}(f(\ell))$.
Then taking expectation on both sides of this inequality we get
relation~(B12). As the rank of $\bar W$ is strictly greater than
the rank of $W$, relation (B12) together with our backward
inductive assumption imply relation (B11) for all $2\le j\le k$.

Inequality (10.8b) is a simple consequence of relations~(B2)
and~(B11). Indeed, in formula~(B2) the probability
$P\(\|I_{n,k}(f(\ell))\|>u\)$ is bounded by an expression in which
a linear combination of the probabilities appears that certain
generalized decoupled $U$-statistics of order $k$ and rank~2 are
larger than $uD_k^{-1}$. Each of these terms can be bounded by
means of relation~(B11), and in such a way we get
relation~(10.8b).

We prove formula $(10.8')$ first in the simpler case when the
supremum of finitely many functions is taken.
Let us have $M$ functions $f_1,\dots,f_M$, and to prove relation
$(10.8')$ in this case let us apply formula (10.8) with the
function $f=(f_1,\dots,f_M)$ taking values in the separable Banach
space $B_M$ consisting of the points $(v_1,\dots,v_M)$, $v_j\in B$,
$1\le j\le M$, with the norm
$\|(v_1,\dots,v_M)\|=\supp_{1\le j\le M}\|v_j\|$. The application
of formula (10.8) with this choice yields formula $(10.8')$ in this
case. Let us emphasize that the constants appearing in this
estimate do not depend on the number $M$. Since the random
variables $\supp_{1\le s\le M} \left\| I_{n,k}(f_s)\right\|$
converge to $\supp_{1\le s<\infty} \left\| I_{n,k}(f_s)\right\|$
and the random variables
$\supp_{1\le s\le M} \left\| \bar I_{n,k}(f_s)\right\|$ converge to
$\supp_{1\le s<\infty} \left\|\bar I_{n,k}(f_s)\right\|$ as
$M\to\infty$, we get the proof of relation $(10.8')$ in the general
case by taking the limit $M\to\infty$ in this relation.
\beginsection Appendix C. {\it Nelson's inequality and its application}

\medskip\noindent
In this part of the Appendix I formulate and prove Nelson's
inequality and briefly indicate how it can be applied in the proof
of Theorem 8.5, i.e. in the Gaussian counterpart of Theorems~8.3
and~8.4. As the latter problem does not belong to the main subject
of this work, the detailed explanation of some background results I
shall apply in the proof will be omitted. In particular, I do not
discuss the basic results about the properties of multiple
Wiener--It\^o integrals. These results can be found for instance in
my lecture note {\it Multiple Wiener--It\^o integrals}.\/ There are
several equivalent formulations of Nelson's inequality. First I
present its terminologically simplest form.
Before its formulation let me recall that the Hermite polynomials
$H_k(x)$, $k=0,1,2,\dots$, are those polynomials which constitute
an orthogonal system with respect to the normal density function
$\varphi(x)=\frac1{\sqrt{2\pi}}e^{-x^2/2}$. To fix their
normalization, let us make the agreement that $H_k(x)$ is a
polynomial of order $k$, and the coefficient of its leading term
$x^k$ equals 1. \medskip\noindent
{\bf Theorem C1. (Nelson's inequality).} {\it Let $(Y,\Cal Y,\nu)
=(R^\infty,\Cal B^\infty,\nu^\infty)$ be the direct product of
infinitely many copies of the space $(R,\Cal B,\lambda_{\varphi})$,
where $R$ denotes the real line, $\Cal B$ is the Borel
$\sigma$-algebra on it, and $\lambda_\varphi$ is the measure
determined by the standard normal distribution function, i.e. the
probability measure which is absolutely continuous with respect to
the Lebesgue measure with density function
$\varphi(y)=\frac1{\sqrt{2\pi}}e^{-y^2/2}$. Given a number
$\gamma>0$ introduce the operator $\bold T_\gamma$ on $(Y,\Cal Y)$
by defining it first on polynomials by the formula
$$
\align
&\bold T_\gamma\(\sum c_{l_1,j_1,\dots,l_s,j_s} H_{l_1}(y_{j_1})
\cdots H_{l_s}(y_{j_s})\) \\
&\qquad =\sum \gamma^{l_1+\cdots+l_s}c_{l_1,j_1,\dots,l_s,j_s}
H_{l_1}(y_{j_1})\cdots H_{l_s}(y_{j_s}),
\endalign
$$
where all finite sums of the above form are considered, and
$H_l(\cdot)$ denotes the Hermite polynomial of order $l$. Let us
extend this linear operator to general functions on the space
$(Y,\Cal Y)$ in the natural way. Fix two numbers
$1<q<p<\infty$. If $0<\gamma\le\sqrt{\frac{q-1}{p-1}}$, then
$$
\|\bold T_\gamma f\|_p\le\|f\|_q \quad \text{for all }
f\in L_q(Y,\Cal Y,\nu). \tag C1
$$
}
\medskip
For all $K>0$ there exists some constant $C=C(K)>0$ such that
$P(|Z_m^{(n)}|>x)\le Cx^{-K}$ for all $x\ge1$ and $n=1,2,\dots$.
This fact implies that relation (C4) also holds for continuous
functions $f$ such that $|f(x)|\le C(1+|x|)^K$ with some constants
$C>0$ and $K>0$, where $x=(x_1,\dots,x_m)$ and $|x|$ is the length
of the vector~$x$. This strengthened form of (C5) enables us to
take the limit $n\to\infty$ in formula (C4) and to get relation
(C3) in such a way.
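The monic normalization of the Hermite polynomials fixed at the beginning of this section can be checked numerically through the standard recurrence $H_{k+1}(x)=xH_k(x)-kH_{k-1}(x)$ valid for this normalization, together with the orthogonality relation $EH_j(\xi)H_k(\xi)=0$, $j\neq k$, and the identity $EH_k(\xi)^2=k!$ for a standard normal $\xi$ (the crude quadrature below is an illustration, not part of the proof):

```python
import math

def hermite(k, x):
    # monic ("probabilists'") Hermite polynomials: H_0 = 1, H_1 = x,
    # H_{k+1}(x) = x H_k(x) - k H_{k-1}(x)
    h0, h1 = 1.0, x
    if k == 0:
        return h0
    for j in range(1, k):
        h0, h1 = h1, x * h1 - j * h0
    return h1

def integrate(f, a=-12.0, b=12.0, n=100000):
    # plain midpoint rule for the integral of f against the normal density
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        total += f(x) * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return total * dx

# orthogonality: E H_2(xi) H_3(xi) = 0 for a standard normal xi
assert abs(integrate(lambda x: hermite(2, x) * hermite(3, x))) < 1e-6
# normalization: E H_3(xi)^2 = 3! = 6
assert abs(integrate(lambda x: hermite(3, x) ** 2) - 6.0) < 1e-4
```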
Let us apply formula (C3) with the choice $\xi_j=W\(\frac jm\)
-W\(\frac{j-1}m\)$, $1\le j\le m$. Observe that for all indices~$l$
the inner sums at both sides of this expression are approximating
sums for the Wiener--It\^o integral $\int W(\,dx_1)\cdots
W(\,dx_l)$. Hence it is natural to expect that by applying the
limiting procedure $m\to\infty$ in formula (C3) with the above
choice of the random variables $\xi_j$ we get relation (C2). This
belief is correct, but its justification requires the application
of some deeper results from the theory of Wiener--It\^o integrals.
We need an estimate which states that the high moments of a
Wiener--It\^o integral with a small kernel function are also small.
We can apply the following result. If $h$ is such a function on
$[0,1]^l$ for which
$\int |h(x_1,\dots,x_l)|^{2}\,dx_1\dots\,dx_l<\e$ with some
$\e>0$, then also the inequality
$E\left|\int h(x_1,\dots,x_l)W(\,dx_1)\dots W(\,dx_l)\right|^{2K}
\le C(K,l)\e^K$ holds for all $K=1,2,\dots$ with some constant
$C(K,l)$ depending only on $K$ and $l$. But the proof of this
estimate demands some deeper results about Wiener--It\^o integrals.
(In my lecture note about Wiener--It\^o integrals this result is
proved as a consequence of the so-called diagram formula.) By
applying this limiting procedure we get the proof of (C2). In such
a way we have proved Proposition~C2 which, as we have shown,
implies Theorem~C1. \medskip
Now I formulate a version of Nelson's inequality presented in the
language of Wiener--It\^o integrals. \medskip\noindent
{\bf Theorem C3.} {\it Let us fix a measurable space $(X,\Cal X)$
together with a $\sigma$-finite non-atomic measure $\mu$ on it, and
let $Z_\mu$ be an orthogonal Gaussian random measure with counting
measure $\mu$ on $(X,\Cal X)$. (See the definition of counting
measure before the formulation of Theorem~8.5.) For the sake of
simplicity let us assume that the space $L_2(X,\Cal X,\mu)$ is
separable.
Let us have a sequence of measurable functions $f_k(x_1,\dots,x_k)$
on $(X^k,\Cal X^k)$ and real constants $c_k$, $k=1,2,\dots$, and
also a constant $c_0$ such that
$$
c_0^2+\sum_{k=1}^\infty \frac{c_k^2}{k!}\int
f_k^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty. \tag C6
$$
Then
$$
\aligned
&E\left|c_0+\sum_{k=1}^\infty \gamma^k \frac{c_k}{k!}\int
f_k(x_1,\dots,x_k)Z_\mu(\,dx_1)\dots Z_\mu(\,dx_k)\right|^p \\
&\qquad \le \[E\left|c_0+\sum_{k=1}^\infty \frac{c_k}{k!}
\int f_k(x_1,\dots,x_k)
Z_\mu(\,dx_1)\dots Z_\mu(\,dx_k)\right|^q\]^{p/q}
\endaligned \tag C7
$$
if $1<q<p<\infty$ and $0<\gamma\le\sqrt{\frac{q-1}{p-1}}$.}
\medskip
Theorem C3 can be deduced from Theorem C1 with the help of It\^o's
formula, which expresses multiple Wiener--It\^o integrals by means
of Hermite polynomials of an orthonormal system. I omit the
details. Let us show how it implies the first estimate (8.11) of
Theorem~8.5. Put
$$
Z_{\mu,k}(f)=\frac1{k!}\int f(x_1,\dots,x_k)Z_\mu(\,dx_1)\dots
Z_\mu(\,dx_k), \qquad \sigma^2=EZ_{\mu,k}(f)^2.
$$
An application of inequality (C7) with the choice $q=2$, $p=2M$ and
$\gamma=(2M-1)^{-1/2}$ yields that
$EZ_{\mu,k}(f)^{2M}\le (2M-1)^{kM}\sigma^{2M}\le
\((2M)^k\sigma^2\)^M$. Hence the Markov inequality implies that
$$
P(|Z_{\mu,k}(f)|>u)\le
\frac{EZ_{\mu,k}(f)^{2M}}{u^{2M}}\le\(\frac{(2M)^k\sigma^2}{u^2}\)^M
$$
if $M\ge1$. We get with the choice
$M=\frac{1}{2e}\(\frac u\sigma\)^{2/k}$ that
$$
P(|Z_{\mu,k}(f)|>u)\le \exp\left\{-\frac k{2e}\(\frac
u\sigma\)^{2/k} \right\} \quad \text{if } u>(2e)^{k/2}\sigma.
$$
By choosing a sufficiently large constant $A\ge1$ at the right-hand
side of this inequality we get that formula (8.11) holds for all
$u\ge0$. The second inequality (8.12) of Theorem~8.5 can be proved
in the same way as Theorem 4.1 in the one-dimensional case. No
difficulty arises during the proof. The main point is that
inequality (8.11) holds for all $u>0$, hence the chaining argument
applied in the proof of Theorem~4.1 supplies the proof also in this
case. I omit the details. \medskip
Let me finally remark that Leonard Gross in his paper {\it
Logarithmic Sobolev inequalities}\/ also gave a proof of the Nelson
inequality by means of the hypercontractive inequality for
Rademacher functions. He showed that the central limit theorem
enables us to prove that the logarithmic Sobolev inequality holds
not only for the Markov process considered in Section~11, but also
for Wiener processes. This result together with the general theory
he presents imply an inequality which is equivalent to our
formula~(C1).
\parskip=1pt plus 0.5pt
\beginsection References:

\item{1.)} Alexander, K.
(1987) The central limit theorem for empirical processes over
Vapnik--\v{C}ervonenkis classes. {\it Ann. Probab.} {\bf 15},
178--203
\item{2.)} Arcones, M. A. and Gin\'e, E. (1993) Limit theorems for
$U$-processes. {\it Ann. Probab.} {\bf 21}, 1494--1542
\item{3.)} Arcones, M. A. and Gin\'e, E. (1994) $U$-processes
indexed by Vapnik--\v{C}ervonenkis classes of functions with
application to asymptotics and bootstrap of $U$-statistics with
estimated parameters. {\it Stoch. Proc. Appl.} {\bf 52}, 17--38
\item{4.)} Bennett, G. (1962) Probability inequalities for the sum
of independent random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 57}, 33--45
\item{5.)} Bonami, A. (1970) \'Etude des coefficients de Fourier
des fonctions de $L^p(G)$. {\it Ann. Inst. Fourier} {\bf 20},
335--402
\item{6.)} de la Pe\~na, V. H. and Gin\'e, E. (1999) {\it
Decoupling. From dependence to independence.}\/ Springer series in
statistics. Probability and its application. Springer Verlag, New
York, Berlin, Heidelberg
\item{7.)} de la Pe\~na, V. H. and Montgomery--Smith, S. (1995)
Decoupling inequalities for the tail-probabilities of multivariate
$U$-statistics. {\it Ann. Probab.} {\bf 23}, 806--816
\item{8.)} Dudley, R. M. (1978) Central limit theorems for
empirical measures. {\it Ann. Probab.}\/ {\bf 6}, 899--929
\item{9.)} Dudley, R. M. (1984) A course on empirical processes.
{\it Lecture Notes in Mathematics}\/ {\bf 1097}, 1--142. Springer
Verlag, New York
\item{10.)} Dudley, R. M. (1989) {\it Real Analysis and
Probability.}\/ Wadsworth \& Brooks, Pacific Grove, California
\item{11.)} Dudley, R. M. (1998) {\it Uniform Central Limit
Theorems.}\/ Cambridge University Press, Cambridge U.K.
\item{12.)} Dynkin, E. B. and Mandelbaum, A. (1983) Symmetric
statistics, Poisson processes and multiple Wiener integrals. {\it
Annals of Statistics\/} {\bf 11}, 739--745
\item{13.)} Gross, L. (1975) Logarithmic Sobolev inequalities.
{\it Amer. J. Math.}\/ {\bf 97}, 1061--1083
\item{14.)} Guionnet, A.
and Zegarlinski, B. (2003) Lectures on Logarithmic Sobolev
inequalities. {\it Lecture Notes in Mathematics} {\bf 1801},
1--134. Springer Verlag, New York
\item{15.)} Hoeffding, W. (1963) Probability inequalities for sums
of bounded random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 58}, 13--30
\item{16.)} Ledoux, M. (1996) On Talagrand deviation inequalities
for product measures. {\it ESAIM: Probab. Statist.}\/ {\bf 1},
63--87. Available at http://www.emath.fr/ps/.
\item{17.)} Major, P. (1981) Multiple Wiener--It\^o integrals.
{\it Lecture Notes in Mathematics\/} {\bf 849}, Springer Verlag,
Berlin, Heidelberg, New York
\item{18.)} Major, P. (1988) On the tail behaviour of the
distribution function of multiple stochastic integrals. {\it
Probability Theory and Related Fields} {\bf 78}, 419--435
\item{19.)} Major, P. (2005) An estimate about multiple stochastic
integrals with respect to a normalized empirical measure. Submitted
to {\it Studia Scientiarum Mathematicarum Hungarica.}
\item{20.)} Major, P. (2005) An estimate on the maximum of a nice
class of stochastic integrals. Submitted to {\it Probability Theory
and Related Fields.}
\item{21.)} Major, P. (2005) On a multivariate version of
Bernstein's inequality. Submitted to {\it Ann. Probab.}
\item{22.)} Major, P. (2005) A multivariate generalization of
Hoeffding's inequality. Submitted to {\it Ann. Probab.}
\item{23.)} Major, P. (2005) On the tail behaviour of multiple
random integrals and $U$-sta\-tis\-tics, on the supremum of classes
of such quantities, and some related questions. (An overview work
submitted to xxx)
\item{24.)} Major, P. and Rejt\H{o}, L. (1988) Strong embedding of
the distribution function under random censorship. {\it Annals of
Statistics} {\bf 16}, 1113--1132
\item{25.)} Major, P. and Rejt\H{o}, L. (1998) A note on
nonparametric estimations. In the conference volume for the 65th
birthday of Mikl\'os Cs\"org\H{o}, 759--774
\item{26.)} Massart, P.
(2000) About the constants in Talagrand's concentration inequalities for empirical processes. {\it Ann. Probab.}\/ {\bf 28}, 863--884 \item{27.)} Nelson, E. (1973) The free Markov field. J. Functional Analysis {\bf 12}, 211--227 \item{28.)} Pollard, D. (1984) {\it Convergence of Stochastic Processes.}\/ Springer Verlag, New York \item{29.)} Talagrand, M. (1996) New concentration inequalities in product spaces. {\it Invent. Math.} {\bf 126}, 505--563 \item{30.)} Vapnik, V. N. (1995) {\it The Nature of Statistical Learning Theory.} Springer Verlag, New York \vfill\eject \centerline {\script Content} $$ \vbox{\halign{\hfill # \ &\vtop{\hsize=12truecm\parindent=0pt # \vskip3pt} \quad &\vtop{\hsize=0.5truecm\parindent=0pt # \vskip3pt} \cr 1. & Introduction \dotfill &\rightline{1}\cr 2. & Motivation of the investigation. Discussion of some problems \dotfill & \rightline{3}\cr 3. & Some estimates about sums of independent random variables \dotfill & \rightline{10}\cr 4. & On the supremum of a nice class of partial sums \dotfill & \rightline{15}\cr 5. & Vapnik--\v{C}ervonenkis classes and $L_2$-dense classes of functions \dotfill & \rightline{23}\cr 6. & The proof of Theorems 4.1 and 4.2 on the supremum of random sums \dotfill & \vskip5pt \rightline{26} \cr 7. & The completion of the proof of Theorem 4.1 \dotfill & \rightline{33}\cr 8. & Formulation of the main results of this work \dotfill & \rightline{40}\cr 9. & Some results about $U$-statistics \dotfill & \rightline{47}\cr 10. & The proof of Theorem 8.3 about the distribution of $U$-statistics \dotfill & \rightline{60}\cr 11. & Some useful basic results \dotfill & \rightline{69}\cr 12. & Reduction of the main result in this work \dotfill & \rightline{79}\cr 13. & The strategy of the proof for the main result of this paper \dotfill & \rightline{87}\cr 14. & A symmetrization argument \dotfill & \rightline{92}\cr 15. & The proof of the main result \dotfill &\rightline{105}\cr 16. 
& The improvement of some results in Section 8 \dotfill
&\rightline{115}\cr 17. & An overview of the results in this work
\dotfill &\rightline{121}\cr \noalign{\medskip} &Appendix A. The
proof of some results about Vapnik--\v{C}ervonenkis classes
\dotfill & \vskip5pt \rightline{134}\cr & Appendix B. The proof of
Theorem 10.3. (A result of de la Pe\~na and Montgomery--Smith)
\dotfill &\vskip5pt \rightline{135}\cr &Appendix C. Nelson's
inequality and its application \dotfill &\rightline{143}\cr
\noalign{\medskip} &References \dotfill & \rightline{148}\cr}} $$
\bye