# Nonparametric Bayesian estimation of several black-box functions of different variables...

Discussion in 'Mathematics' started by Fabrice Pautot, Oct 8, 2018.

1. In order to introduce my problem, let’s start with the nonparametric estimation of a single unknown/black-box function $f:{\Omega _f} \to \mathbb{R}$ of a discrete variable $x$ in a finite domain ${\Omega _f} \triangleq \left\{ {{x^1},{x^2},...,{x^m}} \right\}$ such as Cartesian grid after discretization of a continuous function.

Hence we want to estimate the values $f\left( x \right)$ for any $x \in {\Omega _f}$ from samples ${\left( {{x_i},{z_i}} \right)_{i = 1,N}} \triangleq \left( {{\mathbf{x}},{\mathbf{z}}} \right)$ corrupted by additive noise. Our nonparametric regression/interpolation model is

$Z = f\left( X \right) + \xi$

Suppose $\xi \sim \mathcal{N}\left( {0,{\sigma _0}^2} \right)$ and ${\sigma _0}$ is given a priori. Let ${\mathbf{f}} \triangleq {\left( {f\left( {{x^1}} \right),f\left( {{x^2}} \right),...,f\left( {{x^m}} \right)} \right)^T}$ be the vector of parameters to estimate. We have the Gaussian conditional likelihood

$p\left( {\left. {\mathbf{z}} \right|{\mathbf{x}},{\mathbf{f}},{\sigma _0}} \right) \propto \prod\limits_{i = 1}^N {{e^{ - \frac{{{{\left( {{z_i} - {x_i}} \right)}^2}}}{{2{\sigma _0}^2}}}}} \propto {e^{ - \frac{1}{2}{{\mathbf{f}}^T}{\mathbf{Df}} + {{\mathbf{J}}^T}{\mathbf{f}}}}$

where ${\mathbf{D}}$ is the diagonal matrix

${\mathbf{D}} \triangleq {\sigma _0}^{ - 2}{\text{diag}}\left( {\left| {\left\{ {i = 1,N\backslash {x_i} = {x^1}} \right\}} \right|,\left| {\left\{ {i = 1,N\backslash {x_i} = {x^2}} \right\}} \right|,...,\left| {\left\{ {i = 1,N\backslash {x_i} = {x^m}} \right\}} \right|} \right)$

and

${{\mathbf{J}}^T} \triangleq \sigma _0^{ - 2}\left( {\sum\limits_{\left\{ {i = 1,N\backslash {x_i} = {x^1}} \right\}} {{z_i}} ,\sum\limits_{\left\{ {i = 1,N\backslash {x_i} = {x^2}} \right\}} {{z_i}} ,...,\sum\limits_{\left\{ {i = 1,N\backslash {x_i} = {x^m}} \right\}} {{z_i}} } \right)$

We introduce a Gaussian smoothing/regularization prior

$p\left( {\left. {\mathbf{f}} \right|\varepsilon } \right) \propto {\varepsilon ^m}{e^{ - \frac{{{\varepsilon ^2}}}{2}{{\mathbf{f}}^T}{\mathbf{Rf}}}}$

where ${\mathbf{R}}$ is a symmetric regularization matrix and $\varepsilon$ is the regularization hyperparameter. We have

${\mathbf{R}} = {{\mathbf{L}}^T}{\mathbf{L}}$

where ${\mathbf{L}}$ is a linear operator such as a discrete second-order derivative/Laplacian operator obtained from a finite differences numerical scheme such as the forward/centered/backward second-order accuracy scheme for a Cartesian grid with step $\Delta x$

${\mathbf{L}} = \frac{1}{{{{\left( {\Delta x} \right)}^2}}}\left( {\,\begin{array}{*{20}{c}} 2&{ - 5}&4&{ - 1}&0& \cdots &0 \\ 1&{ - 2}&1&0&0& \cdots & \vdots \\ 0&1&{ - 2}&1&0& \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots &0 \\ \vdots & \ddots &0&1&{ - 2}&1&0 \\ \vdots & \cdots &0&0&1&{ - 2}&1 \\ 0& \cdots &0&1&4&{ - 5}&2 \end{array}} \right)$

Note that ${\mathbf{L}}$ and ${\mathbf{R}}$ are only semi positive definite with rank deficiency 2 so that $p\left( {\left. {\mathbf{f}} \right|\varepsilon } \right)$ is a degenerate, improper Gaussian probability distribution without a probability density function and a mathematical expectation. Singularity of ${\mathbf{R}}$ is essential for regularization: if ${\mathbf{R}}$ is non-singular/positive definite, then $\mathbb{E}{\mathbf{f}} = {\mathbf{0}}$ . This has nothing to do with regularization and is wrong in general unless we precisely have this piece of prior information. For this reason, in order to make the posterior probability distributions proper/convergent, it is necessary to introduce an hyperprior like

$p\left( \varepsilon \right) \propto {\varepsilon ^{ - \alpha }}$

with $\alpha > 2$. So, let

$p\left( \varepsilon \right) \propto {\varepsilon ^{ - 3}}$

After analytical marginalization of ${\mathbf{f}} \in {\mathbb{R}^m}$ , the marginal posterior probability for the regularization hyperparameter reads

$p\left( {\left. \varepsilon \right|{\mathbf{x}},{\mathbf{z}},{\sigma _0}^2} \right) = \frac{1}{Z}{\varepsilon ^{m - 3}}{\left| {{\varepsilon ^2}{\mathbf{R}} + {\mathbf{D}}} \right|^{ - \frac{1}{2}}}e{}^{\frac{1}{2}{{\mathbf{J}}^T}{{\left( {{\varepsilon ^2}{\mathbf{R}} + {\mathbf{D}}} \right)}^{ - 1}}{\mathbf{J}}}$

Finally, ${\mathbf{f}}$ is estimated as the posterior expectation

$\begin{gathered} {\mathbf{\hat f}} = \mathbb{E}\left. {\mathbf{f}} \right|{\mathbf{x}},{\mathbf{z}},{\sigma _0}^2 = \int\limits_{{\mathbb{R}^ + }} {\mathbb{E}\left. {\mathbf{f}} \right|\varepsilon ,{\mathbf{x}},{\mathbf{z}},{\sigma _0}^2p\left( {\left. \varepsilon \right|{\mathbf{x}},{\mathbf{z}},{\sigma _0}^2} \right){\text{d}}\varepsilon } = \hfill \\ \frac{1}{Z}\int\limits_{{\mathbb{R}^ + }} {{\varepsilon ^{m - 3}}{{\left| {{\varepsilon ^2}{\mathbf{R}} + {\mathbf{D}}} \right|}^{ - \frac{1}{2}}}e{}^{\frac{1}{2}{{\mathbf{J}}^T}{{\left( {{\varepsilon ^2}{\mathbf{R}} + {\mathbf{D}}} \right)}^{ - 1}}{\mathbf{J}}}{{\left( {{\varepsilon ^2}{\mathbf{R}} + {\mathbf{D}}} \right)}^{ - 1}}{\mathbf{J}}{\text{d}}} \varepsilon \hfill \\ \end{gathered}$

or simply as

${\mathbf{\hat f}} = {\left( {{{\hat \varepsilon }^2}{\mathbf{R}} + {\mathbf{D}}} \right)^{ - 1}}{\mathbf{J}}$

if $\hat \varepsilon$ is an estimate of $\varepsilon$.

No problem at all up to this point, that works very well.

Now, consider the same problem with the sum of two (or several) functions $f$ and $g$ of two different variables $x$ and $y$ on two possibly different domains ${\Omega _f}$ and ${\Omega _g}$ respectively

$Z = f\left( X \right) + g\left( Y \right) + \xi$

to be estimated in the same way from samples ${\left( {{x_i},{y_i},{z_i}} \right)_{i = 1,N}} \triangleq \left( {{\mathbf{x}},{\mathbf{y}},{\mathbf{z}}} \right)$ .

Let’s stack the unknown parameters to be estimated in this way

$\Theta = {\left( {{{\mathbf{f}}^T},{{\mathbf{g}}^T}} \right)^T}$

The regularization matrix becomes the block diagonal matrix

${\mathbf{R}} = \left( {\begin{array}{*{20}{c}} {\left. {\underline {\, {{\varepsilon _f}^2{{\mathbf{R}}_f}} \,}}\! \right| }&0 \\ 0&{\left| \!{\overline {\, {{\varepsilon _g}^2{{\mathbf{R}}_g}} \,}} \right. } \end{array}} \right)$

where ${{\mathbf{R}}_f}$ and ${{\mathbf{R}}_g}$ resp. ${\varepsilon _f}$ and ${\varepsilon _g}$ are the regularization matrices resp. hyperparameters for functions $f$ and $g$ respectively. The data matrix becomes the symmetric block matrix

$\begin{gathered} {\mathbf{D}} = \hfill \\ {\sigma _0}^{ - 2}\left( \begin{gathered} \begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {x_i} = {x^1}} \right\}} \right|}&0&0 \\ 0& \ddots &0 \\ 0&{\quad \quad \quad \;0\;\quad \quad \quad }&{\left| {\left\{ {i\backslash {x_i} = {x^m}} \right\}} \right|} \end{array}\begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {x_i} = {x^1},{y_i} = {y^1}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^1},{y_i} = {y^n}} \right\}} \right|} \\ \vdots &{}& \vdots \\ {\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^1}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^n}} \right\}} \right|} \end{array} \\ \begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {x_i} = {x^1},{y_i} = {y^1}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^1}} \right\}} \right|} \\ \vdots &{}& \vdots \\ {\left| {\left\{ {i\backslash {x_i} = {x^1},{y_i} = {y^n}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^n}} \right\}} \right|} \end{array}\begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {y_i} = {y^1}} \right\}} \right|}&0&0 \\ 0& \ddots &0 \\ 0&{\quad \quad \quad \;0\quad \quad \quad \;}&{\left| {\left\{ {i\backslash {y_i} = {y^n}} \right\}} \right|} \end{array} \\ \end{gathered} \right) \hfill \\ \end{gathered}$

and the vector ${\mathbf{J}}$ becomes

${{\mathbf{J}}^T} = \sigma _0^{ - 2}\left( {\sum\limits_{\left\{ {i\backslash {x_i} = {x^1}} \right\}} {{z_i}} ,\sum\limits_{\left\{ {i\backslash {x_i} = {x^2}} \right\}} {{z_i}} ,...,\sum\limits_{\left\{ {i\backslash {x_i} = {x^n}} \right\}} {{z_i}} ,\sum\limits_{\left\{ {i\backslash {y_i} = {y^1}} \right\}} {{z_i}} ,\sum\limits_{\left\{ {i\backslash {y_i} = {y^2}} \right\}} {{z_i}} ,...,\sum\limits_{\left\{ {i\backslash {y_i} = {y^n}} \right\}} {{z_i}} } \right)$

Of course, now the system of $mn$ linear equations

$\forall i = 1,m\forall j = 1,n\quad f\left( {{x^i}} \right) + g\left( {{y^j}} \right) = {z_{i,j}}$

is underdetermined with rank deficiency $1$ because the functions are determined only up to an additive constant. We can make it well determined by discarding one unknown. This can be done in different ways:

• By fixing one particular value of one function, e.g. $f\left( {{x^1}} \right) = 0 \Leftrightarrow p\left( {f\left( {{x^1}} \right)} \right) = \delta \left( {f\left( {{x^1}} \right)} \right)$ or $g\left( {{x^1}} \right) = 0$ , etc.;

• By constraining one function to have zero mean, e.g. $\sum\limits_{i = 1}^n {f\left( {{x^i}} \right)} = 0$ and substituting one value, for instance $f\left( {{x^1}} \right) = - \sum\limits_{k = 2}^n {f\left( {{x^k}} \right)}$;

• …

For instance, if $f\left( {{x^1}} \right) = 0$ then $\Theta = {\left( {f\left( {{x^2}} \right),...,f\left( {{x^m}} \right),g\left( {{y^1}} \right),...,g\left( {{y^n}} \right)\,} \right)^T}$ and

• the regularization matrix becomes ${{\mathbf{R}}_f} = {{\mathbf{L}}_f}^T{{\mathbf{L}}_f}$ with a centered finite difference at ${x^2}$

${{\mathbf{L}}_f} = \frac{1}{{{{\left( {\Delta x} \right)}^2}}}\left( {\,\begin{array}{*{20}{c}} { - 2}&1&0&0&0& \cdots &0 \\ 1&{ - 2}&1&0&0& \cdots & \vdots \\ 0&1&{ - 2}&1&0& \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots &0 \\ \vdots & \ddots &0&1&{ - 2}&1&0 \\ \vdots & \cdots &0&0&1&{ - 2}&1 \\ 0& \cdots &0&1&4&{ - 5}&2 \end{array}} \right)$

• the data matrix ${\mathbf{D}}$ becomes

$\begin{gathered} {\mathbf{D}} = \hfill \\ {\sigma _0}^{ - 2}\left( \begin{gathered} \begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {x_i} = {x^2}} \right\}} \right|}&0&0 \\ 0& \ddots &0 \\ 0&{\quad \quad \quad \;0\;\quad \quad \quad }&{\left| {\left\{ {i\backslash {x_i} = {x^m}} \right\}} \right|} \end{array}\begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {x_i} = {x^2},{y_i} = {y^1}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^2},{y_i} = {y^n}} \right\}} \right|} \\ \vdots &{}& \vdots \\ {\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^1}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^n}} \right\}} \right|} \end{array} \\ \begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {x_i} = {x^2},{y_i} = {y^1}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^1}} \right\}} \right|} \\ \vdots &{}& \vdots \\ {\left| {\left\{ {i\backslash {x_i} = {x^2},{y_i} = {y^n}} \right\}} \right|}& \cdots &{\left| {\left\{ {i\backslash {x_i} = {x^m},{y_i} = {y^n}} \right\}} \right|} \end{array}\begin{array}{*{20}{c}} {\left| {\left\{ {i\backslash {y_i} = {y^1}} \right\}} \right|}&0&0 \\ 0& \ddots &0 \\ 0&{\quad \quad \quad \;0\quad \quad \quad \;}&{\left| {\left\{ {i\backslash {y_i} = {y^n}} \right\}} \right|} \end{array} \\ \end{gathered} \right) \hfill \\ \end{gathered}$

• and the vector ${\mathbf{J}}$ becomes

${{\mathbf{J}}}^T = \sigma _0^{ - 2}\left( {\sum\limits_{\left\{ {i\backslash {x_i} = {x^2}} \right\}} {{z_i}} ,\sum\limits_{\left\{ {i\backslash {x_i} = {x^3}} \right\}} {{z_i}} ,...,\sum\limits_{\left\{ {i\backslash {x_i} = {x^n}} \right\}} {{z_i}} ,\sum\limits_{\left\{ {i\backslash {y_i} = {y^1}} \right\}} {{z_i}} ,\sum\limits_{\left\{ {i\backslash {y_i} = {y^2}} \right\}} {{z_i}} ,...,\sum\limits_{\left\{ {i\backslash {y_i} = {y^n}} \right\}} {{z_i}} } \right)$

In any case, the new matrix ${{\mathbf{R}}} + {{\mathbf{D}}}$ of dimension $m + n - 1$ is non singular for any ${\varepsilon _f},{\varepsilon _g} > 0$ . However, unless I’m mistaken, none of those approaches work: $p\left( {\left. {{\varepsilon _f},{\varepsilon _g}} \right|{\mathbf{x}},{\mathbf{y}},{\mathbf{z}},{\sigma _0}^2} \right)$, $p\left( {\left. {{\varepsilon _f}} \right|{\mathbf{x}},{\mathbf{y}},{\mathbf{z}},{\sigma _0}^2} \right)$ and $p\left( {\left. {{\varepsilon _g}} \right|{\mathbf{x}},{\mathbf{y}},{\mathbf{z}},{\sigma _0}^2} \right)$ look pretty good but the functions $f$ and $g$ are finally not properly estimated (up to additive constant), especially the one on which the constraint was added. It seems like there is a bad interaction between the improper regularization prior and the proper additional constraint.

I see no fundamental reason why functions $f$ and $g$ could not be properly estimated up to additive constant, the problem looks well posed. So please, what’s wrong here? Is there a right way to remove the underdetermination and to finally estimate those functions properly? I keep on investigating but any help would be appreciated.