censoredData

  |   Source
In [2]:
import matplotlib.pyplot as pl
import scipy.special as sp
import numpy as np

%matplotlib inline

It is usual in Astrophysics to have upper/lower limits detections on a certain quantity that one is measuring. When this happens, it is usual to give this upper/lower limit as the measured quantity. One option to deal with these measurements is to drop them from the analysis. But they might contain important information for the solution to our problem so it makes sense to include them in the analysis.

Assume the following generative model: $$y_i = z_i + e_i$$

where $z_i$ is our predicted value and $e_i$ is a source of uncertainty. We will represent the probability distribution for proposition $Z_i$ using the function

$$p(Z_i) = f_Z(z_i)$$

while the same can be done for the source of uncertainty:

$$p(E_i) = f_E(e_i)$$

The likelihood for $Y_i$ can be obtained by dealing with the joint distribution of $Y_i$, $Z_i$ and $E_i$ and applying marginalization and standard probability calculus:

$$p(Y_i) = \int dZ_i dE_i p(Y_i,Z_i,E_i) = \int dZ_i dE_i p(Y_i|E_i,Z_i) p(E_i) p(Z_i)$$

Given that $y_i = z_i + e_i$, $p(Y_i|E_i,Z_i) = \delta(y_i-z_i-e_i)$, so that:

$$p(Y_i) = \int dz_i f_Z(z_i) \int de_i f_E(e_i) \delta (y_i-z_i-e_i) = \int dz_i f_Z(z_i) f_E(y_i-z_i)$$

Therefore, the likelihood is the convolution of the distribution for the error and the distribution for our predicted value.

Let us consider the case of a variable that is censored from below, so that $z_i=z_i^\star$ when $z_i^\star > \tau$ and $z_i=\tau$ when $z_i

$z_i^\star > \tau$

In this case, we have

$$f_E(y_i-z_i) = N(y_i-z_i|0,\sigma^2)$$

where $N(x|\mu,\sigma^2)$ is the normal distribution for variable $x$ with mean $\mu$ and variance $\sigma^2$. Concerning the distribution of the predicted value, we have

$$f_Z(z_i) = \delta(z_i-y^\star)$$

Computing the convolution integral, we find the standard result:

$$p(Y_i)=N(y_i|y^\star,\sigma^2)$$

$z_i^\star

In this case, we again have

$$f_E(y_i-z_i) = N(y_i-z_i|0,\sigma^2)$$

but then

$$f_Z(z_i) = H(z_i|a,\tau)$$

where $H(x|a,b)$ is the step function that equals $1/(b-a)$ on the interval $[a,b]$ and zero otherwise. Taking this into account, we can compute the likelihood as:

$$p(Y_i)=\frac{1}{2} \left( \mathrm{erf} \left( \frac{\tau-y^\star}{\sqrt{2} \sigma} \right) - \mathrm{erf} \left( \frac{a-y^\star}{\sqrt{2} \sigma} \right) \right)$$

For instance, assuming that we have $a=0$, we end up with:

$$p(Y_i)=\frac{1}{2} \left( \mathrm{erf} \left( \frac{\tau-y^\star}{\sqrt{2} \sigma} \right) + \mathrm{erf} \left( \frac{y^\star}{\sqrt{2} \sigma} \right) \right)$$

Likewise, if $a \to -\infty$, we have:

$$p(Y_i)=\frac{1}{2} \left( 1 + \mathrm{erf} \left( \frac{\tau-y^\star}{\sqrt{2} \sigma} \right) \right)$$

Here is an example of the likelihood when a positive variable is censored at $\tau=20$ and with several values of the noise variance:

In [3]:
tau = 20
a = 0
sigma = np.asarray([5,10,20])
x = np.linspace(0.0,100.0,100)
fig, ax = pl.subplots()
for s in sigma:
    ax.plot(x, 0.5*(sp.erf((tau-x)/(np.sqrt(2)*s)) - sp.erf((a-x)/(np.sqrt(2)*s))))
In [ ]: