| Bernoulli distribution | |
|---|---|
| Probability mass function | Three examples of the Bernoulli distribution: $P(x{=}0)=0.2$, $P(x{=}0)=0.8$, and $P(x{=}0)=0.5$ |
| Parameters | $0\leq p\leq 1$ |
| Support | $k\in \{0,1\}$ |
| PMF | $\begin{cases}q=1-p&\text{if }k=0\\p&\text{if }k=1\end{cases}$ |
| CDF | $\begin{cases}0&\text{if }k<0\\1-p&\text{if }0\leq k<1\\1&\text{if }k\geq 1\end{cases}$ |
| Mean | $p$ |
| Median | $\begin{cases}0&\text{if }p<1/2\\\left[0,1\right]&\text{if }p=1/2\\1&\text{if }p>1/2\end{cases}$ |
| Mode | $\begin{cases}0&\text{if }p<1/2\\0,1&\text{if }p=1/2\\1&\text{if }p>1/2\end{cases}$ |
| Variance | $p(1-p)=pq$ |
| MAD | $2p(1-p)=2pq$ |
| Skewness | $\dfrac{q-p}{\sqrt{pq}}$ |
| Excess kurtosis | $\dfrac{1-6pq}{pq}$ |
| Entropy | $-q\ln q-p\ln p$ |
| MGF | $q+pe^{t}$ |
| CF | $q+pe^{it}$ |
| PGF | $q+pz$ |
| Fisher information | $\dfrac{1}{pq}$ |
In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli,[1] is the discrete probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q=1-p$. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are Boolean-valued: a single bit whose value is success/yes/true/one with probability $p$ and failure/no/false/zero with probability $q$. It can be used to represent a (possibly biased) coin toss where 1 and 0 would represent "heads" and "tails", respectively, and $p$ would be the probability of the coin landing on heads (or vice versa, where 1 would represent tails and $p$ would be the probability of tails). In particular, unfair coins would have $p\neq 1/2$.
The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so n would be 1 for such a binomial distribution). It is also a special case of the two-point distribution, for which the possible outcomes need not be 0 and 1.[2]
Properties
If $X$ is a random variable with a Bernoulli distribution, then:
$$\begin{aligned}\Pr(X{=}1)&=p,\\\Pr(X{=}0)&=q=1-p.\end{aligned}$$
The probability mass function $f$ of this distribution, over possible outcomes $k$, is[3]
$$f(k;p)={\begin{cases}p&{\text{if }}k=1,\\q=1-p&{\text{if }}k=0.\end{cases}}$$
This can also be expressed as
$$f(k;p)=p^{k}(1-p)^{1-k}\quad {\text{for }}k\in \{0,1\}$$
or as
$$f(k;p)=pk+(1-p)(1-k)\quad {\text{for }}k\in \{0,1\}.$$
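The two expressions above agree on $k\in\{0,1\}$. As a quick check, here is a minimal Python sketch of the mass function; the helper names `bernoulli_pmf` and `bernoulli_pmf_linear` are purely illustrative, not part of any library:

```python
def bernoulli_pmf(k: int, p: float) -> float:
    """Exponential form: f(k; p) = p**k * (1 - p)**(1 - k), for k in {0, 1}."""
    if k not in (0, 1):
        raise ValueError("k must be 0 or 1")
    return p ** k * (1 - p) ** (1 - k)

def bernoulli_pmf_linear(k: int, p: float) -> float:
    """Equivalent linear form: f(k; p) = p*k + (1 - p)*(1 - k)."""
    return p * k + (1 - p) * (1 - k)

# Both forms give 0.7 for k = 0 and 0.3 for k = 1 when p = 0.3.
for k in (0, 1):
    print(k, bernoulli_pmf(k, 0.3), bernoulli_pmf_linear(k, 0.3))
```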
The Bernoulli distribution is a special case of the binomial distribution with $n=1$.[4]
The kurtosis goes to infinity for high and low values of $p$, but for $p=1/2$ the two-point distributions, including the Bernoulli distribution, have a lower excess kurtosis, namely −2, than any other probability distribution.
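A small numerical illustration of that behavior, assuming only the excess-kurtosis formula $(1-6pq)/(pq)$ from the table above (pure Python, illustrative helper name):

```python
def excess_kurtosis(p: float) -> float:
    """Excess kurtosis (1 - 6*p*q) / (p*q) of a Bernoulli(p) variable."""
    q = 1 - p
    return (1 - 6 * p * q) / (p * q)

# Equals -2 at p = 0.5 and grows without bound as p approaches 0 or 1.
for p in (0.5, 0.1, 0.01):
    print(p, excess_kurtosis(p))   # -2.0, ~5.11, ~95.01
```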
The Bernoulli distributions for $0\leq p\leq 1$ form an exponential family.
The maximum likelihood estimator of $p$ based on a random sample is the sample mean.
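A minimal sketch of that estimator on simulated data (the function and variable names are hypothetical, and the sample is generated rather than observed):

```python
import random

def estimate_p(sample) -> float:
    """Maximum likelihood estimate of p for 0/1 Bernoulli data: the sample mean."""
    return sum(sample) / len(sample)

random.seed(0)
true_p = 0.3
sample = [1 if random.random() < true_p else 0 for _ in range(10_000)]
print(estimate_p(sample))   # close to 0.3
```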
Mean
The expected value of a Bernoulli random variable $X$ is
$$\operatorname {E} [X]=p.$$
This is because for a Bernoulli distributed random variable $X$ with $\Pr(X{=}1)=p$ and $\Pr(X{=}0)=q$ we find[3]
$$\begin{aligned}\operatorname {E} [X]&=\Pr(X{=}1)\cdot 1+\Pr(X{=}0)\cdot 0\\&=p\cdot 1+q\cdot 0\\&=p.\end{aligned}$$
Variance
The variance of a Bernoulli distributed $X$ is
$$\operatorname {Var} [X]=pq=p(1-p)$$
We first find
$$\begin{aligned}\operatorname {E} [X^{2}]&=\Pr(X{=}1)\cdot 1^{2}+\Pr(X{=}0)\cdot 0^{2}\\&=p\cdot 1^{2}+q\cdot 0^{2}\\&=p=\operatorname {E} [X]\end{aligned}$$
From this follows[3]
$$\begin{aligned}\operatorname {Var} [X]&=\operatorname {E} [X^{2}]-\operatorname {E} [X]^{2}=\operatorname {E} [X]-\operatorname {E} [X]^{2}\\&=p-p^{2}=p(1-p)=pq\end{aligned}$$
With this result it is easy to prove that, for any Bernoulli distribution, the variance lies in the interval $[0,1/4]$.
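The bound can be checked directly: a short sketch that computes $\operatorname{Var}[X]=\operatorname{E}[X^{2}]-\operatorname{E}[X]^{2}$ from the two outcomes and scans a grid of $p$ values (helper name is illustrative):

```python
def bernoulli_variance(p: float) -> float:
    """Var[X] = E[X^2] - E[X]^2, both expectations taken over the outcomes 0 and 1."""
    e_x = p * 1 + (1 - p) * 0              # E[X]   = p
    e_x2 = p * 1 ** 2 + (1 - p) * 0 ** 2   # E[X^2] = p
    return e_x2 - e_x ** 2                 # p - p^2 = p*(1 - p)

# The variance stays inside [0, 1/4] and peaks at p = 0.5.
grid = [i / 100 for i in range(101)]
variances = [bernoulli_variance(p) for p in grid]
print(max(variances))                          # 0.25
print(grid[variances.index(max(variances))])   # 0.5
```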
Skewness
The skewness is $\frac{q-p}{\sqrt{pq}}=\frac{1-2p}{\sqrt{pq}}$. When we take the standardized Bernoulli distributed random variable $\frac{X-\operatorname{E}[X]}{\sqrt{\operatorname{Var}[X]}}$ we find that this random variable attains $\frac{q}{\sqrt{pq}}$ with probability $p$ and attains $-\frac{p}{\sqrt{pq}}$ with probability $q$. Thus we get
$$\begin{aligned}\gamma _{1}&=\operatorname {E} \left[\left({\frac {X-\operatorname {E} [X]}{\sqrt {\operatorname {Var} [X]}}}\right)^{3}\right]\\&=p\cdot \left({\frac {q}{\sqrt {pq}}}\right)^{3}+q\cdot \left(-{\frac {p}{\sqrt {pq}}}\right)^{3}\\&={\frac {1}{{\sqrt {pq}}^{3}}}\left(pq^{3}-qp^{3}\right)\\&={\frac {pq}{{\sqrt {pq}}^{3}}}(q^{2}-p^{2})\\&={\frac {(1-p)^{2}-p^{2}}{\sqrt {pq}}}\\&={\frac {1-2p}{\sqrt {pq}}}={\frac {q-p}{\sqrt {pq}}}.\end{aligned}$$
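The closed form can be checked against a direct computation of the third moment of the standardized variable, summing over the two outcomes (pure Python, illustrative helper names):

```python
import math

def skewness_direct(p: float) -> float:
    """E[((X - E[X]) / sd)^3], summed over the outcomes 0 (prob q) and 1 (prob p)."""
    q = 1 - p
    mean, sd = p, math.sqrt(p * q)
    return q * ((0 - mean) / sd) ** 3 + p * ((1 - mean) / sd) ** 3

def skewness_formula(p: float) -> float:
    """Closed form (q - p) / sqrt(p*q)."""
    q = 1 - p
    return (q - p) / math.sqrt(p * q)

for p in (0.1, 0.3, 0.5, 0.9):
    print(p, math.isclose(skewness_direct(p), skewness_formula(p)))   # True for each p
```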
Higher moments and cumulants
The raw moments are all equal because $1^{k}=1$ and $0^{k}=0$:
$$\operatorname {E} [X^{k}]=\Pr(X{=}1)\cdot 1^{k}+\Pr(X{=}0)\cdot 0^{k}=p\cdot 1+q\cdot 0=p=\operatorname {E} [X].$$
The central moment of order $k$ is given by
$$\mu _{k}=(1-p)(-p)^{k}+p(1-p)^{k}.$$
The first six central moments are
$$\begin{aligned}\mu _{1}&=0,\\\mu _{2}&=p(1-p),\\\mu _{3}&=p(1-p)(1-2p),\\\mu _{4}&=p(1-p)(1-3p(1-p)),\\\mu _{5}&=p(1-p)(1-2p)(1-2p(1-p)),\\\mu _{6}&=p(1-p)(1-5p(1-p)(1-p(1-p))).\end{aligned}$$
The higher central moments can be expressed more compactly in terms of $\mu _{2}$ and $\mu _{3}$:
$$\begin{aligned}\mu _{4}&=\mu _{2}(1-3\mu _{2}),\\\mu _{5}&=\mu _{3}(1-2\mu _{2}),\\\mu _{6}&=\mu _{2}(1-5\mu _{2}(1-\mu _{2})).\end{aligned}$$
The first six cumulants are
$$\begin{aligned}\kappa _{1}&=p,\\\kappa _{2}&=\mu _{2},\\\kappa _{3}&=\mu _{3},\\\kappa _{4}&=\mu _{2}(1-6\mu _{2}),\\\kappa _{5}&=\mu _{3}(1-12\mu _{2}),\\\kappa _{6}&=\mu _{2}(1-30\mu _{2}(1-4\mu _{2})).\end{aligned}$$
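The compact relations for $\mu_4$, $\mu_5$ and $\mu_6$ can be verified numerically against the defining formula $\mu_k=(1-p)(-p)^{k}+p(1-p)^{k}$; a short sketch with an arbitrary test value of $p$ (helper name is illustrative):

```python
def central_moment(p: float, k: int) -> float:
    """mu_k = (1 - p)*(-p)**k + p*(1 - p)**k, the k-th central moment of Bernoulli(p)."""
    q = 1 - p
    return q * (-p) ** k + p * q ** k

p = 0.3
m2, m3 = central_moment(p, 2), central_moment(p, 3)
checks = {
    "mu4": (central_moment(p, 4), m2 * (1 - 3 * m2)),
    "mu5": (central_moment(p, 5), m3 * (1 - 2 * m2)),
    "mu6": (central_moment(p, 6), m2 * (1 - 5 * m2 * (1 - m2))),
}
for name, (direct, compact) in checks.items():
    print(name, abs(direct - compact) < 1e-12)   # True, True, True
```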
Entropy and Fisher's Information
Entropy
Entropy is a measure of uncertainty or randomness in a probability distribution. For a Bernoulli random variable $X$ with success probability $p$ and failure probability $q=1-p$, the entropy $H(X)$ is defined as:
$$\begin{aligned}H(X)&=\mathbb {E} _{p}\ln {\frac {1}{\Pr(X)}}\\&=-\Pr(X{=}0)\ln \Pr(X{=}0)-\Pr(X{=}1)\ln \Pr(X{=}1)\\&=-(q\ln q+p\ln p).\end{aligned}$$
The entropy is maximized when $p=0.5$, indicating the highest level of uncertainty when both outcomes are equally likely. The entropy is zero when $p=0$ or $p=1$, where one outcome is certain.
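A minimal sketch of the entropy as a Python function, in nats and with the usual convention $0\ln 0=0$ (the helper name is illustrative):

```python
import math

def bernoulli_entropy(p: float) -> float:
    """H(X) = -q*ln(q) - p*ln(p) in nats, with the convention 0*ln(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0          # one outcome is certain, no uncertainty
    q = 1 - p
    return -(q * math.log(q) + p * math.log(p))

print(bernoulli_entropy(0.5))   # ln 2 ≈ 0.693, the maximum
print(bernoulli_entropy(0.9))   # ≈ 0.325
print(bernoulli_entropy(1.0))   # 0.0
```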
Fisher's Information
Fisher information measures the amount of information that an observable random variable $X$ carries about an unknown parameter $p$ upon which the probability of $X$ depends. For the Bernoulli distribution, the Fisher information with respect to the parameter $p$ is given by:
$$I(p)={\frac {1}{pq}}$$
Proof:
- The Likelihood Function for a Bernoulli random variable $X$ is: $L(p;X)=p^{X}(1-p)^{1-X}$. This represents the probability of observing $X$ given the parameter $p$.
- The Log-Likelihood Function is: $\ln L(p;X)=X\ln p+(1-X)\ln(1-p)$
- The Score Function (the first derivative of the log-likelihood with respect to $p$) is: ${\frac {\partial }{\partial p}}\ln L(p;X)={\frac {X}{p}}-{\frac {1-X}{1-p}}$
- The second derivative of the log-likelihood function is: ${\frac {\partial ^{2}}{\partial p^{2}}}\ln L(p;X)=-{\frac {X}{p^{2}}}-{\frac {1-X}{(1-p)^{2}}}$
- Fisher information is calculated as the negative expected value of the second derivative of the log-likelihood:
  $$I(p)=-\operatorname {E} \left[{\frac {\partial ^{2}}{\partial p^{2}}}\ln L(p;X)\right]=-\left(-{\frac {p}{p^{2}}}-{\frac {1-p}{(1-p)^{2}}}\right)={\frac {1}{p(1-p)}}={\frac {1}{pq}}$$
The Fisher information is minimized when $p=0.5$: the outcome of a single trial is then most uncertain, so each observation carries the least information about the parameter $p$. It grows without bound as $p$ approaches 0 or 1, where a single observation pins down the parameter almost completely.
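The closed form, and the fact that it is smallest at $p=0.5$, can be checked by taking the expectation of the negative second derivative directly over the two outcomes (illustrative helper name):

```python
def fisher_information(p: float) -> float:
    """-E[d^2/dp^2 ln L(p; X)], expectation over X = 1 (prob p) and X = 0 (prob q)."""
    q = 1 - p
    second_derivative = lambda x: -x / p ** 2 - (1 - x) / q ** 2
    return -(p * second_derivative(1) + q * second_derivative(0))

for p in (0.1, 0.5, 0.9):
    print(p, fisher_information(p), 1 / (p * (1 - p)))   # the two columns agree
# I(0.5) = 4 is the minimum; the information diverges as p approaches 0 or 1.
```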
Related distributions
- If $X_{1},\dots ,X_{n}$ are independent, identically distributed (i.i.d.) random variables, all Bernoulli trials with success probability $p$, then their sum is distributed according to a binomial distribution with parameters $n$ and $p$: $\sum _{k=1}^{n}X_{k}\sim \operatorname {B} (n,p)$ (binomial distribution);[3] a simulation sketch appears after this list.
- The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant number of discrete values.
- The Beta distribution is the conjugate prior of the Bernoulli distribution.[5]
- The geometric distribution models the number of independent and identical Bernoulli trials needed to get one success.
- If $Y\sim \mathrm {Bernoulli} \left({\tfrac {1}{2}}\right)$, then $2Y-1$ has a Rademacher distribution.
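A simulation sketch of the first relation in this list, assuming nothing beyond the Python standard library (values are approximate and depend on the seed):

```python
import random

random.seed(1)
n, p, reps = 20, 0.3, 100_000

# Each replication sums n independent Bernoulli(p) draws.
sums = [sum(1 if random.random() < p else 0 for _ in range(n)) for _ in range(reps)]

mean = sum(sums) / reps
var = sum((s - mean) ** 2 for s in sums) / reps
print(mean, n * p)             # sample mean ≈ 6.0, the Binomial(n, p) mean
print(var, n * p * (1 - p))    # sample variance ≈ 4.2, the Binomial(n, p) variance
```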
See also
- Bernoulli process, a random process consisting of a sequence of independent Bernoulli trials
- Bernoulli sampling
- Binary entropy function
- Binary decision diagram
References
- [1] Uspensky, James Victor (1937). Introduction to Mathematical Probability. New York: McGraw-Hill. p. 45. OCLC 996937.
- [2] Dekking, Frederik; Kraaikamp, Cornelis; Lopuhaä, Hendrik; Meester, Ludolf (9 October 2010). A Modern Introduction to Probability and Statistics (1st ed.). Springer London. pp. 43–48. ISBN 9781849969529.
- [3] Bertsekas, Dimitri P.; Tsitsiklis, John N. (2002). Introduction to Probability. Belmont, Mass.: Athena Scientific. ISBN 188652940X. OCLC 51441829.
- [4] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models (2nd ed.). Boca Raton: Chapman and Hall/CRC. Section 4.2.2. ISBN 0-412-31760-5.
- [5] Orloff, Jeremy; Bloom, Jonathan. "Conjugate priors: Beta and normal" (PDF). math.mit.edu. Retrieved October 20, 2023.
Further reading
- Johnson, Norman L.; Kotz, Samuel; Kemp, Adrienne W. (1993). Univariate Discrete Distributions (2nd ed.). Wiley. ISBN 0-471-54897-9.
- Peatman, John G. (1963). Introduction to Applied Statistics. New York: Harper & Row. pp. 162–171.
External links
[edit]- "Binomial distribution", Encyclopedia of Mathematics, EMS Press, 2001 [1994].
- Weisstein, Eric W. "Bernoulli Distribution". MathWorld.
- Interactive graphic: Univariate Distribution Relationships.