Description
Homework 1 ISyE 6420 Circuit
1. Circuit.
Figure 1: Components E1, . . . , E5 at operational at time t with probabilities
e
−t
, e−2t
, e−t/2
, e−t/3 and e
−t
, respectively.
The system S consists of five independent elements Ei
, i = 1, . . . , 5, connected as in
Figure 1. Probability that the element Ei
is operational at the end of time interval [0, t] is
given as
pi(t) = e
−λit
, t ≥ 0,
for λ1 = 1, λ2 = 2, λ3 = 1/2, λ4 = 1/3, and λ5 = 1.
(a) Find the probability that the system S will be operational at time t. Plot this
probability as a function of time t. What is this probability for t = 1/2.
(b) Find the probability that component E5 was operational at time t = 1/2, if the
system was operational at that time.
Hint: If you consider (b), it is conditional probability, more precisely, a posterior probability
of the hypothesis H1 : E5 operational at time t, given that system S is operational at t. Thus,
solve part (a) as a total probability with H1 and H2 = Hc
1 as hypotheses. Under the two
hypotheses the system simplifies as in Figure 2 and it is easy to find P(S|H1) and P(S|H2).
Then (b) is just a Bayes formula. The results for arbitrary t will be messy – do not simplify.
For plotting in part (a) take some reasonable interval for time t, say [0, 4].
Figure 2: Left: System under hypothesis H1 : E5 operational; Right: System under hypothesis H2 : E5 not operational.
2. Two Batches.
There are two batches of the same product. In one batch all products
are conforming. The other batch contains 20% non-conforming products. A batch is selected
at random and one randomly selected product from that batch is inspected. The inspected
product was found conforming and was returned back to its batch.
What is the probability that the second product, randomly selected from the same batch,
is found non-conforming?
Hint. This problem uses both Bayes’ rule and Total Probability. The two hypotheses
concern the type of batch. For the first draw the hypotheses are equally likely (the batch is
selected at random), but for the second draw, the probabilities of hypotheses are updated
by the information on the result of the first draw via Bayes rule. Updated probabilities of
hypotheses are then used in the Total Probability Formula for the second draw.
3. Classifier.
In a machine learning classification procedure the items are classified as 1 or
0. Based on a training sample of size 120 in which there are 65 1’s and 55 0’s, the classifier
predicts 70 1’s and 50 0’s. Out of 70 items predicted by the classifier as 1, 52 are correctly
classified.
From the population of items where the proportion of 0-labels is 99% (and 1-labels 1%),
an item is selected at random. What is the probability that the item is of label 1, if the
classifier says it was.
Hint. Think about the following interpretation. If 1 is a specific disease present, 0 no
disease present, and the classifier is a medical test for the disease, then you are asked to
2
find a positive predictive value of a test for a subject coming from population where the
prevalence of the disease is 1%.
3
Homework 2 ISyE 6420 Cell Clusters in 3D Petri Dishes
1. Cell Clusters in 3D Petri Dishes.
The number of cell clusters in a 3D Petri dish
has a Poisson distribution with mean λ = 5. Find the percentage of Petri dishes that have
(a) 0 clusters, (b) at least one cluster, (c) more than 8 clusters, and (d) between 4 and 6
clusters inclusive.
2. Silver-Coated Nylon Fiber.
Silver-coated nylon fiber is used in hospitals for its
anti-static electricity properties, as well as for antibacterial and antimycotic effects. In the
production of silver-coated nylon fibers, the extrusion process is interrupted from time to
time by blockages occurring in the extrusion dyes. The time in hours between blockages, T,
has an exponential E(1/10) distribution, where 1/10 is the rate parameter.
Find the probabilities that
(a) a run continues for at least 10 hours,
(b) a run lasts less than 15 hours, and
(c) a run continues for at least 20 hours, given that it has lasted 10 hours.
If you use software, be careful about the parametrization of exponentials.
3. 2-D Density Tasks.
If
f(x, y) = (
λ
2
e
−λy
, 0 ≤ x ≤ y, λ > 0
0, else
Show that:
(a) marginal distribution fX(x) is exponential E(λ).
(b) marginal distribution fY (y) is Gamma Ga(2, λ).
(c) conditional distribution f(y|x) is shifted exponential, f(y|x) = λe−λ(y−x)
, y ≥ x.
(d) conditional distribution f(x|y) is uniform U(0, y).
4. Nylon Fiber Continued.
In the Exercise 2, the times (in hours) between blockages
of the extrusion process, T, had an exponential E(λ) distribution. Suppose that the rate
parameter λ is unknown, but there are three measurements of interblockage times, T1 = 3,
T2 = 13, and T3 = 8.
(a) How would classical statistician estimate λ?
(b) What is the Bayes estimator of λ if the prior is π(λ) = √
1
λ
, λ > 0.
Hint: In (b) the prior is not a proper distribution, but the posterior is. Identify the posterior
from the product of the likelihood from (a) and the prior, no need to integrate.
2
Homework 3 ISyE 6420 Estimating the Precision
Estimating the Precision Parameter of a Rayleigh Distribution.
If two random
variables X and Y are independent of each other and normally distributed with variances
equal to σ
2
, then the variable R =
√
X2 + Y
2
follows the Rayleigh distribution.
Parameterized with precision parameter ξ =
1
σ2 , the Rayleigh random variable R has a density
f(r) = ξr exp (
−
ξr2
2
)
, r ≥ 0, ξ > 0.
An example of such random variable would be the distance of darts from the target center
in a dart-throwing game where the deviations in the two dimensions of the target plane are
independent and normally distributed.
(a) Assume that the prior on ξ is exponential with the rate parameter λ. Show that the
posterior is gamma Ga
2, λ +
r
2
2
.
(b) Assume that R1 = 3, R2 = 4, R3 = 2, and R4 = 5 are Rayleigh-distributed random
observations representing the distance of a dart from the center. Find the posterior in this
case for the same prior form (a), and give a Bayesian estimate of ξ.
(c) For λ = 1, numerically find 95% Credible Set for ξ .
Hint: In (b) show that if r1, r2, . . . , rn are observed, and the prior on ξ is exponential E(λ),
then the posterior is gamma Ga
n + 1, λ +
1
2
Pn
i=1 r
2
i
.
2. Estimating Chemotherapy Response Rates.
An oncologist believes that 90% of
cancer patients will respond to a new chemotherapy treatment and that it is unlikely that this
proportion will be below 80%. Elicit a beta prior on proportion that models the oncologist’s
beliefs.
Hint: For elicitation of the prior use µ = 0.9, µ − 2σ = 0.8 and expressions for µ and σ
for beta.
During a trial, in 30 patients treated, 22 responded.
(a) What are the likelihood and posterior distributions? What is the Bayes estimator of
the proportion?
(b) Using Octave, R, or Python, find 95% Credible Set for p.
(c) Using Octave, R, or Python, test the hypothesis H0 : p ≥ 4/5 against the alternative
H1 : p < 4/5.
(d) Using WinBUGS, find the Bayes estimator and Credible Set and conduct the test.
Compare WinBUGS results with (a-c).
2
Homework 4 ISyE 6420 The Bounded Normal Mean
1. Metropolis: The Bounded Normal Mean.
Suppose that we have information
that the normal mean θ is bounded between −m and m, for some known number m. In this
case it is natural to elicit a prior on θ with the support on interval [−m, m].
A prior with interesting theoretical properties supported on [−m, m] is Bickel-Levit prior1
,
π(θ) = 1
m
cos2
πθ
2m
!
, − m ≤ θ ≤ m.
Assume that a sample [−2, −3, 4, −7, 0, 4] is observed from normal distribution
f(y|θ) ∝
√
τ exp
−
τ
2
(y − θ)
2
,
with a known precision τ = 1/4. Assume also that the prior on θ is Bickel-Levit, with m = 2.
This combination likelihood/prior does not result in an explicit posterior (in terms of
elementary functions). Construct a Metropolis algorithm that will sample from the posterior
of θ.
(a) Simulate 10,000 observations from the posterior, after discarding first 500 observations
(burn-in), and plot the histogram of the posterior.
(b) Find Bayes estimator of θ, and 95% equitailed Credible Set based on the simulated
observations.
Suggestions:
(i) Take uniform distribution on [−m, m] as a proposal distribution since it is easy to
sample from. This is an independence proposal, the proposed θ
′ does not depend on the
current value of the chain, θ.
1When m is large, this prior is an approximation of the least favorable distribution in a Bayes-Minimax
problem with class of priors limited to symmetric and unimodal distributions.
(ii) You will need to calculate Pn
i=1(yi − θ)
2
for current θ and Pn
i=1(yi − θ
′
)
2
for the
proposed θ
′
, prior to calculating the Metropolis ratio.
Gibbs Sampler and High/Low Protein Diet in Rats. Armitage and Berry (1994,
p. 111)2
report data on the weight gain of 19 female rats between 28 and 84 days after birth.
The rats were placed in randomized manner on diets with high (12 animals) and low (7
animals) protein content.
High protein Low protein
134 70
146 118
104 101
119 85
124 107
161 132
107 94
83
113
129
97
123
We want to test the hypothesis on dietary effect: Did a low protein diet result in a
significantly lower weight gain?
The classical t test against the one sided alternative will be significant at 5% significance
level, but we will not go in there. We will do the test Bayesian way using Gibbs sampler.
Assume that high-protein diet measurements y1i
, i = 1, . . . , 12 are coming from normal
distribution N (θ1, 1/τ1), where τ1 is the precision parameter,
f(y1i
|θ1, τ1) ∝ τ
1/2
1
exp
−
τ1
2
(y1i − θ1)
2
, i = 1, . . . , 12.
The low-protein diet measurements y2i
, i = 1, . . . , 7 are coming from normal distribution
N (θ2, 1/τ2),
f(y2i
|θ2, τ2) ∝ τ
1/2
2
exp
−
τ2
2
(y2i − θ2)
2
, i = 1, . . . , 7.
Assume that θ1 and θ2 have normal priors N (θ10, 1/τ10) and N (θ20, 1/τ20), respectively. Take
prior means as θ10 = θ20 = 110 (apriori no preference) and precisions as τ10 = τ20 = 1/100.
Assume that τ1 and τ2 have the gamma Ga(a1, b1) and Ga(a2, b2) priors with shapes
a1 = a2 = 0.01 and rates b1 = b2 = 4.
(a) Construct Gibbs sampler that will sample θ1, τ1, θ2, and τ2 from their posteriors.
2Armitage, P. and Berry, G. (1994). Statistical Methods in Medical Research (3rd edition). Blackwell
2
(b) Find sample differences θ1 − θ2 . Proportion of positive differences approximates the
posterior probability of hypothesis H0 : θ1 > θ2. What is this proportion if the number of
simulations is 10,000, with burn-in of 500?
(c) Using sample quantiles find the 95% equitailed credible set for θ1 − θ2. Does this set
contain 0?
Hint: No WinBUGS should be used (except maybe to check your results). Use Octave
(MATLAB), or R, or Python here. You may want to consult Handout GIBBS.pdf from the
course web repository.
3
Homework 5 ISyE 6420 Cross-validating
1. Cross-validating a Bayesian Regression.
In this exercise covariates x1 and x2 are
simulated as1
x1 = rand(1, 40); x2 = floor(10 * rand(1,40)) + 1;
and the response variable y is obtained as
y = 2 + 6 * x1 – 0.5 * x2 + 0.8*randn(size(x1));
Write a WinBUGS program that takes 20 triples (x1, x2, y) to train the linear regression
model ˆy = b0 + b1x1 + b2x2 and then uses the remaining 20 triples to evaluate the model
by comparing the original responses yi
, i = 21, . . . , 40, with regression-predicted values
yˆi
, i = 21, . . . , 40.
The comparison would involve calculating the MSE, the mean of (yi −
yˆi)
2
, i = 21, . . . , 40.
This is an example of how a cross-validation methodology is often employed to assess
statistical models.
How do the Bayesian estimators of β0, β1, β2, and σ compare to the “true” values 2, 6,
−0.5, and 0.8?
2. Body Fat from Linear Regression.
Excess adiposity is a risk factor for a range
of diseases, leading to increased morbidity and mortality. Body fat (BF) can be measured
1Or in Python or R:
import numpy as np
x1 = np.random.uniform(0,1, 40)
x2 = np.floor(10 * np.random.uniform(0,1,40)) + 1
y = 2 + 6 * x1 – 0.5 * x2 + 0.8*np.random.normal(0,1,len(x1))
====
x1 <- runif(40)
x2 <- floor(10 * runif(40)) + 1
y <- 2 + 6 * x1 – 0.5 * x2 + 0.8*rnorm(length(x1))
by several techniques such as skin-fold measurements bioelectrical impedance analysis (BIA)
and dual-energy X-ray absorptiometry (DEXA). Most of these techniques are not used in
the clinical practice or they are not adequate when large populations are considered.
Fuster-Para et al. (2015)2
compare several linear models for predicting the body fat (BF)
from Age, Body Mass Index (BMI), Body Adiposity Index (BAI) and Gender.
Data set RegBF.csv|xlsx provides data on Age (in years), Body Adiposity Index (BAI),
Body Mass Index (BMI), Body Fat (BF), and Gender (0 for males and 1 for females), of 3,200
adults from Mallorca (Spain). To save you some time a starter file BFReg.odc is provided.
Percentage of body fat mass was obtained by Tetrapolar Bioelectrical Impedance Analysis
(BIA) system (BF-350, Tanita Corp, Tokyo, Japan). The BAI is defined as
hip circumference in cm
(height in m)1.5 − 18.
We are interested in predicting BF from Age, BAI, BMI, Gender, and BB. BB is a new
variable defined as BB = BAI * BMI, and as such, describes the interaction between BAI and
BMI.
(a) Suggest two models: first with all predictors, and the second with single best predictor.
Explain how did you choose the best predictor.
(b) A new person is to be evaluated using the two models from (a). The covariates are:
Age = 35, BAI=26, BMI=20, Gender = 0, BB=520. What are the predicted BF’s from the
two models.
3. Shocks.
An experiment was conducted (Peter Lee, 2009; Dalziel et al., 1941) to assess
the effect of small electrical currents on farm animals, with the eventual goal of understanding
the effects of high-voltage power lines on livestock.
The experiment was carried out with
seven cows using six shock intensities, 0, 1, 2, 3, 4, and 5 milliamps (shocks on the order
of 15 milliamps are painful for many humans). Each cow was given 30 shocks, 5 at each
intensity, in random order.
The entire experiment was then repeated, so each cow received
a total of 60 shocks. For each shock the response, mouth movement, was either present or
absent. The data as quoted give the total number of responses, out of 70 trials, at each
shock level. We ignore cow differences and differences between blocks (experiments).
2Fuster-Parra, P., Bennasar-Veny, M., Tauler, P., Ya˜nez, A., L´opez-Gonz´alez, A. A., and Antoni Aguil´o,
A. (2015). A comparison between multiple regression models and CUN-BAE equation to predict body fat
in adults. PLOS One, DOI:10.1371/journal.pone.0122291.
2
Current Number of Number of Proportion of
(milliamps) x responses y trials n responses p
0 0 70 0.000
1 9 70 0.129
2 21 70 0.300
3 47 70 0.671
4 60 70 0.857
5 63 70 0.900
Using logistic regression and noninformative priors on its parameters, estimate the proportion of responses after a shock of 2.5 milliamps. Find 95% credible set for the population
proportion.
3
ISyE 6420 Homework 6 Cancer of Tongue
1. Cancer of Tongue.
Sickle-Santanello et al (1988)1 provide data on 80 males diagnosed
with cancer of the tongue. Data are provided in the file tongue.csv|dat|xlsx.
The variables
in the dataset are as follows:
• Tumor DNA profile (1 – aneuploid tumor, 2 – diploid tumor);
• Time to death or on-study time (in weeks); and
• Censoring indicator (0=censored, 1=observed)
Fit the regression with tumor profile as covariate. What is the 95% Credible Set for the
slope β1?
2. Airfreight Breakage with Missing Data.
A substance used in biological and medical
research is shipped by air freight to users in cartons of 2,000 ampules. The data below,
involving 15 shipments, were collected on the number of times a carton was transferred from
one aircraft to another over the shipment route (X) and the number of ampules found to be
broken upon arrival (Y ).
X 2 1 0 2 NA 3 1 0 1 2 3 0 1 NA NA
Y NA 16 9 17 12 22 13 8 NA 19 17 11 10 20 2
(a) Using OpenBUGS/WinBUGS, fit Y by Poisson regression, with X as a covariate.
Report the deviance of your fit.
(b) According to your model, how many packages on average are expected will be broken
if the number of shipment routes is X = 4? What is 95% CS for your estimate.
(c) For a particular shipment sent from Shenzhen you learned that it would involve X = 4
shipping routes. Predict the number of broken packages. What is here different from (b)?
(d) What are estimates for unobserved X5, X14, X15, Y1 and Y9?
Hint: Note that for missing X, you need to specify the distribution. It could be any nonnegative valued distribution, but since X is discrete, a good choice is Poisson(2), as in
for(i in 1:n){
x[i] ~ dpois(2)}
1Sickle-Santanello, B. J., Farrar, W. B., DeCenzo, J. F., Keyhani-Rofagha, S., Klein, J., Pearl, D.,
Laufman, H., and O’Toole R. V. (1988). Technical and statistical improvements for flow cytometric DNA
analysis of paraffin-embedded tissue. Cytometry, 9, 594–599.
1