Y-Variable as a Yes/No Variable
• A bank is interested in determining whether certain borrowers are creditworthy: should loans be made to these potential customers? The bank has a variety of information about each potential borrower and would like a way to classify the individual as a qualifying or nonqualifying candidate.
• The director of personnel for a corporation would like to know which of a group of new trainees will be successful in a certain position at the company. Thus, the director wants to classify trainees into two groups: those who will succeed and those who will not.
• A business wants to classify customers as returning or nonreturning.
• A university wishes to study the characteristics of students who have graduated with a degree and those who failed to graduate.
Layth C. Alwan, Ph.D.
• All these situations share a binary response variable (Y variable). Popular examples of binary response outcomes are success/failure, win/lose, buy/don’t buy, satisfied/not satisfied, default/don’t default, and survive/die.
• Ordinary regression assumes that the Y variable is continuous. We will see in this section that ordinary least-squares regression is not well suited to modeling binary outcomes. We will explore an alternative method, known as logistic regression, which is designed for binary outcomes.
File: employee
The fitted equation says: Predicted Group = −5.360352 + 0.0663426(Test).
Does this make sense? Suppose someone had a test score of 90. The equation predicts −5.360352 + 0.0663426(90) = 0.6105.
What would you guess is the chance that the employee will be satisfactory in performance? It seems like 61.05%. This implies that the predictive model is modeling the probability of being satisfactory.
Problems…
• What is the probability a person will be successful if he/she scores 80 on the test? Answer: −0.0529.
This is impossible: a probability cannot be negative.
• What is the probability a person will be successful if he/she scores 97 on the test? Answer: 1.0749.
This is also impossible: a probability cannot exceed 1.
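The arithmetic behind these answers can be checked with a short sketch. It uses only the coefficients reported in the fitted equation; the function name `predicted_group` is chosen here for illustration and is not from the original output.

```python
# Coefficients as reported in the fitted equation above.
INTERCEPT = -5.360352
SLOPE = 0.0663426


def predicted_group(test_score: float) -> float:
    """Fitted value from the ordinary least-squares line."""
    return INTERCEPT + SLOPE * test_score


for score in (80, 90, 97):
    value = predicted_group(score)
    # A valid probability must lie between 0 and 1 inclusive.
    is_valid = 0.0 <= value <= 1.0
    print(f"Test = {score}: predicted = {value:.4f}, valid probability: {is_valid}")
```

Only the score of 90 yields a value that can be read as a probability; the scores of 80 and 97 fall outside the interval from 0 to 1.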
[Figure: Group (vertical axis, 0.00 to 1.25) versus Test (horizontal axis, 80 to 100).]
We might try to fix this problem just by setting all negative probabilities to 0 and all probabilities greater than 1 to 1.
This avoids one problem, but it can’t really be correct. Being satisfactory can’t be absolutely certain (p = 1) or absolutely impossible (p = 0) based only on test score.
Intuitively, real-world probabilities are most likely to change smoothly with the test score. There is no reason to believe that there are sharp changes at certain score values (i.e., at the corners where the truncated line meets 0 and 1).
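A minimal sketch of this truncation idea, using the same fitted coefficients, shows where the corners fall. The names `clipped_prediction`, `lower_corner`, and `upper_corner` are introduced here for illustration only.

```python
# Coefficients as reported in the fitted equation.
INTERCEPT = -5.360352
SLOPE = 0.0663426


def linear_prediction(test_score: float) -> float:
    """Unrestricted fitted value from the least-squares line."""
    return INTERCEPT + SLOPE * test_score


def clipped_prediction(test_score: float) -> float:
    """Fitted value forced into the interval [0, 1]."""
    return min(1.0, max(0.0, linear_prediction(test_score)))


# Test scores at which the straight line crosses 0 and 1.
lower_corner = (0.0 - INTERCEPT) / SLOPE
upper_corner = (1.0 - INTERCEPT) / SLOPE

for score in (80, 90, 97):
    print(f"Test = {score}: clipped prediction = {clipped_prediction(score):.4f}")
print(f"Corners at Test = {lower_corner:.2f} and Test = {upper_corner:.2f}")
```

With these coefficients the corners sit at test scores of roughly 80.8 and 95.9. Below the first the truncated model declares success impossible, and above the second it declares success certain, which is the abrupt behavior the slide objects to.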
This is what we are striving for…
[Figure: Y-Data (vertical axis, 0.0 to 1.0) versus Test (horizontal axis, 80 to 100).]
There are many ways to get a smooth curve. One of the most common is the logistic curve. The regression based on this curve is called logistic regression.
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$
The ratio p/(1 − p) is the odds in favor of a success. For example, when the probability of success is p = 1/3, we get a ratio of (1/3)/(2/3) = 1/2. We say that the odds in favor of success are 1:2, meaning that for every 2 failures we expect 1 success. Logistic regression models the logarithm of the odds as a linear function of x.
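The odds calculation can be sketched in a few lines. The helper names `odds` and `log_odds` are illustrative choices, not part of any particular library.

```python
import math


def odds(p: float) -> float:
    """Odds in favor of success for a success probability p (0 < p < 1)."""
    return p / (1.0 - p)


def log_odds(p: float) -> float:
    """Natural logarithm of the odds, the quantity logistic regression models."""
    return math.log(odds(p))


for p in (1 / 3, 0.5, 0.9):
    print(f"p = {p:.4f}: odds = {odds(p):.4f}, log-odds = {log_odds(p):.4f}")
```

A probability of 1/3 gives odds of 0.5 (that is, 1:2), and a probability of 0.5 gives even odds of 1 with a log-odds of 0. Probabilities below 0.5 give negative log-odds and probabilities above 0.5 give positive log-odds, so the log-odds can take any real value, which is what allows it to be modeled as a straight line in x.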
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

$$\hat{p} = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$
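The step from the log-odds form to the expression for p is not spelled out on the slide. A sketch of the algebra, using only the symbols already defined, is to exponentiate both sides and solve for p:

```latex
\begin{aligned}
\ln\left(\frac{p}{1-p}\right) &= \beta_0 + \beta_1 x \\
\frac{p}{1-p} &= e^{\beta_0 + \beta_1 x} \\
p &= (1-p)\, e^{\beta_0 + \beta_1 x} \\
p \left(1 + e^{\beta_0 + \beta_1 x}\right) &= e^{\beta_0 + \beta_1 x} \\
p &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
   = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\end{aligned}
```

The last equality divides the numerator and denominator by $e^{\beta_0 + \beta_1 x}$. Because the exponential is always positive, the resulting p always lies strictly between 0 and 1, unlike the straight-line predictions.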
Maximum Likelihood Estimation
• Define $f(Y)$ as the probability of observing a Y outcome (0 or 1) given the X observation.

$$f(1) = p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

$$f(0) = 1 - p = 1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
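As a rough illustration, the two cases can be written as a single function. The coefficient values below (`beta0 = -45.0`, `beta1 = 0.5`) are hypothetical, picked only so that a score of 90 gives p = 0.5; they are not estimates from the employee data.

```python
import math


def success_probability(x: float, beta0: float, beta1: float) -> float:
    """Logistic probability p of observing Y = 1 at the given x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def f(y: int, x: float, beta0: float, beta1: float) -> float:
    """Probability of observing outcome y (0 or 1) at the given x."""
    p = success_probability(x, beta0, beta1)
    # Equivalent to the two cases f(1) = p and f(0) = 1 - p.
    return p**y * (1.0 - p) ** (1 - y)


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -45.0, 0.5

for x in (85, 90, 95):
    print(f"x = {x}: f(1) = {f(1, x, beta0, beta1):.4f}, f(0) = {f(0, x, beta0, beta1):.4f}")
```

For any x the two values f(1) and f(0) sum to 1, as they must for a binary outcome.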
• Assuming independence, the probability of observing a set of Y outcomes would be the product of the individual probabilities.
• For example, if the probability of observing a Y = 1 is 0.6, then the probability of observing the