I had to adjust my thinking when it came to logistic regression because it models a probability rather than a mean, and it involves a non-linear transformation. In this article, I will explain the log odds interpretation of logistic regression mathematically, and also run a simple logistic regression model on real data.
The odds of an event are the probability that it happens divided by the probability that it doesn’t. For example, if P(success) = 0.8 and P(failure) = 0.2, the odds of success are 0.8/0.2 = 4.
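As a quick illustration of the arithmetic, here is a minimal sketch that converts between probabilities, odds, and log odds. The function names are my own, chosen for illustration.

```python
import math

def odds(p):
    """Odds of an event with probability p, for 0 <= p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds, for 0 < p < 1."""
    return math.log(odds(p))

def prob_from_log_odds(z):
    """Invert log_odds: map any real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

print(odds(0.8))                          # approximately 4
print(prob_from_log_odds(log_odds(0.8)))  # recovers approximately 0.8
```

Note that probabilities live in (0, 1), odds in (0, ∞), and log odds on the whole real line, which is what makes log odds a convenient scale for a linear model.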
We use logistic regression to model a binary outcome variable (y is either 0 or 1). As with a Bernoulli random variable, we want to consider 𝑃(𝑦=1|𝑥) and 𝑃(𝑦=0|𝑥). Given the generalized linear model specification 𝐸(𝑦|𝑥)=𝑓(𝑥′𝛽), we can derive the conditional mean for a binary outcome to be:
𝐸(𝑦|𝑥)=𝑃 (𝑦=1|𝑥)×1+𝑃 (𝑦=0|𝑥)×0=𝑃 (𝑦=1|𝑥)
Therefore, the expectation we are modeling is a probability: 𝑃(𝑦=1|𝑥). However, if we model it directly as a linear combination of the independent variables and parameters, 𝑃(𝑦=1|𝑥)=𝑥′𝛽, it does not work: 𝑥′𝛽 can take any real value, while a probability must lie between 0 and 1.
Therefore, we choose a link function 𝑓(𝑥′𝛽) that gives values between zero and one. For logistic regression, we use the logit link, under which the probability is the logistic function of 𝑥′𝛽:
𝑃(𝑦=1|𝑥)=exp(𝑥′𝛽)/(1+exp(𝑥′𝛽))
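The log odds interpretation comes from solving this relationship for 𝑥′𝛽. Here is a short derivation sketch, assuming the standard logistic form and writing 𝑝 as shorthand for 𝑃(𝑦=1|𝑥). The shorthand is introduced here only for brevity.

```latex
\begin{aligned}
p &= \frac{e^{x'\beta}}{1 + e^{x'\beta}}
  &&\Rightarrow\quad 1 - p = \frac{1}{1 + e^{x'\beta}} \\
\frac{p}{1 - p} &= e^{x'\beta}
  &&\text{(the odds)} \\
\log\frac{p}{1 - p} &= x'\beta
  &&\text{(the log odds)}
\end{aligned}
```

So a one-unit increase in a variable 𝑥ⱼ, holding the others fixed, adds 𝛽ⱼ to the log odds, which multiplies the odds by exp(𝛽ⱼ).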
Because the underlying model is a probability, we use maximum likelihood estimation for logistic regression. The likelihood function has the same form as for a Bernoulli random variable: each observation contributes 𝑃(𝑦=1|𝑥) if 𝑦=1 and 𝑃(𝑦=0|𝑥) if 𝑦=0.
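To make the likelihood concrete, here is a minimal sketch of the Bernoulli log-likelihood in plain Python. The function names and the six-row toy dataset are made up for illustration. The code uses the identity 𝑦·log 𝑝 + (1−𝑦)·log(1−𝑝) = 𝑦·𝑧 − log(1+exp(𝑧)) with 𝑧=𝑥′𝛽, which follows from the logistic form and avoids taking the log of a probability that has rounded to zero.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of the coefficients beta.

    X is a list of rows, each starting with 1.0 for the intercept;
    y is a list of 0/1 outcomes.
    """
    total = 0.0
    for row, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))  # linear predictor x'beta
        total += yi * z - softplus(z)
    return total

# Toy data (hypothetical): intercept column plus one binary feature.
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
y = [0, 0, 1, 1, 1, 0]

print(log_likelihood([0.0, 0.0], X, y))  # every p is 0.5, so this is 6 * log(0.5)
print(log_likelihood([-math.log(2), 2 * math.log(2)], X, y))  # a better fit
```

Maximum likelihood estimation searches for the 𝛽 that makes this quantity as large as possible. In the toy data the second coefficient vector matches the observed spam rates in each group (1/3 and 2/3), so its log-likelihood is higher than that of the all-zero vector.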
Even though logistic regression is mainly used for classification and prediction in machine learning, for the sake of completing this article about using the log odds to interpret logistic regression, I ran a simple logistic regression in Python to get a sense of what the results look like. I used a dataset that contains 4,000+ emails. It includes 57 variables, which are features that may indicate whether an email is spam; for example, the word_free variable indicates whether the email contains the keyword “free”. The spam variable is a binary variable showing whether each email has been tagged as spam or not. I fit the model with Python’s statsmodels library and printed out the results.
The coefficient of word_free is 1.55. Thus, holding the other variables fixed, the odds that an email is spam are multiplied by exp(1.55) ≈ 4.7, that is, they increase almost 5 times, if that email contains the word “free”. It’s worth noting that the statsmodels summary table also provides p-values and 95% confidence intervals. Out of curiosity, I would like to dig deeper into how Python’s statsmodels library computes the standard errors and p-values. Perhaps I will write another article about it later :).
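The conversion from a coefficient to an odds multiplier is easy to check by hand. Here is a small sketch: the coefficient 1.55 comes from the text, while the 10% baseline spam probability and the helper name are hypothetical, picked only to show how the same odds multiplier translates into a change in probability.

```python
import math

coef_word_free = 1.55  # coefficient reported in the text

# A coefficient on the log odds scale multiplies the odds by exp(coefficient).
odds_ratio = math.exp(coef_word_free)
print(round(odds_ratio, 2))  # 4.71

def shift_probability(p_baseline, coef):
    """Probability after adding `coef` to the log odds of p_baseline."""
    baseline_odds = p_baseline / (1.0 - p_baseline)
    new_odds = baseline_odds * math.exp(coef)
    return new_odds / (1.0 + new_odds)

# Hypothetical baseline: an email with a 10% chance of being spam.
print(shift_probability(0.10, coef_word_free))
```

The odds are multiplied by about 4.7 regardless of the baseline, but the change in probability depends on where you start: from a 10% baseline the probability rises to roughly 34%, not to 47%.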
Lastly, I will quickly go over the key assumptions of logistic regression, as I did for OLS linear regression in another article. Needless to say, knowing the key assumptions underlying the method is important. Also, “assumptions about logistic and linear regression” has been one of the top questions in data scientist interviews.
- The outcome is a binary or dichotomous variable
- Linearity of independent variables and log odds (we already derived log(𝑃/(1−𝑃)) = 𝑥′𝛽 in the section above)
- No perfect collinearity among covariates
- The observations are independent of each other
Reference: Taddy, Matt. Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions, August 21, 2019.