[box type="info" align="" class="" width=""]We bring to you another guest post by Benjamin Rojogan on Logistic regression to aid healthcare sector in reducing patient readmission. Ben's previous post on ensemble methods to optimize machine learning models is also available for a quick read here.[/box]
ER visits are not cheap for any party involved. Whether this be the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are due to lack of intervention for problems such as substance abuse, chronic diseases and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by playing a role in the overflowing of Emergency Departments (EDs).
Research teams at UW and other universities are partnering with companies like Kensci to figure out how to approach the problem of reducing readmission rates. The ability to predict the likelihood of a patient’s readmission will allow for targeted intervention which in turn will help reduce the frequency of readmissions. Thus making the population healthier and hopefully reducing the estimated 41.3 billion USD healthcare costs for the entire system.
A plethora of algorithms are available for data scientists to use to approach this problem. Many possible variables could affect the readmission and medical costs. Also, there are also many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have been heavily focused on reducing the readmission rate simply by trying to calculate whether a person would or would not be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have random flare ups that require immediate attention.
Being able to predict if a patient will have an ER visit can lead to managing the cause more effectively.
One approach taken by the data science team at UW as well as the Department of Family and Community Medicine at the University of Toronto was to utilize logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such logistic regression has been a useful model in my experience to approach this problem.
Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other data algorithms?
Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, etc. It is used for analyzing data that produces a binary classification based on one or many independent variables. This means, it produces two clear classifications (Yes or No, 1 or 0, etc).
With the example above, the binary classification would be: is the patient readmitted or not? Other examples of this could be whether to give a customer a loan or not, whether a medical claim is fraud or not, whether a patient has diabetes or not.
Despite its name, logistic regression does not provide the same output like linear regression (per se). There are some similarities, for instance, the linear model is somewhat consistent as you might notice in the equation below where you see what is very similar to a linear equation. But the final output is based on the log odds.
Linear regression and multivariate regression both take one to many independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person’s age or the cost of a product an e-commerce should display to each customer. The output is not limited to only a few discrete classifications.
Whereas logistic regression produces discrete classifiers. For instance, an algorithm using logistic regression could be used to classify whether or not a certain stock price would be either >$50 a share or <$50 a share. Linear regression would be used to predict if a stock share would be worth $50.01, $50.02….etc.
Logistic regression is a calculation that uses the odds of a certain classification.
In the equation above, the symbol you might know as pi actually represents the odds or probability.
To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear classifier, a boundary that when the coefficients β0 + x · β has a p value that is p < 0.5 then Y = 0.
By generating coefficients that help predict the logit transformation, the method allows to classify for the characteristic of interest.
Now that is a lot of complex math mumbo jumbo. Let’s try to break it down into simpler terms.
Let’s start with probability. Let’s say a patient has the probability of 0.6 of being readmitted. Then the probability that the patient won’t be readmitted is .4.
Now, we want to take this and convert it into odds. This is what the formula above is doing.
You would take .6/.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability was .5 for both being readmitted and not being readmitted, then the odds would be 1:1.
Now the next step in the logistic regression model would be to take the odds and get the “Log odds”. You do this by taking the 1.5 and put it into the log portion of the equation. Now you will get .18(rounded).
In logistic regression, we don’t actually know p. That is what we are trying to essentially find and model using various coefficients and input variables. Each input provides a value that changes how much more likely an event will or will not occur.
All of these coefficients are used to calculate the log odds. This model can take multiple variables like age, sex, height, etc. and specify how much of an effect each variable has on the odds an event will occur.
Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer and translate it into action. Some of us like to say the “computers” are the easy part. Personally I find the hard part to be the “people”. After all, at the end of the day, it comes down to business value. Will an algorithm save money or not?
That means it has to be applied in real life. This could take the form of a new initiative, strategy, product recommendation, etc. You need to find the outliers that are worth going after!
For instance, if we go back to the patient readmission example again. The algorithm points out patients with high probabilities of being readmitted. However if the readmission costs are low, they will probably be ignored..sadly. That is how businesses (including hospitals) look at problems.
Logistic regression is a great tool for binary classification. It is unlike many other algorithms that estimate continuous variables or estimate distributions. This statistical method can be utilized to classify whether a person will be likely to get cancer because of environmental variables like proximity to a highway, smoking habits, etc? This method has been used effectively in the medical, financial and insurance industry successfully for a while. Knowing when to use what algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees.
Using logistic regression provides the opportunity for healthcare institutions to accurately target at risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals.
[box type="shadow" align="" class="" width=""]
Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems both solo as well as with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/box]