Reject Inference: Your Data is Deceiving You

Keyword: Reject Inference
Suppose a bank provides a dataset with several attributes, including employment status, credit history, and property. Each customer in the sample is labeled according to whether they paid off their loan on time: those who did are “good customers”, and those who did not are “bad customers”.
If Rick, an employee of the bank, analyzes this dataset directly, what will happen?
Take one of these attributes as an example.

1 : unemployed
2 : skilled employee
3 : management / highly qualified employee / officer

Intuitively, which of these three groups should have the best credit? Most people would pick the second or the third. However, the data gives a different answer.

According to the data, the first group of customers looks “better” than the third group. After looking at the data, Rick might conclude that lending to unemployed people is safer than lending to highly qualified employees, officers, or managers. Is that correct? Let’s think about it a little.
Let’s review the process of collecting data:
1. A customer applies for a personal loan at Rick’s bank.
2. If the application is approved, go to step 3. Otherwise, the customer never becomes a data point in Rick’s dataset.
3. If the customer pays off the loan on time, they are labeled a “Good Customer”. Otherwise, they are labeled a “Bad Customer”.
The crucial step is step 2, which happens before any data is collected. In other words, the customers in Rick’s dataset have already been selected by the bank. Those who applied for a personal loan but were not approved are not in the dataset.
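The three steps can be sketched in a few lines of Python. Everything here is hypothetical: the `Applicant` fields, the `collect_dataset` function, and the four toy applicants are invented for illustration and are not the bank's actual process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Applicant:
    name: str
    approved: bool                   # step 2: the bank's decision
    repaid_on_time: Optional[bool]   # unknown (None) if no loan was granted


def collect_dataset(applicants):
    """Build (name, label) rows the way Rick's dataset is built."""
    rows = []
    for a in applicants:
        if not a.approved:
            # Step 2: rejected applicants never become data points.
            continue
        # Step 3: label the customer by repayment behaviour.
        label = "Good Customer" if a.repaid_on_time else "Bad Customer"
        rows.append((a.name, label))
    return rows


# Four made-up applicants; two of them are rejected.
applicants = [
    Applicant("A", approved=True, repaid_on_time=True),
    Applicant("B", approved=False, repaid_on_time=None),
    Applicant("C", approved=True, repaid_on_time=False),
    Applicant("D", approved=False, repaid_on_time=None),
]

dataset = collect_dataset(applicants)
print(dataset)
```

Applicants B and D applied just like A and C did, yet they leave no trace in `dataset`. Whatever Rick computes from it describes approved customers only.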
Here I would like to ask you a question: which carries the greater risk, jumping from the 4th floor or the 70th floor? (Please do not try it; it is just an example.) You may reply immediately: “The 70th floor, of course!”

You are wrong. I am not asking about the probability of death; I am asking about risk. Suppose someone offers you 10 billion if you jump from the 70th floor and survive. You probably won’t take the bet. But if someone offers you 10 billion to jump from the 4th floor and survive, you might want to give it a shot, because you know you may not die. The 70th floor is a near-certain loss that nobody accepts, so the real risk lies in the 4th-floor bet that people are actually tempted to take.
Customers who make the bank feel like it is jumping from the 70th floor are most likely rejected from the start. The hard decisions are the applications from customers who make the bank feel like it is jumping from the 4th floor.

“70th floor” customers are likely to be found in the first group. So if the bank did approve some of their applications, it must have had reasons to believe those particular customers would pay off their loans. If the bank approved every first-group application, the data would probably look different from what we have now.
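A small simulation makes this concrete. It is only a sketch under made-up assumptions: every unemployed applicant gets a hidden repayment probability drawn uniformly between 0.2 and 0.9, and the bank is assumed to approve only applicants whose probability is above 0.8, standing in for whatever extra information the bank really uses. The function name `simulate` and all the numbers are hypothetical.

```python
import random


def simulate(n=20000, seed=0):
    """Compare the bad rate among approved unemployed applicants
    with the bad rate they would show if everyone were approved."""
    rng = random.Random(seed)
    approved_total = 0
    approved_bad = 0
    all_bad = 0
    for _ in range(n):
        # Hypothetical hidden chance that this applicant repays on time.
        p_repay = rng.uniform(0.2, 0.9)
        repays = rng.random() < p_repay
        # Hypothetical policy: approve only the strongest applicants.
        approved = p_repay > 0.8
        if not repays:
            all_bad += 1
        if approved:
            approved_total += 1
            if not repays:
                approved_bad += 1
    return approved_bad / approved_total, all_bad / n


observed_bad_rate, population_bad_rate = simulate()
print(f"bad rate among approved applicants: {observed_bad_rate:.2f}")
print(f"bad rate if everyone were approved: {population_bad_rate:.2f}")
```

Under these assumptions the approved unemployed customers default much less often than unemployed applicants as a whole. Rick only ever sees the first number, which is the kind of gap that can make the first group look better than it really is.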
Analyzing data before really understanding what it means may leave you deceived by your data.
Many factors should be taken into consideration in an evaluation, but I have to simplify the explanation here. If there are any mistakes, or anything that makes you uncomfortable, please let me know so that I can fix it.

Original article: https://www.cnblogs.com/rgvb178/p/9181630.html