Credit Scorecards (Part 2 of 7): The Classification Problem in Statistics and Data Mining

Classification Problem in Statistics & Data Mining

I must say I was shocked when Amishi, a girl a little over three years old, announced that going forward she would only be friends with my wife and not with me. Her reason for the breakup was that I am a boy, and girls can only be friends with girls. She has learned this social norm from her friends at preschool. I still remember the way she modeled for me in her swimsuit and umbrella just a few months ago. She was aware of the boy-girl difference even then; it is just that she has now learned this weird social norm. The point here is that toddlers can distinguish genders without much effort. Nature has given us a built-in equation to classify gender at a mere glance with a high degree of precision. Imagine a similar mechanism to distinguish between good and bad borrowers. You are talking about every banker's dream. However, evolution has trained us to mate, not to lend.

Predictive Analytics: Classification Problem – by Roopam

As I mentioned in the previous article, scorecards have their roots in the classification problem in statistics and data mining. The idea behind most classification problems is to create a mathematical equation that distinguishes between the two values of a dichotomous variable. Such variables can take only two values, for example:

• Male / Female
• Good / Bad
• Yes / No
• God / Devil
• Happy / Sad
• Sales / No Sales

The list could go on forever. Most business problems try to model dichotomies because dichotomies are easy for us humans to comprehend. We must appreciate, however, that dichotomies are never absolute and have degrees attached to them. For example, I am 80% good and 20% bad, or at least I would like to believe so. I shall keep Pareto's 80-20 principle away from this, i.e. the idea that my 20% bad is responsible for 80% of my behavior.

Credit Scorecards Development – Problem Statement & Sampling (the definition of a bad customer is flexible)

In the case of credit scorecards, the problem statement is to distinguish analytically between good and bad borrowers. Hence, the first task is to define a good and a bad borrower. For most loan products, good and bad credit are defined in the following way:

1. Good loan: never missed an EMI (equated monthly installment) payment, or missed at most one
2. Bad loan: ever missed 3 consecutive EMIs (i.e. 90 days past due)

Additionally, to tag someone as good or bad, you need to observe his or her behavior for a significant length of time. This length of time varies from product to product based on the tenor of the loan. For home loans, with a tenor of 20 years, 2-3 years is a reasonable observation period.
However, there is nothing sacrosanct about the above definition, and it can be modified at the discretion of the analyst. Roll-rate analysis and vintage analysis are two analytical tools you may want to consider while constructing the definition.
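
The good/bad definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_loan`, the input format (one boolean per month in the observation window, `True` meaning the EMI was missed) and the "indeterminate" label for loans that meet neither definition are all hypothetical and not part of the original article. The thresholds are parameters precisely because the definition is at the analyst's discretion.

```python
from itertools import groupby


def label_loan(missed, bad_streak=3, good_max_missed=1):
    """Tag a loan from its monthly payment history.

    missed: sequence of booleans over the observation window,
            True where the EMI for that month was missed (assumed format).
    bad_streak: consecutive missed EMIs that make a loan bad (3 = 90 days past due).
    good_max_missed: most missed EMIs a good loan may have.
    """
    # Length of the longest run of consecutive missed payments.
    longest_streak = max(
        (sum(1 for _ in run) for is_missed, run in groupby(missed) if is_missed),
        default=0,
    )
    if longest_streak >= bad_streak:
        return "bad"
    if sum(missed) <= good_max_missed:
        return "good"
    # Meets neither definition; this third bucket is an assumption of the sketch.
    return "indeterminate"


history_clean = [False] * 24
history_one_miss = [False] * 10 + [True] + [False] * 13
history_90dpd = [False] * 5 + [True, True, True] + [False] * 16
history_scattered = [True, False, True, False] + [False] * 20

for name, history in [
    ("clean", history_clean),
    ("one miss", history_one_miss),
    ("90 days past due", history_90dpd),
    ("scattered misses", history_scattered),
]:
    print(name, "->", label_loan(history))
```

Changing `bad_streak` or `good_max_missed` is how an analyst would try an alternative definition, for example after a roll-rate analysis suggests a different cut-off.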

Sampling Strategy for Credit Scorecards

A few years ago, I ran a daylong workshop on statistical inference for a large German shipping and cargo company in Mumbai. During the Q&A session, the Vice President of Operations asked a tricky question: what is a good sample size to achieve good precision? He was looking for a one-size-fits-all answer, and I wish it were that simple. The sample size depends on the degree of similarity, or homogeneity, of the population in question. For example, what do you think is a good sample size to answer the following two questions?

1. What is the salinity of the Pacific Ocean?
2. Is there another planet with intelligent life in the Universe?

In terms of population size, the number of drops in the ocean and the number of planets in the Universe are comparable. A couple of drops of water are enough to answer the first question, since the salinity of the oceans is fairly constant. The second question, on the other hand, is a black swan problem. You may need to visit every single planet to rule out the possibility of an intelligent form of life.
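
To put rough numbers on the homogeneity point, here is a small sketch using the textbook sample-size formula for estimating a mean, n = (z·σ/E)², where σ is the population standard deviation and E the acceptable margin of error. It assumes simple random sampling and a normal approximation, and the function name and the input values are made up for illustration. It covers only the "ocean" kind of question; it says nothing about the black swan case, where no finite sample can rule out existence.

```python
import math
from statistics import NormalDist


def required_sample_size(std_dev, margin, confidence=0.95):
    """Sample size needed to estimate a mean within +/- margin.

    Uses n = (z * sigma / E)^2, rounded up, under a normal approximation.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * std_dev / margin) ** 2)


# Illustrative values only: same margin of error, very different variability.
homogeneous = required_sample_size(std_dev=0.1, margin=0.05)
heterogeneous = required_sample_size(std_dev=5.0, margin=0.05)

print("homogeneous population:  ", homogeneous)
print("heterogeneous population:", heterogeneous)
```

With the margin held fixed, the required sample grows with the square of the standard deviation, which is why a nearly uniform population needs only a handful of observations.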

For credit scorecard development, the accepted rule of thumb for sample size is at least 1000 records each of good and bad loans. There is no reason why you cannot build a scorecard with a smaller sample (say 500 records). However, the analyst needs to be cautious in doing so, because a higher degree of randomness creeps into a small data sample. It is also advisable to keep the sample window as short as possible during scorecard development, i.e. a financial quarter or two. Further, the sample is divided into two pieces: usually 70% for the development sample and the remaining 30% for the validation sample. We discuss the development and validation samples in detail in subsequent parts of this series.
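
The rule of thumb and the 70/30 split described above can be sketched as follows. The record format (dicts with a `"label"` key), the function names and the synthetic data are assumptions made for this example. Splitting within each label, so that both samples keep the same share of bad loans, is also a choice of this sketch; the article itself only specifies the 70/30 proportion.

```python
import random
from collections import Counter


def meets_rule_of_thumb(records, minimum=1000):
    """Check for at least `minimum` good and `minimum` bad records."""
    counts = Counter(r["label"] for r in records)
    return counts["good"] >= minimum and counts["bad"] >= minimum


def split_sample(records, dev_fraction=0.7, seed=42):
    """Split records into development and validation samples.

    The split is done separately for good and bad loans so that both
    samples keep the same good/bad mix (an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    development, validation = [], []
    for label in ("good", "bad"):
        group = [r for r in records if r["label"] == label]
        rng.shuffle(group)
        cut = round(len(group) * dev_fraction)
        development.extend(group[:cut])
        validation.extend(group[cut:])
    return development, validation


# Synthetic records purely for illustration.
records = [{"loan_id": i, "label": "good"} for i in range(1000)]
records += [{"loan_id": 1000 + i, "label": "bad"} for i in range(1000)]

development, validation = split_sample(records)
print("rule of thumb met:", meets_rule_of_thumb(records))
print("development:", len(development), "validation:", len(validation))
```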

Credit Scorecard Development: Sampling Strategy – by Roopam

Sign-off Note

In the next article, we will discuss an important topic: variable classing and coarse classing for credit scorecards. See you soon.

Original article: https://www.cnblogs.com/webRobot/p/9736379.html
