  • Credit Scorecards (Part 2 of 7): Classification Problem in Statistics and Data Mining

    Classification Problem in Statistics & Data Mining

    I must say I was shocked when Amishi, a girl a little over three years old, announced that going forward she would be friends only with my wife and not with me. Her reason for the breakup was that I am a boy, and girls can only be friends with girls. She has learned this social norm from her friends at preschool. I still remember the way she modeled for me in her swimsuit and umbrella just a few months ago. She was aware of the boy-girl difference even then; it is just that she has learned this weird social norm now. The point here is that toddlers can distinguish genders without much effort. Nature has given us a built-in equation to classify gender at a mere glance with a high degree of precision. Imagine a similar mechanism to distinguish between good and bad borrowers: that would be every banker’s dream. However, evolution has trained us to mate, not to lend.

    Predictive Analytics: Classification Problem – by Roopam

    As I mentioned in the previous article, scorecards have their roots in the classification problem in statistics and data mining. The idea behind most classification problems is to create a mathematical equation that distinguishes between the values of a dichotomous variable. Such variables can take only two values, for example:

    • Male / Female
    • Good / Bad
    • Yes / No
    • God / Devil
    • Happy / Sad
    • Sales / No Sales

    The list can go on until eternity. The reason most business problems try to model dichotomies is that they are easy for us humans to comprehend. We must appreciate that dichotomies are never absolute and have degrees attached to them. For example, I am 80% good and 20% bad – at least I would like to believe this. I shall keep Pareto’s 80-20 principle away from this, i.e. the idea that my 20% bad is responsible for 80% of my behavior.
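
    To make the idea of "degrees attached to a dichotomy" concrete, here is a minimal sketch. It assumes a logistic function as the "mathematical equation"; this part of the series does not name a specific model, so the function, the cutoff of 0.5 and the example scores are all illustrative assumptions.

```python
import math

def probability_of_good(score: float) -> float:
    """Map any real-valued score to a degree between 0 and 1 (logistic function)."""
    return 1.0 / (1.0 + math.exp(-score))

def classify(score: float, cutoff: float = 0.5) -> str:
    """Collapse the degree back into a dichotomy by applying a cutoff (assumed: 0.5)."""
    return "Good" if probability_of_good(score) >= cutoff else "Bad"

# Hypothetical scores, chosen only for illustration.
# A score of log(4) corresponds to the "80% good, 20% bad" example in the text.
for s in (-2.0, 0.0, math.log(4)):
    print(f"score={s:+.2f}  P(good)={probability_of_good(s):.2f}  label={classify(s)}")
```

    The equation produces a degree; the Good / Bad label appears only once a cutoff is chosen, and that cutoff is a business decision rather than a property of the data.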

    Credit Scorecards Development – Problem Statement & Sampling (the definition of a bad customer is flexible)

    In the case of credit scorecards, the problem statement is to distinguish analytically between good and bad borrowers. Hence, the first task is to define a good and a bad borrower. For most loan products, good and bad credit is defined in the following way:

    1. Good loan: never missed an EMI payment, or missed only one
    2. Bad loan: ever missed 3 consecutive EMIs (i.e. 90 days past due)

    Additionally, to tag someone as good or bad, you need to observe his or her behavior for a significant length of time. This length of time varies from product to product, based on the tenor of the loan. For home loans with a tenor of 20 years, 2-3 years is a reasonable observation period.
    However, there is nothing sacrosanct about the above definition, and it can be modified at the discretion of the analyst. Roll-rate analysis and vintage analysis are the two analytical tools you may want to consider while constructing the definition.
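
    The good/bad definition above can be sketched as a small tagging function. This is only an illustration under stated assumptions: the payment history is a hypothetical list of booleans (one per EMI due in the observation period, True meaning paid on time), and the "indeterminate" tag for loans that fit neither definition is my own addition, not something the definition above prescribes.

```python
def max_consecutive_missed(payments):
    """Longest run of missed EMIs; `payments` holds True for each EMI paid on time."""
    longest = current = 0
    for paid in payments:
        current = 0 if paid else current + 1
        longest = max(longest, current)
    return longest

def tag_loan(payments):
    """Tag a loan using the definition above.

    bad:           ever missed 3 consecutive EMIs (90 days past due)
    good:          missed at most one EMI in the observation period
    indeterminate: everything in between (assumed label, not from the article)
    """
    if max_consecutive_missed(payments) >= 3:
        return "bad"
    if sum(1 for paid in payments if not paid) <= 1:
        return "good"
    return "indeterminate"

# Hypothetical 6-month payment histories.
print(tag_loan([True, True, True, True, True, True]))
print(tag_loan([True, False, False, False, True, True]))
print(tag_loan([False, True, False, True, True, True]))
```

    Because the definition is flexible, the thresholds (3 consecutive misses, at most 1 miss) are natural candidates to turn into parameters once roll-rate or vintage analysis suggests different values.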

    Sampling Strategy for Credit Scorecards

    A few years ago, I ran a daylong workshop on statistical inference for a large German shipping & cargo company in Mumbai. During the Q&A session, the Vice President of Operations asked a tricky question: what is a good sample size to achieve good precision? He was looking for a one-size-fits-all answer, and I wish it were that simple. The sample size depends on the degree of similarity, or homogeneity, of the population in question. For example, what do you think is a good sample size to answer the following two questions?

    1. What is the salinity of the Pacific Ocean?
    2. Is there another planet with intelligent life in the Universe?

    In terms of population size, the number of drops in the ocean and the number of planets in the Universe are similar. A couple of drops of water are enough to answer the first question, since the salinity of the oceans is fairly constant. The second question, on the other hand, is a black swan problem. You may need to visit every single planet to rule out the possibility of an intelligent form of life.
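
    The link between homogeneity and sample size can be illustrated with the textbook formula for estimating a population mean, n = (z · σ / E)², where σ is the population standard deviation, E the acceptable margin of error and z the normal quantile for the chosen confidence level. The formula is a standard result rather than something from the workshop described above, and the numbers below are hypothetical.

```python
import math

def required_sample_size(std_dev: float, margin_of_error: float, z: float = 1.96) -> int:
    """Sample size needed to estimate a mean within +/- margin_of_error.

    Uses n = (z * sigma / E)^2, rounded up; z = 1.96 corresponds to roughly
    95% confidence under a normal approximation.
    """
    return math.ceil((z * std_dev / margin_of_error) ** 2)

# Hypothetical populations measured in the same units, same target precision.
homogeneous = required_sample_size(std_dev=0.1, margin_of_error=0.1)    # ocean-like
heterogeneous = required_sample_size(std_dev=5.0, margin_of_error=0.1)  # highly varied

print("homogeneous population needs  :", homogeneous, "observations")
print("heterogeneous population needs:", heterogeneous, "observations")
```

    The required sample grows with the square of the spread, and the population size does not appear in the formula at all. Note that this covers only the first kind of question; a black swan question is about existence, not about an average, so no such formula applies to it.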

    For credit scorecard development, the accepted rule of thumb for sample size is at least 1000 records of both good and bad loans. There is no reason why you cannot build a scorecard with a smaller sample size (say 500 records). However, the analyst needs to be cautious in doing so, because a higher degree of randomness creeps into a small data sample. It is also advisable to keep the sample window as short as possible, i.e. a financial quarter or two, during scorecard development. Further, the sample is divided into two pieces: usually 70% for development and the remaining 30% for validation. We discuss the development and validation samples in detail in the subsequent sections of this series.
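
    A minimal sketch of the 70/30 division into development and validation samples follows. Two things here are assumptions of mine rather than part of the rule of thumb above: the split is done separately within the good and the bad loans (a stratified split, so both pieces keep the same good/bad mix), and the record layout with `id` and `label` keys is hypothetical toy data.

```python
import random

def split_development_validation(records, label_of, dev_fraction=0.7, seed=42):
    """Randomly split records into development and validation samples.

    The split is applied within each label group so that both samples
    preserve the original good/bad proportions (an assumed design choice).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = {}
    for rec in records:
        by_label.setdefault(label_of(rec), []).append(rec)

    development, validation = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * dev_fraction)
        development.extend(shuffled[:cut])
        validation.extend(shuffled[cut:])
    return development, validation

# Hypothetical toy data: 1000 good and 1000 bad loans.
loans = [{"id": i, "label": "good" if i < 1000 else "bad"} for i in range(2000)]
dev, val = split_development_validation(loans, label_of=lambda r: r["label"])
print(len(dev), "development records,", len(val), "validation records")
```

    Splitting within each group matters most when bad loans are scarce: a plain random split of a small sample could leave the validation piece with too few bad loans to say anything useful.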

    Credit Scorecard Development: Sampling Strategy – by Roopam

    Sign-off Note

    In the next article, we will discuss the important topic of variable classing and coarse classing for credit scorecards. See you soon.

  • Source: https://www.cnblogs.com/webRobot/p/9736379.html