  • Notes

    Notes - Berkeley Statistics 2.1X - Week2

    -Week2, 2014/03/06, hphp

    Feel free to discuss and repost; when reposting, please cite the source.

    Week2. Location and spread: how summaries represent data

    Summarizing data helps us understand them, especially when there are many data. This chapter presents several ways to summarize quantitative data by a typical value (a measure of location, such as the mean, median, or mode) and a measure of how well the typical value represents the list (a measure of spread, such as the range, interquartile range, or standard deviation). Markov's and Chebychev's inequalities show that these summary measures can contain a surprisingly large amount of information about the data.


    Lecture 3.1 The median and the mode

    Measures of location

    • Measures of location do just that: They try to capture with a single number what is typical of the data.

    Mean , Median , Mode.

    • Median: The median is the number that divides the (ordered) data in half: the smallest number that is at least as big as half the data. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median.

            E.g. list: 1, 2, 3, 4

            median (1/2 point): --> 2

            1st quartile (1/4 point): --> 1

            3rd quartile (3/4 point): --> 3

    • However, the mean, the median, and the mode are each "as close as possible" to all the data: for each of these three measures of location, the sum of the distances between each datum and the measure of location is as small as it can be. The differences among the three measures of location are in how "distance" is defined.[1]

    • The mean, median, and mode can be related (approximately) to the histogram: loosely speaking, the mode is the highest bump, the median is where half the area is to the right and half is to the left, and the mean is where the histogram would balance, were it a solid object cut out of a uniform block of metal. (All these heuristics are approximate, and depend on the class intervals.)

    • [datum : a single data value (singular of "data")]

    • [Symmetric distribution - the average is at the balance point, the center.]
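
    The median and quartile example above (list 1, 2, 3, 4) can be reproduced with a minimal sketch. It assumes the definition quoted in these notes: the smallest value that is at least as big as a fraction p of the data. The function name `quantile` is made up for illustration. Note that Python's `statistics.median` uses a different convention and would return 2.5 for this list.

```python
import math

def quantile(data, p):
    """Smallest value that is at least as big as a fraction p of the data."""
    ordered = sorted(data)
    # 1-based position of the first entry with at least p*n entries <= it
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

data = [1, 2, 3, 4]
print(quantile(data, 0.5))   # median       -> 2
print(quantile(data, 0.25))  # 1st quartile -> 1
print(quantile(data, 0.75))  # 3rd quartile -> 3
```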

    The center 

    • Median : the "half point" of the data --- > 31.4 mm

    The Mode: The "most common" value

    • the value that has the highest frequency

                 4 | 8

                 5 | 9

                 6 | 3337

                 7 | 000235

                 8 | 012345788

                 9 | 015556

                10 | 0

                

                The most frequent values (three times each):

                 6 | 333

                 7 | 000

                 9 | 555

    A unimodal distribution 

    • Unimodal : one peak
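
    As a rough check of the stem-and-leaf plot above, the sketch below counts frequencies and returns every value that ties for the highest count. It assumes each stem is the tens digit and each leaf the units digit (so "6 | 3337" reads as 63, 63, 63, 67); the function name `modes` is made up for illustration.

```python
from collections import Counter

def modes(data):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

# Values read off the stem-and-leaf plot (assumed: stem = tens, leaf = units)
scores = [48, 59, 63, 63, 63, 67, 70, 70, 70, 72, 73, 75,
          80, 81, 82, 83, 84, 85, 87, 88, 88,
          90, 91, 95, 95, 95, 96, 100]

print(modes(scores))  # -> [63, 70, 95]
```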


    Lecture 3.2 The average

    average = mean

    The average is not necessarily the center of the data, not necessarily a member of the list, and not necessarily a value the variable can take.

    Not much different from what I had already understood.


    Lecture 3.3 Comparing and combining averages

    What is the relation between these groups?

    [National Health and Nutrition Examination 1999-2000]   [note the data and the source.]

    • the data are not longitudinal; they are cross-sectional.

    Comparing the numbers

    • the averages of different groups: "how are the groups related to each other?"

    E.g.

                  ave
    section 1     60
    section 2     70

    We can't tell the overall average, because the section sizes are missing.

    •             ave    section size
      section 1   60     20
      section 2   70     30

    average = total/50 = (60*20 + 70*30)/50 = 66

    • weighted average of averages

                  ave    section size    section proportion
    section 1     60     20              2/5
    section 2     70     30              3/5

    average = 60*2/5 + 70*3/5 = 66

    average = SUM(average[i]*weight[i]) [weights are the section proportions.]
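
    The weighted average of averages can be sketched as follows, using the two sections from the example (averages 60 and 70, sizes 20 and 30). The function name `combined_average` is made up for illustration; exact fractions are used so the weights 2/5 and 3/5 stay exact.

```python
from fractions import Fraction

def combined_average(averages, sizes):
    """Overall average = sum of (group average * group proportion)."""
    total = sum(sizes)
    weights = [Fraction(n, total) for n in sizes]  # section proportions
    return sum(a * w for a, w in zip(averages, weights))

overall = combined_average([60, 70], [20, 30])
print(overall)  # -> 66
```

    The same number comes out of total/50, since (60*20 + 70*30)/50 = 3300/50 = 66.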


    Lecture 3.4 The average and the histogram; The average and the median.

    The median is unaffected by outliers.

    [ Statistics that are not affected too much by small subsets of the data are resistant. The median is resistant; the mean is not. ]

    A right-skewed distribution: the average is greater than the median.

    Example: incomes

    [affluent - rich, wealthy]

    [gizmos and gadgets - novelty devices and accessories]

    [disingenuously - insincerely, slyly]

    [pledge to - promise to]

    [Articles report median incomes instead of average incomes.]

    What does an average test score tell you?

    • if most people did well and some people got low scores, the histogram will have a left-hand tail.

    The average and the histogram

    • list : 2, 3, 3, 4

    average = [ (1*2) + (2*3) + (1*4) ]/4  = 1/4*2 + 2/4*3 + 1/4*4 = 3

    1/4, 2/4, 1/4 --> the proportions of the entries equal to 2, 3, 4

    • list : 2, 3, 3, 7

    average = [ (1*2) + (2*3) + (1*7) ]/4  = 1/4*2 + 2/4*3 + 1/4*7 = 3.75

    1/4, 2/4, 1/4 --> the proportions are unchanged; only the largest value moved, and it pulled the average up

    • the average is the center of gravity of the histogram

    1/4, 2/4, 1/4: the weights
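
    The "center of gravity" view can be sketched in code: the average is the sum of each distinct value times the proportion of entries equal to it. The function name `average_from_histogram` is made up for illustration.

```python
from collections import Counter
from fractions import Fraction

def average_from_histogram(data):
    """Average as sum of (proportion of entries * value) over distinct values."""
    n = len(data)
    counts = Counter(data)
    return sum(Fraction(count, n) * value for value, count in counts.items())

print(average_from_histogram([2, 3, 3, 4]))  # -> 3
print(average_from_histogram([2, 3, 3, 7]))  # -> 15/4, i.e. 3.75
```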


    Lecture 3.5 Markov's inequality

    How far can you be above average? How big can the tail be?

    • Andrey Markov (1856-1922)
    • The average age of a group of people is 20 years. What proportion can be more than 80 years old?

    • Markov's inequality: 

    If a list has only non-negative entries, then the proportion of entries that are at least as large as k times the average is at most 1/k.

    [the average as Sum( weight*value ) can be used in the proof: if more than 1/k of the entries were at least k*average, those entries alone would push the average above itself.]

    • taking care of the edge
    Question: more than 80 years old: > 80
    Markov: at least 80 years old: >= 80
    Since 80 = 4 * 20, at most 1/4 are >= 80, so at most 1/4 are > 80.

    • But if k = 0.5, the bound is 200%, which is true but tells us nothing; the inequality is only informative for k > 1.
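
    A small numerical check of Markov's inequality, as a sketch only. The two age lists are hypothetical, chosen so that the average is 20 years as in the question above; the first one shows that the bound 1/k can actually be reached.

```python
def proportion_at_least(data, threshold):
    """Proportion of entries that are >= threshold."""
    return sum(1 for x in data if x >= threshold) / len(data)

def markov_check(data, k):
    """Return (observed proportion >= k*average, Markov bound 1/k)."""
    assert all(x >= 0 for x in data), "Markov needs non-negative entries"
    average = sum(data) / len(data)
    return proportion_at_least(data, k * average), 1 / k

extreme_ages = [0, 0, 0, 80]                      # hypothetical, average 20
typical_ages = [5, 10, 15, 20, 20, 25, 25, 40]    # hypothetical, average 20

print(markov_check(extreme_ages, 4))  # -> (0.25, 0.25): the bound is attained
print(markov_check(typical_ages, 4))  # -> (0.0, 0.25)
```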


    Lecture 4.1 How the average/other represents data

    Measures of location summarize what is typical of the elements of a list, but not every element is typical. Measures of spread or variability tell us:

    • Are all the elements close to each other?
    • Are most of the elements close to each other?
    • What is the biggest difference between elements?
    • On the average, how far are the elements from each other?

    The three most common measures of spread or variability are the range, the interquartile range (IQR), and the standard deviation (SD).

    The range of a list is the largest value minus the smallest value.

    It is the width of the smallest interval that contains all the data, so it measures spread. It is not resistant, because changing just one datum can make it arbitrarily large.

    Range and interquartile range.

    • How far are these data from the center?
    • Spread
    • IQR: interquartile range, the 3rd quartile minus the 1st quartile


    E.g. the middle 50% of the data are spread over 8 years (IQR = 8 years).
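
    To illustrate why the range is not resistant while the IQR is, here is a minimal sketch. The age list is hypothetical (it is not the lecture's data), and the quartiles follow the "smallest value at least as big as a fraction p of the data" convention used earlier in these notes.

```python
import math

def quantile(data, p):
    """Smallest value that is at least as big as a fraction p of the data."""
    ordered = sorted(data)
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

def value_range(data):
    return max(data) - min(data)

def iqr(data):
    return quantile(data, 0.75) - quantile(data, 0.25)

ages = [18, 19, 20, 21, 22, 24, 27, 30]          # hypothetical list
ages_outlier = [18, 19, 20, 21, 22, 24, 27, 90]  # one datum changed

print(value_range(ages), iqr(ages))                  # -> 12 5
print(value_range(ages_outlier), iqr(ages_outlier))  # -> 72 5
```

    Changing a single entry multiplied the range by six, while the IQR did not move.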


    Lecture 4.2 Standard Deviation

    Deviation from average: roughly how far are the numbers from their average?

    • list : 2, 3, 3, 4, 4, 5, 6, 7 average = 4.25
    • deviations (entry - average): -2.25, -1.25, -1.25, -0.25, -0.25, 0.75, 1.75, 2.75    --->    the average of the deviations is 0.
    • We could average the absolute values of the deviations instead, BUT absolute values do not have good mathematical properties.

    Standard Deviation

    • Root mean square (rms) of the deviations from the average

    The rms (root mean square) of a list measures the average size of its entries. It is defined as follows:

    rms = square-root( (sum of the squares of the entries)/(number of entries) )

    = [ (sum of the squares of the entries)/(number of entries) ]^(1/2).

    • How is the SD computed, and how well does it represent a list of data?

    List: 2, 3, 3, 4, 4, 5, 6, 7    average = 4.25

    variance = mean square of the deviations from the average = 19.5/8 = 2.44

    SD = square-root(2.44) = 1.56

    The average and the SD have the same units as the data.

    ---> The SD is a measure of the spread of the data.

    • The interval average +- SD is roughly [2.69, 5.81]
    • It picks up a good chunk of the list (5 of the 8 entries), but not all.
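
    The SD calculation for the list 2, 3, 3, 4, 4, 5, 6, 7 can be sketched as below. It follows the definition in these notes (rms of the deviations, dividing by the number of entries), which corresponds to the population SD rather than the sample SD that divides by n - 1. The function names are made up for illustration.

```python
import math

def mean(data):
    return sum(data) / len(data)

def variance(data):
    """Mean square of the deviations from the average (divide by n)."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / len(data)

def sd(data):
    """Root mean square of the deviations from the average."""
    return math.sqrt(variance(data))

data = [2, 3, 3, 4, 4, 5, 6, 7]
m, s = mean(data), sd(data)
inside = [x for x in data if m - s <= x <= m + s]

print(m)                 # -> 4.25
print(variance(data))    # -> 2.4375
print(round(s, 2))       # -> 1.56
print(len(inside), "of", len(data))  # -> 5 of 8
```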



    Lecture 4.3 Properties of the SD: Chebychev's inequality

    In a nutshell

    Rough statement: No matter what the list, the vast majority of entries will be in the range average +- a_few_SDs.

    • Chebychev (19th century)
    • Chebychev's inequality:


    • Precise statement:

    No matter what the list, a proportion of at least 1-1/k^2 of the entries will be in the range average +/- k*SD

    Proof

    Intuitively, if the proportion of entries that are k*SD or more away from the average were bigger than 1/k^2, those entries alone would make the mean square deviation bigger than SD^2, which is impossible.
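
    A quick numerical check of the precise statement, as a sketch rather than a proof. It reuses the list 2, 3, 3, 4, 4, 5, 6, 7 from Lecture 4.2 and counts the entries strictly within k SDs of the average; the function names are made up for illustration.

```python
import math

def proportion_within(data, k):
    """Proportion of entries strictly less than k SDs away from the average."""
    n = len(data)
    average = sum(data) / n
    sd = math.sqrt(sum((x - average) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - average) < k * sd) / n

def chebychev_bound(k):
    """Guaranteed minimum proportion within k SDs of the average."""
    return 1 - 1 / k ** 2

data = [2, 3, 3, 4, 4, 5, 6, 7]
for k in (1.5, 2, 3):
    observed = proportion_within(data, k)
    print(k, observed, ">=", round(chebychev_bound(k), 4))
```

    For this list the observed proportions (0.875 for k = 1.5, and 1.0 for k = 2 and 3) are well above the guaranteed minimums, which is typical: the bound has to hold for every possible list, so it is conservative.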


    Footnotes

    [1]. meaning of distances for "Mean, Median, Mode":

        For the mean, the distance between two numbers is defined to be the square of their difference. 

        That is, the sum of the squares of the differences between the data and the mean is smaller than the sum of squares of the differences between the data and any other number. (Equivalently, the rms or root mean square of the differences from the mean is smaller than the rms of the list of differences from any other number—the rms is defined and discussed below.)

        For the median, the distance between two numbers is defined to be the absolute value of their difference. That is, the sum of the absolute values of the differences between a median and the data is no larger than the sum of the absolute values of the differences between any other number and the data. 

        For the mode, the distance between two numbers is defined to be zero if the numbers are equal, and one if they are not equal. That is, the number of data that differ from a mode is no larger than the number of data that differ from any other value. Equivalently, a mode is a number from which the fewest possible data differ: a "most common" value. 
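
    The three notions of distance in this footnote can be checked by brute force. The sketch below tries every candidate on a grid (0, 0.25, ..., 10, an arbitrary choice for illustration) and keeps the one with the smallest total distance to the data. The lists are the ones used earlier in these notes; the function name `closest` is made up.

```python
def closest(candidates, data, distance):
    """Candidate with the smallest total distance to all the data."""
    return min(candidates, key=lambda c: sum(distance(x, c) for x in data))

grid = [i / 4 for i in range(41)]  # 0, 0.25, ..., 10 (illustrative grid)
data = [2, 3, 3, 4, 4, 5, 6, 7]

# Squared difference -> the mean
print(closest(grid, data, lambda x, c: (x - c) ** 2))   # -> 4.25
# Absolute difference -> the median
print(closest(grid, data, lambda x, c: abs(x - c)))     # -> 4.0
# 0 if equal, 1 if not -> the mode (list 2, 3, 3, 7 has a single mode)
print(closest(grid, [2, 3, 3, 7], lambda x, c: 0 if x == c else 1))  # -> 3.0
```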





  • Original article: https://www.cnblogs.com/hphp/p/3584133.html