  • Study notes: Machine Learning Crash Course | Google Developers

    Machine Learning Crash Course  |  Google Developers

    • https://developers.google.com/machine-learning/crash-course/
    • Google's fast-paced, practical introduction to machine learning

    ML Concepts


    Introduction to Machine Learning

    • As you'll discover, machine learning requires a different mindset than other programming problems.
      • For example, real-world machine learning focuses far more on data analysis than on coding.

    Framing

    • Key ML Terminology
      • Example is a particular instance of data, x
      • Labeled example has {features, label}: (x, y)
        • Used to train the model
      • Unlabeled example has {features, ?}: (x, ?)
        • Used for making predictions on new data 
      • Model maps examples to predicted labels: y'
        • Defined by internal parameters, which are learned
      • A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:
        • What is the value of a house in California?
        • What is the probability that a user will click on this ad?
      • A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:
        • Is a given email message spam or not spam?
        • Is this an image of a dog, a cat, or a hamster?
      • The labels applied to some examples might be unreliable.
        • Definitely. It's important to check how reliable your data is. The labels for this dataset probably come from email users who mark particular email messages as spam. Since most users do not mark every suspicious email message as spam, we may have trouble knowing whether an email is spam. Furthermore, spammers could intentionally poison our model by providing faulty labels.
      • "Shoe beauty" is not a useful feature.
        • Good features are concrete and quantifiable. Beauty is too vague a concept to serve as a useful feature. Beauty is probably a blend of certain concrete features, such as style and color. Style and color would each be better features than beauty.

    Descending into ML

    • Linear regression is a method for finding the straight line or hyperplane that best fits a set of points. This module explores linear regression intuitively before laying the groundwork for a machine learning approach to linear regression.
    • Linear Regression
      • By convention in machine learning, you'll write the equation for a model slightly differently:
        • y′ = b + w1x1
        • where:
          • y′ is the predicted label (a desired output).
          • b is the bias (the y-intercept), sometimes referred to as w0.
          • w1 is the weight of feature 1. Weight is the same concept as the "slope" m in the traditional equation of a line.
          • x1 is a feature (a known input).
      • To infer (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the x1 value into this model.
      • Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might look as follows (see the code sketch below):
        • y′ = b + w1x1 + w2x2 + w3x3
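      • A minimal sketch of the three-feature model in plain Python; the weights, bias, and feature values below are made up for illustration:

        def predict(features, weights, bias):
            """Return y' = b + w1*x1 + w2*x2 + ... for one example."""
            return bias + sum(w * x for w, x in zip(weights, features))

        # Hypothetical learned parameters and one example's feature values.
        print(predict(features=[1.0, 2.0, 3.0], weights=[0.5, -0.2, 0.1], bias=4.0))
        # 4.0 + 0.5*1.0 - 0.2*2.0 + 0.1*3.0 = 4.4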
    • Training and Loss
      • Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.
      • Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
      • The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:
        • = the square of the difference between the label and the prediction
        • = (observation − prediction(x))²
        • = (y − y')²
      • Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
        • MSE = (1/N) ∑_{(x,y)∈D} (y − prediction(x))²
        • where:
          • (x,y) is an example in which
          • x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
          • y is the example's label (for example, temperature).
          • prediction(x) is a function of the weights and bias in combination with the set of features x.
          • D is a data set containing many labeled examples, which are (x,y) pairs.
          • N is the number of examples in D.
      • Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
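      • A short sketch of the MSE computation above, on made-up labels and predictions:

        import numpy as np

        labels = np.array([1.0, 2.0, 3.0])
        predictions = np.array([0.5, 2.0, 3.5])

        # Average of the squared differences over all N examples.
        mse = np.mean((labels - predictions) ** 2)
        print(mse)  # (0.25 + 0.0 + 0.25) / 3 ≈ 0.1667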

    Reducing Loss

    • To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.
    • How do we reduce loss?
      • Hyperparameters are the configuration settings used to tune how the model is trained.
      • Derivative of (y − y')² with respect to the weights and biases tells us how loss changes for a given example

        • Simple to compute and convex

      • So we repeatedly take small steps in the direction that minimizes loss

        • We call these Gradient Steps (But they're really negative Gradient Steps)

        • This strategy is called Gradient Descent

    • Weight Initialization
      • For convex problems, weights can start anywhere (say, all 0s)
        • Convex: think of a bowl shape
        • Just one minimum
      • Foreshadowing: not true for neural nets
        • Non-convex: think of an egg crate
        • More than one minimum
        • Strong dependency on initial values
    • SGD & Mini-Batch Gradient Descent
      • Could compute gradient over entire data set on each step, but this turns out to be unnecessary
      • Computing gradient on small data samples works well
        • On every step, get a new random sample
      • Stochastic Gradient Descent: one example at a time
      • Mini-Batch Gradient Descent: batches of 10-1000
        • Loss & gradients are averaged over the batch
    • An Iterative Approach
      • Key Point: A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.
    • Gradient Descent
      • Note: When performing gradient descent, we generalize the above process to tune all the model parameters simultaneously. For example, to find the optimal values of both w1 and the bias b, we calculate the gradients with respect to both w1 and b. Next, we modify the values of w1 and b based on their respective gradients. Then we repeat these steps until we reach minimum loss.
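      • A sketch of this loop for the one-feature model y′ = b + w1x1 with squared loss; the data and learning rate are made up, and the underlying relationship here is y = 1 + 2x:

        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0])
        y = np.array([3.0, 5.0, 7.0, 9.0])

        w1, b = 0.0, 0.0          # initial guesses
        learning_rate = 0.05

        for step in range(1000):
            error = (b + w1 * x) - y
            grad_w1 = 2 * np.mean(error * x)   # d(MSE)/d(w1)
            grad_b = 2 * np.mean(error)        # d(MSE)/d(b)
            w1 -= learning_rate * grad_w1      # step along the negative gradient
            b -= learning_rate * grad_b

        print(w1, b)  # approaches 2.0 and 1.0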
    • Learning Rate
      • As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
      • Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long.
      • Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong.
      • There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
    • Optimizing Learning Rate
      • NOTE: In practice, finding a "perfect" (or near-perfect) learning rate is not essential for successful model training. The goal is to find a learning rate large enough that gradient descent converges efficiently, but not so large that it never converges.
    • Stochastic Gradient Descent
      • In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.
      • By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme--it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
      • Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
      • When performing gradient descent on a large data set, which of the following batch sizes will likely be more efficient?
      • A small batch or even a batch of one example (SGD).
      • Amazingly enough, performing gradient descent on a small batch or even a batch of one example is usually more efficient than the full batch. After all, finding the gradient of one example is far cheaper than finding the gradient of millions of examples. To ensure a good representative sample, the algorithm scoops up another random small batch (or batch of one) on every iteration.
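      • A mini-batch variant of the earlier gradient-descent sketch; the batch size and data are illustrative. Note the fresh random sample on every step:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.uniform(0, 10, size=10_000)
        y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)  # noisy y = 1 + 2x

        w1, b = 0.0, 0.0
        learning_rate, batch_size = 0.01, 32

        for step in range(2000):
            idx = rng.integers(0, x.size, size=batch_size)  # new random mini-batch
            error = (b + w1 * x[idx]) - y[idx]
            w1 -= learning_rate * 2 * np.mean(error * x[idx])
            b -= learning_rate * 2 * np.mean(error)

        print(w1, b)  # noisy, but close to 2.0 and 1.0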

    First Steps with TensorFlow

    • TensorFlow API Hierarchy
    Hierarchy of TensorFlow toolkits. Estimator API is at the top.
    • A Quick Look at the tf.estimator API
     import tensorflow as tf

     # Set up a linear classifier.
     classifier = tf.estimator.LinearClassifier(feature_columns)

     # Train the model on some example data.
     classifier.train(input_fn=train_input_fn, steps=2000)

     # Use it to predict.
     predictions = classifier.predict(input_fn=predict_input_fn)
    • Toolkit
      • Tensorflow is a computational framework for building machine learning models. TensorFlow provides a variety of different toolkits that allow you to construct models at your preferred level of abstraction. You can use lower-level APIs to build models by defining a series of mathematical operations. Alternatively, you can use higher-level APIs (like tf.estimator) to specify predefined architectures, such as linear regressors or neural networks.
      • The following table summarizes the purposes of the different layers:
    Toolkit(s)                        Description
    Estimator (tf.estimator)          High-level, OOP API.
    tf.layers/tf.losses/tf.metrics    Libraries for common model components.
    TensorFlow                        Lower-level APIs
      • TensorFlow consists of the following two components:
        • a graph protocol buffer
        • a runtime that executes the (distributed) graph
      • These two components are analogous to Python code and the Python interpreter. Just as the Python interpreter is implemented on multiple hardware platforms to run Python code, TensorFlow can run the graph on multiple hardware platforms, including CPU, GPU, and TPU.
      • Which API(s) should you use? You should use the highest level of abstraction that solves the problem. The higher levels of abstraction are easier to use, but are also (by design) less flexible. We recommend you start with the highest-level API first and get everything working. If you need additional flexibility for some special modeling concerns, move one level lower. Note that each level is built using the APIs in lower levels, so dropping down the hierarchy should be reasonably straightforward.
    • tf.estimator API
      • tf.estimator is compatible with the scikit-learn API. Scikit-learn is an extremely popular open-source ML library in Python, with over 100k users, including many at Google.

    Generalization

    • Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.
    • The Big Picture

    Cycle of model, prediction, sample, discovering true distribution, more sampling

      • Goal: predict well on new data drawn from (hidden) true distribution.
      • Problem: we don't see the truth.
        • We only get to sample from it.
      • If model h fits our current sample well, how can we trust it will predict well on other new samples?
    • How Do We Know If Our Model Is Good?
      • Theoretically:
        • Interesting field: generalization theory
        • Based on ideas of measuring model simplicity / complexity
      • Intuition: formalization of Ockham's Razor principle
        • The less complex a model is, the more likely that a good empirical result is not just due to the peculiarities of our sample
      • Empirically:
        • Asking: will our model do well on a new sample of data?
        • Evaluate: get a new sample of data; call it the test set
        • Good performance on the test set is a useful indicator of good performance on the new data in general:
          • If the test set is large enough
          • If we don't cheat by using the test set over and over
    • The ML Fine Print
      • Three basic assumptions in all of the above:
      1. We draw examples independently and identically (i.i.d.) at random from the distribution
      2. The distribution is stationary: it doesn't change over time
      3. We always pull from the same distribution: including training, validation, and test sets

    • Peril of Overfitting
      • An overfit model gets a low loss during training but does a poor job predicting new data.
      • As you'll see later on, overfitting is caused by making a model more complex than necessary. The fundamental tension of machine learning is between fitting our data well, but also fitting the data as simply as possible.
      • The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.
      • In modern times, we've formalized Ockham's razor into the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds--a statistical description of a model's ability to generalize to new data based on factors such as:
        • the complexity of the model
        • the model's performance on training data
      • A machine learning model aims to make good predictions on new, previously unseen data. But if you are building a model from your data set, how would you get the previously unseen data? Well, one way is to divide your data set into two subsets:
        • training set—a subset to train a model.
        • test set—a subset to test the model.
      • Good performance on the test set is a useful indicator of good performance on the new data in general, assuming that:
        • The test set is large enough.
        • You don't cheat by using the same test set over and over.
      • Summary
        • Overfitting occurs when a model tries to fit the training data so closely that it does not generalize well to new data.
        • If the key assumptions of supervised ML are not met, then we lose important theoretical guarantees on our ability to predict on new data.

    Training and Test Sets

    • A test set is a data set used to evaluate the model developed from a training set.
    • What If We Only Have One Data Set?
      • Divide into two sets:
        • training set
        • test set
      • Classic gotcha: do not train on test data
        • Getting surprisingly low loss?
        • Before celebrating, check if you're accidentally training on test data
    • Splitting Data
      • Make sure that your test set meets the following two conditions:
        • Is large enough to yield statistically meaningful results.
        • Is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.
      • Never train on test data.

    Validation Set

    • Check Your Intuition
      • We looked at a process of using a test set and a training set to drive iterations of model development. On each iteration, we'd train on the training data and evaluate on the test data, using the evaluation results on test data to guide choices of and changes to various model hyperparameters like learning rate and features. Is there anything wrong with this approach?
      • Doing many rounds of this procedure might cause us to implicitly fit to the peculiarities of our specific test set.
      • Yes indeed! The more often we evaluate on a given test set, the more we are at risk for implicitly overfitting to that one test set. We'll look at a better protocol next.
    • Partitioning a data set into a training set and test set lets you judge whether a given model will generalize well to new data. However, using only two partitions may be insufficient when doing many rounds of hyperparameter tuning.
    • Another Partition
      • Dividing the data set into two sets is a good idea, but not a panacea. You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure:

    A horizontal bar divided into three pieces: 70% of which is the training set, 15% the validation set, and 15% the test set

    Figure 2. Slicing a single data set into three subsets.

      • Use the validation set to evaluate results from the training set. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. The following figure shows this new workflow:

    Similar workflow to Figure 1, except that instead of evaluating the model against the test set, the workflow evaluates the model against the validation set. Then, once the training set and validation set more-or-less agree, confirm the model against the test set.

    Figure 3. A better workflow.

      • In this improved workflow:
        • Pick the model that does best on the validation set.
        • Double-check that model against the test set.
      • This is a better workflow because it creates fewer exposures to the test set.
      • Tip
        • Test sets and validation sets "wear out" with repeated use. That is, the more you use the same data to make decisions about hyperparameter settings or other model improvements, the less confidence you'll have that these results actually generalize to new, unseen data. Note that validation sets typically wear out more slowly than test sets.
        • If possible, it's a good idea to collect more data to "refresh" the test set and validation set. Starting anew is a great reset.

    Representation

    • A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.
    • Properties of a Good Feature
      • Feature values should appear with non-zero value more than a small handful of times in the dataset.
      • Features should have a clear, obvious meaning.
      • Features shouldn't take on "magic" values
      • The definition of a feature shouldn't change over time.
      • Distribution should not have crazy outliers
    • The Binning Trick
      • Create several boolean bins, each mapping to a new unique feature
      • Allows model to fit a different value for each bin
    • Good Habits
      • KNOW YOUR DATA
        • Visualize: Plot histograms, rank most to least common.
        • Debug: Duplicate examples? Missing values? Outliers? Data agrees with dashboards? Training and Validation data similar?
        • Monitor: Feature quantiles, number of examples over time?
    • Feature Engineering
      • In traditional programming, the focus is on code. In machine learning projects, the focus shifts to representation. That is, one way developers hone a model is by adding and improving its features.
      • Mapping Raw Data to Features
        • Feature engineering means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.
        • Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.

    Raw data is mapped to a feature vector through a process called feature engineering.

      • Mapping numeric values

    An example of a feature that can be copied directly from the raw data

      • Mapping categorical values
        • Categorical features have a discrete set of possible values.
        • Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.
        • We can accomplish this by defining a mapping from the feature values, which we'll refer to as the vocabulary of possible values, to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all "other" category, known as an OOV (out-of-vocabulary) bucket.
        • Mapping each category to a single integer forces the model to learn one weight that applies to every category and imposes an artificial ordering on the values. To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:
          • For values that apply to the example, set corresponding vector elements to 1.
          • Set all other elements to 0.
        • The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.

    Mapping a string value ("Shorebird Way") to a sparse vector, via one-hot encoding.

        • One-hot encoding extends to numeric data that you do not want to directly multiply by a weight, such as a postal code.
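        • A sketch of the one-hot mapping with an OOV bucket; the street names and vocabulary here are made up:

          vocabulary = ["Shorebird Way", "Main Street", "Ocean Avenue"]

          def one_hot(value, vocab):
              """Map a categorical value to a binary vector; unknown
              values fall into an extra OOV bucket at the end."""
              vector = [0] * (len(vocab) + 1)  # +1 for the OOV bucket
              index = vocab.index(value) if value in vocab else len(vocab)
              vector[index] = 1
              return vector

          print(one_hot("Shorebird Way", vocabulary))  # [1, 0, 0, 0]
          print(one_hot("Elm Street", vocabulary))     # [0, 0, 0, 1]  (OOV)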
      • Sparse Representation
        • Suppose that you had 1,000,000 different street names in your data set that you wanted to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements where only 1 or 2 elements are true is a very inefficient representation in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a sparse representation in which only nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.
    • Qualities of Good Features
      • Avoid rarely used discrete feature values
        • Good feature values should appear more than 5 or so times in a data set. Doing so enables a model to learn how this feature value relates to the label. That is, having many examples with the same discrete value gives the model a chance to see the feature in different settings, and in turn, determine when it's a good predictor for the label.
        • Conversely, if a feature's value appears only once or very rarely, the model can't make predictions based on that feature. 
      • Prefer clear and obvious meanings
        • Each feature should have a clear and obvious meaning to anyone on the project.
        • Conversely, the meaning of the following feature value is pretty much indecipherable to anyone but the engineer who created it:
          • house_age: 851472000
        • In some cases, noisy data (rather than bad engineering choices) causes unclear values.
      • Don't mix "magic" values with actual data
        • Good floating-point features don't contain peculiar out-of-range discontinuities or "magic" values.
        • However, if a user didn't enter a quality_rating, perhaps the data set represented its absence with a magic value like the following:
          • quality_rating: -1
        • To explicitly mark magic values, create a Boolean feature that indicates whether or not a quality_rating was supplied. Give this Boolean feature a name like is_quality_rating_defined.
        • In the original feature, replace the magic values as follows:
          • For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing.
          • For continuous variables, ensure missing values do not affect the model by using the mean value of the feature's data.
      • Account for upstream instability
        • The definition of a feature shouldn't change over time.
        • But gathering a value inferred by another model carries additional costs. Perhaps the value "219" currently represents Sao Paulo, but that representation could easily change on a future run of the other model:
          • inferred_city_cluster: "219"
    • Cleaning Data
      • As an ML engineer, you'll spend enormous amounts of your time tossing out bad examples and cleaning up the salvageable ones. Even a few "bad apples" can spoil a large data set.  
      • Scaling feature values 
        • Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:
          • Helps gradient descent converge more quickly.
          • Helps avoid the "NaN trap," in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and—due to math operations—every other number in the model also eventually becomes a NaN.
          • Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.
        • You don't have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
        • One obvious way to scale numerical data is to linearly map [min value, max value] to a small scale, such as [-1, +1].
        • Another popular scaling tactic is to calculate the Z score of each value. The Z score relates the number of standard deviations away from the mean. In other words:
          • scaled_value = (value − mean) / stddev
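        • A sketch of both scaling tactics on made-up values:

          import numpy as np

          values = np.array([100.0, 250.0, 400.0, 900.0])

          # Linear scaling: map [min value, max value] to [-1, +1].
          linear = 2 * (values - values.min()) / (values.max() - values.min()) - 1

          # Z-score scaling: (value - mean) / stddev.
          z_scores = (values - values.mean()) / values.std()

          print(linear)    # [-1.    -0.625 -0.25   1.   ]
          print(z_scores)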
      • Handling extreme outliers
        • The plot shows that the vast majority of areas in California have one or two rooms per person. But take a look along the x-axis.

    A plot of roomsPerPerson in which nearly all the values are clustered between 0 and 4, but there's a verrrrry long tail reaching all the way out to 55 rooms per person

    Figure 4. A verrrrry lonnnnnnng tail.

        • How could we minimize the influence of those extreme outliers? Well, one way would be to take the log of every value:

    A plot of log(roomsPerPerson) in which 99% of values cluster between about 0.4 and 1.8, but there's still a longish tail that goes out to 4.2 or so.

    Figure 5. Logarithmic scaling still leaves a tail.

        • Log scaling does a slightly better job, but there's still a significant tail of outlier values. Let's pick yet another approach. What if we simply "cap" or "clip" the maximum value of roomsPerPerson at an arbitrary value, say 4.0?

    A plot of roomsPerPerson in which all values lie between -0.3 and 4.0. The plot is bell-shaped, but there's an anomalous hill at 4.0

    Figure 6. Clipping feature values at 4.0

        • Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
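        • Both tactics are easy to sketch; the feature values below are made up:

          import numpy as np

          rooms_per_person = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 55.0])

          log_scaled = np.log(rooms_per_person)           # compresses the long tail
          clipped = np.clip(rooms_per_person, None, 4.0)  # values above 4.0 become 4.0

          print(clipped)  # [0.5 1.  1.5 2.  3.  4. ]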
      • Binning
        • The following plot shows the relative prevalence of houses at different latitudes in California. Notice the clustering—Los Angeles is about at latitude 34 and San Francisco is roughly at latitude 38.

    A plot of houses per latitude. The plot is highly irregular, containing doldrums around latitude 36 and huge spikes around latitudes 34 and 38.

    Figure 7. Houses per latitude.

        • In the data set, latitude is a floating-point value. However, it doesn't make sense to represent latitude as a floating-point feature in our model. That's because no linear relationship exists between latitude and housing values. For example, houses in latitude 35 are not 35 / 34 more expensive (or less expensive) than houses at latitude 34. And yet, individual latitudes probably are a pretty good predictor of house values.
        • To make latitude a helpful predictor, let's divide latitudes into "bins" as suggested by the following figure:

    A plot of houses per latitude. The plot is divided into "bins" between whole number latitudes.

    Figure 8. Binning values.

        • Instead of having one floating-point feature, we now have 11 distinct boolean features (LatitudeBin1, LatitudeBin2, ..., LatitudeBin11). Having 11 separate features is somewhat inelegant, so let's unite them into a single 11-element vector. Doing so will enable us to represent latitude 37.4 as follows:
        • [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
        • Thanks to binning, our model can now learn completely different weights for each latitude.
        • For simplicity's sake in the latitude example, we used whole numbers as bin boundaries. Had we wanted finer-grain resolution, we could have split bin boundaries at, say, every tenth of a degree. Adding more bins enables the model to learn different behaviors from latitude 37.4 than latitude 37.5, but only if there are sufficient examples at each tenth of a latitude.
        • Another approach is to bin by quantile, which ensures that the number of examples in each bucket is equal. Binning by quantile completely removes the need to worry about outliers.
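        • A sketch of whole-degree binning that reproduces the vector above; the bin edges 32 through 43 are illustrative California latitudes:

          import numpy as np

          edges = np.arange(32, 44)                     # 12 edges -> 11 bins
          latitude = 37.4

          bin_index = np.digitize(latitude, edges) - 1  # 37.4 falls in bin [37, 38)
          one_hot = np.zeros(len(edges) - 1, dtype=int)
          one_hot[bin_index] = 1
          print(one_hot)  # [0 0 0 0 0 1 0 0 0 0 0]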
      • Scrubbing
        • In real-life, many examples in data sets are unreliable due to one or more of the following:
          • Omitted values. For instance, a person forgot to enter a value for a house's age.
          • Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
          • Bad labels. For instance, a person mislabeled a picture of an oak tree as a maple.
          • Bad feature values. For example, someone typed in an extra digit, or a thermometer was left out in the sun.
        • Once detected, you typically "fix" bad examples by removing them from the data set. To detect omitted values or duplicated examples, you can write a simple program. Detecting bad feature values or labels can be far trickier.
        • In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate. In addition, getting statistics like the following can help:
          • Maximum and minimum
          • Mean and median
          • Standard deviation
        • Consider generating lists of the most common values for discrete features.
      • Know your data
        • Follow these rules:
          • Keep in mind what you think your data should look like.
          • Verify that the data meets these expectations (or that you can explain why it doesn’t).
          • Double-check that the training data agrees with other sources (for example, dashboards).
        • Treat your data with all the care that you would treat any mission-critical code. Good ML relies on good data.  

    Feature Crosses

    • A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
    • Feature Crosses
      • Feature crosses is the name of this approach
      • Define templates of the form [A x B]
      • Can be complex: [A x B x C x D x E]
      • When A and B represent boolean features, such as bins, the resulting crosses can be extremely sparse
    • Feature Crosses: Why would we do this?
      • Linear learners use linear models
      • Such learners scale well to massive data, e.g., Vowpal Wabbit, sofia-ml
      • But without feature crosses, the expressivity of these models would be limited
      • Using feature crosses + massive data is one efficient strategy for learning highly complex models
        • Foreshadowing: neural nets provide another
    • Encoding Nonlinearity
      • A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together. (The term cross comes from cross product.) Let's create a feature cross named x3 by crossing x1 and x2:
        • x3 = x1x2
      • Thanks to stochastic gradient descent, linear models can be trained efficiently. Consequently, supplementing scaled linear models with feature crosses has traditionally been an efficient way to train on massive-scale data sets.  
    • Crossing One-Hot Vectors
      • So far, we've focused on feature-crossing two individual floating-point features. In practice, machine learning models seldom cross continuous features. However, machine learning models do frequently cross one-hot feature vectors. Think of feature crosses of one-hot feature vectors as logical conjunctions.
      • Linear learners scale well to massive data. Using feature crosses on massive data sets is one efficient strategy for learning highly complex models. Neural networks provide another strategy.
      • Different cities in California have markedly different housing prices. Suppose you must create a model to predict housing prices. Which of the following sets of features or feature crosses could learn city-specific relationships between roomsPerPerson and housing price?
        • One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]
        • Crossing binned latitude with binned longitude enables the model to learn city-specific effects of roomsPerPerson. Binning prevents a change in latitude producing the same result as a change in longitude. Depending on the granularity of the bins, this feature cross could learn city-specific or neighborhood-specific or even block-specific effects.
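      • A sketch of crossing two one-hot (binned) features: the outer product yields one element per (latitude bin, longitude bin) combination, acting as a logical AND. The bin counts here are made up:

        import numpy as np

        binned_latitude = np.array([0, 1, 0])      # one-hot: latitude bin 1
        binned_longitude = np.array([0, 0, 1, 0])  # one-hot: longitude bin 2

        cross = np.outer(binned_latitude, binned_longitude).flatten()
        print(cross)  # 12-element vector; only the (lat 1, lon 2) element is 1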

    Regularization for Simplicity

    • Regularization means penalizing the complexity of a model to reduce overfitting.
    • Penalizing Model Complexity
      • We want to avoid model complexity where possible.
      • We can bake this idea into the optimization we do at training time.
      • Empirical Risk Minimization:
        • aims for low training error
          • minimize: Loss(Data|Model)
        • while balancing against complexity
          • minimize: Loss(Data|Model)+complexity(Model)
    • Regularization
      • How to define complexity(Model)?
      • Prefer smaller weights
      • Diverging from this should incur a cost
      • Can encode this idea via L2 regularization (a.k.a. ridge)
        • complexity(model) = sum of the squares of the weights
        • Penalizes really big weights
        • For linear models: prefers flatter slopes
        • Bayesian prior:
          • weights should be centered around zero
          • weights should be normally distributed
    • A Loss Function with L2 Regularization
      • Loss(Data|Model) + λ(w1^2+…+wn^2)
      • Where:
        • Loss: Aims for low training error
        • λ: Scalar value that controls how weights are balanced
        • w1^2+…+wn^2: Square of L2 norm 
    • L₂ Regularization
      • In other words, this generalization curve shows that the model is overfitting to the data in the training set. Channeling our inner Ockham, perhaps we could prevent overfitting by penalizing complex models, a principle called regularization.
      • In other words, instead of simply aiming to minimize loss (empirical risk minimization):
        • minimize(Loss(Data|Model))
      • we'll now minimize loss+complexity, which is called structural risk minimization:
        • minimize(Loss(Data|Model) + complexity(Model))
      • Our training optimization algorithm is now a function of two terms: the loss term, which measures how well the model fits the data, and the regularization term, which measures model complexity.
      • Machine Learning Crash Course focuses on two common (and somewhat related) ways to think of model complexity:
        • Model complexity as a function of the weights of all the features in the model.
        • Model complexity as a function of the total number of features with nonzero weights. (A later module covers this approach.)
      • If model complexity is a function of weights, a feature weight with a high absolute value is more complex than a feature weight with a low absolute value.
      • We can quantify complexity using the L2 regularization formula, which defines the regularization term as the sum of the squares of all the feature weights:
        • L2 regularization term = ||w||2^2 = w1^2 + w2^2 + ... + wn^2
      • In this formula, weights close to zero have little effect on model complexity, while outlier weights can have a huge impact.
    • Lambda
      • Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda (also called the regularization rate). That is, model developers aim to do the following:
        • minimize(Loss(Data|Model)+λ complexity(Model))
      • Performing L2 regularization has the following effect on a model:
        • Encourages weight values toward 0 (but not exactly 0)
        • Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.
      • Increasing the lambda value strengthens the regularization effect.
      • When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:
        • If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.
        • If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won't be able to generalize to new data.
      • Note: Setting lambda to zero removes regularization completely. In this case, training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.
      • The ideal value of lambda produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value of lambda is data-dependent, so you'll need to do some tuning.

      • There's a close connection between learning rate and lambda. Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect because the steps away from 0 aren't as large. Consequently, tweaking learning rate and lambda simultaneously may have confounding effects.
      • Early stopping means ending training before the model fully reaches convergence. In practice, we often end up with some amount of implicit early stopping when training in an online (continuous) fashion. That is, some new trends just haven't had enough data yet to converge.
      • As noted, the effects from changes to regularization parameters can be confounded with the effects from changes in learning rate or number of iterations. One useful practice (when training across a fixed batch of data) is to give yourself a high enough number of iterations that early stopping doesn't play into things.
      • L2 regularization may cause the model to learn a moderate weight for some non-informative features.
        • Surprisingly, this can happen when a non-informative feature happens to be correlated with the label. In this case, the model incorrectly gives such non-informative features some of the "credit" that should have gone to informative features.  
      • L2 regularization will encourage many of the non-informative weights to be nearly (but not exactly) 0.0.
      • L2 regularization will force the features towards roughly equivalent weights that are approximately half of what they would have been had only one of the two features been in the model.
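      • A sketch of the regularized objective minimize(Loss(Data|Model) + λ complexity(Model)) with squared loss; the weights, predictions, and labels are made up:

        import numpy as np

        def l2_regularized_loss(weights, predictions, labels, lam):
            data_loss = np.mean((labels - predictions) ** 2)  # loss term (MSE)
            complexity = np.sum(weights ** 2)                 # squared L2 norm
            return data_loss + lam * complexity

        weights = np.array([0.2, 0.5, 5.0, 1.0, 0.25, 0.3])
        predictions = np.array([1.0, 0.0])
        labels = np.array([0.9, 0.2])
        # The w = 5.0 outlier dominates the complexity term (sum ≈ 26.44).
        print(l2_regularized_loss(weights, predictions, labels, lam=0.1))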

    Logistic Regression

    • Instead of predicting exactly 0 or 1, logistic regression generates a probability—a value between 0 and 1, exclusive.
    • Many problems require a probability estimate as output
    • Enter Logistic Regression
    • Handy because the probability estimates are calibrated
      • for example, p(house will sell) * price = expected outcome
    • Also useful for when we need a binary classification
      • spam or not spam? → p(Spam)
    • Predictions
      • y′ = 1 / (1 + e^(−(wᵀx + b)))
      • Where:
        • wᵀx + b: provides the familiar linear model
        • 1 / (1 + e^(−...)): squishes the output through a sigmoid
    Graph of logistic-regression equation
    • LogLoss Defined
      • LogLoss = ∑_{(x,y)∈D} −y log(y′) − (1 − y) log(1 − y′)

    Two graphs of Log Loss vs. predicted value: one for a target value of 0.0 (which arcs up and to the right) and one for a target value of 1.0 (which arcs down and to the left) 

    • Logistic Regression and Regularization
      • Regularization is super important for logistic regression.
        • Remember the asymptotes
        • It'll keep trying to drive loss to 0 in high dimensions
      • Two strategies are especially useful:
        • L2 regularization (aka L2 weight decay) - penalizes huge weights.
        • Early stopping - limiting training steps or learning rate.
    • Linear Logistic Regression    
      • Linear logistic regression is extremely efficient.
        • Very fast training and prediction times.
        • Short / wide models use a lot of RAM.
    • Calculating a Probability    
      • Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:
        • "As is"
        • Converted to a binary category.
      • In many cases, you'll map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels (e.g., "spam" or "not spam"). A later module focuses on that.  
      • If z represents the output of the linear layer of a model trained with logistic regression, then sigmoid(z) will yield a value (a probability) between 0 and 1. In mathematical terms:
        • y′ = 1 / (1 + e^(−z))
      • where:
        • y' is the output of the logistic regression model for a particular example.
        • z is b + w1x1 + w2x2 + ... + wNxN
          • The w values are the model's learned weights, and b is the bias.
          • The x values are the feature values for a particular example.
        • Note that z is also referred to as the log-odds because the inverse of the sigmoid states that z can be defined as the log of the probability of the "1" label (e.g., "dog barks") divided by the probability of the "0" label (e.g., "dog doesn't bark"):
          • z = log( y / (1−y) )
          • Here is the sigmoid function with ML labels:

    The Sigmoid function with the x-axis labeled as the sum of all the weights and features (plus the bias); the y-axis is labeled Probability Output.

    Figure 2: Logistic regression output.
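
      • A sketch of the computation above; the weights, bias, and features are made up:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        b, w, x = 1.0, np.array([2.0, -1.0]), np.array([0.5, 3.0])
        z = b + np.dot(w, x)  # log-odds: 1 + 2*0.5 - 1*3 = -1.0
        print(sigmoid(z))     # ≈ 0.269, i.e., a 26.9% probability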

    • Loss and Regularization
      • Loss function for Logistic Regression
        • The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:
          • Log Loss = ∑_{(x,y)∈D} −y log(y′) − (1 − y) log(1 − y′)
        • where:
          • (x,y)∈D is the data set containing many labeled examples, which are (x,y) pairs.
          • y is the label in a labeled example. Since this is logistic regression, every value of y must either be 0 or 1.
          • y′ is the predicted value (somewhere between 0 and 1), given the set of features in x.
        • The equation for Log Loss is closely related to Shannon's Entropy measure from Information Theory. It is also the negative logarithm of the likelihood function, assuming a Bernoulli distribution of y. Indeed, minimizing the loss function yields a maximum likelihood estimate.
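        • A sketch of the Log Loss sum on made-up labels and predictions:

          import numpy as np

          y = np.array([1, 0, 1, 1])               # labels, each exactly 0 or 1
          y_pred = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities

          log_loss = np.sum(-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred))
          print(log_loss)  # ≈ 1.20; perfectly confident correct predictions give 0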
      • Regularization in Logistic Regression
        • Regularization is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions. Consequently, most logistic regression models use one of the following two strategies to dampen model complexity:
          • L2 regularization.
          • Early stopping, that is, limiting the number of training steps or the learning rate.
        • (We'll discuss a third strategy—L1 regularization—in a later module.)
      • Summary
        • Logistic regression models generate probabilities.
        • Log Loss is the loss function for logistic regression.
        • Logistic regression is widely used by many practitioners.

    Classification

    Thresholding

    • Logistic regression returns a probability. You can use the returned probability "as is" (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).
    • In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.
    • Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters such as learning rate. Part of choosing a threshold is assessing how much you'll suffer for making a mistake. For example, mistakenly labeling a non-spam message as spam is very bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly the end of your job.

    True vs. False and Positive vs. Negative

    • A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
    • A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.

    Accuracy

    • Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:
      • Accuracy = Number of correct predictions / Total number of predictions
    • For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:
      • Accuracy = (TP+TN) / (TP+TN+FP+FN)
      • Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
    • Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one, where there is a significant disparity between the number of positive and negative labels.

    Precision and Recall

    • Precision attempts to answer the following question:
      • What proportion of positive identifications was actually correct?
    • Precision is defined as follows:
      • Precision = TP / (TP + FP)
    • Note: A model that produces no false positives has a precision of 1.0.
    • Recall attempts to answer the following question:
      • What proportion of actual positives was identified correctly?
    • Mathematically, recall is defined as follows:
      • Recall = TP / (TP + FN)
    • Note: A model that produces no false negatives has a recall of 1.0.
    • To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa.
    • Various metrics have been developed that rely on both precision and recall. For example, see F1 score
    • In general, raising the classification threshold reduces false positives, thus raising precision.
    • Raising our classification threshold will cause the number of true positives to decrease or stay the same and will cause the number of false negatives to increase or stay the same. Thus, recall will either stay constant or decrease.
    • In general, a model that outperforms another model on both precision and recall is likely the better model. Obviously, we'll need to make sure that comparison is being done at a precision / recall point that is useful in practice for this to be meaningful. For example, suppose our spam detection model needs to have at least 90% precision to be useful and avoid unnecessary false alarms. In this case, comparing one model at {20% precision, 99% recall} to another at {15% precision, 98% recall} is not particularly instructive, as neither model meets the 90% precision requirement. But with that caveat in mind, this is a good way to think about comparing models when using precision and recall.
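    • A sketch computing these metrics from made-up confusion-matrix counts:

      tp, fp, tn, fn = 8, 2, 85, 5  # hypothetical counts

      accuracy = (tp + tn) / (tp + tn + fp + fn)  # 93 / 100 = 0.93
      precision = tp / (tp + fp)                  # 8 / 10 = 0.8
      recall = tp / (tp + fn)                     # 8 / 13 ≈ 0.615
      print(accuracy, precision, recall)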

    ROC Curve and AUC

    • An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
      • True Positive Rate
      • False Positive Rate
    • True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
      • TPR = TP / (TP + FN)
    • False Positive Rate (FPR) is defined as follows:
      • FPR = FP / (FP + TN)
    • An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.

    ROC Curve showing TP Rate vs. FP Rate at different classification thresholds.

    Figure 4. TP vs. FP rate at different classification thresholds.

    • To compute the points in an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, there's an efficient, sorting-based algorithm that can provide this information for us, called AUC.
    • AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

    AUC (Area under the ROC Curve).

    Figure 5. AUC (Area under the ROC Curve).

    • AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
    • AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
    • AUC is desirable for the following two reasons:
      • AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
      • AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
    • However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:
      • Scale invariance is not always desirable. For example, sometimes we really do need well calibrated probability outputs, and AUC won’t tell us about that.
      • Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.
    • In practice, if you have a "perfect" classifier with an AUC of 1.0, you should be suspicious, as it likely indicates a bug in your model. For example, you may have overfit to your training data, or the label data may be replicated in one of your features.
    • AUC is based on the relative predictions, so any transformation of the predictions that preserves the relative ranking has no effect on AUC. This is clearly not the case for other metrics such as squared error, log loss, or prediction bias (discussed later).
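    • A brute-force sketch of the ranking interpretation on made-up scores; real implementations use the efficient sort-based algorithm mentioned above:

      import numpy as np

      labels = np.array([0, 0, 1, 1, 0, 1])
      scores = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9])

      pos, neg = scores[labels == 1], scores[labels == 0]
      wins = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
      ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
      auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
      print(auc)  # 7 / 9 ≈ 0.778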

    Prediction Bias

    • Logistic regression predictions should be unbiased. That is:
    • "average of predictions" should ≈ "average of observations"
    • Prediction bias is a quantity that measures how far apart those two averages are. That is:
    • prediction bias = average of predictions − average of labels in data set
    • Note: "Prediction bias" is a different quantity than bias (the b in wx + b).
    • A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.

    • Possible root causes of prediction bias are:
      • Incomplete feature set
      • Noisy data set
      • Buggy pipeline
      • Biased training sample
      • Overly strong regularization
    • You might be tempted to correct prediction bias by post-processing the learned model—that is, by adding a calibration layer that adjusts your model's output to reduce the prediction bias.
    • However, adding a calibration layer is a bad idea for the following reasons:
      • You're fixing the symptom rather than the cause.
      • You've built a more brittle system that you must now keep up to date.
    • If possible, avoid calibration layers. Projects that use calibration layers tend to become reliant on them—using calibration layers to fix all their model's sins. Ultimately, maintaining the calibration layers can become a nightmare.
    • Note: A good model will usually have near-zero bias. That said, a low prediction bias does not prove that your model is good. A really terrible model could have a zero prediction bias. For example, a model that just predicts the mean value for all examples would be a bad model, despite having zero bias.
    • Logistic regression predicts a value between 0 and 1. However, all labeled examples are either exactly 0 (meaning, for example, "not spam") or exactly 1 (meaning, for example, "spam"). Therefore, when examining prediction bias, you cannot accurately determine the prediction bias based on only one example; you must examine the prediction bias on a "bucket" of examples. That is, prediction bias for logistic regression only makes sense when grouping enough examples together to be able to compare a predicted value (for example, 0.392) to observed values (for example, 0.394).
    • You can form buckets in the following ways:
      • Linearly breaking up the target predictions.
      • Forming quantiles.
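    • A sketch of measuring prediction bias overall and per quantile bucket; the predictions and labels are made up:

      import numpy as np

      preds = np.array([0.9, 0.8, 0.3, 0.4, 0.1, 0.7, 0.2, 0.6])
      labels = np.array([1, 1, 0, 0, 0, 1, 0, 0])

      print(preds.mean() - labels.mean())  # overall bias: 0.5 - 0.375 = 0.125

      # Bucket by quantiles of the predictions, then compare within buckets.
      order = np.argsort(preds)
      for bucket in np.array_split(order, 2):
          print(preds[bucket].mean() - labels[bucket].mean())
      # low bucket: 0.25 - 0.0 = 0.25 (miscalibrated); high bucket: 0.0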

    Regularization for Sparsity

    • This module focuses on the special requirements for models learned on feature vectors that have many dimensions.
      • Let's Go Back to Feature Crosses
        • Caveat: Sparse feature crosses may significantly increase feature space
        • Possible issues:
          • Model size (RAM) may become huge
          • "Noise" coefficients (causes overfitting)
      • L1 Regularization
        • Would like to penalize L0 norm of weights
          • Non-convex optimization; NP hard
        • Relax to L1 regularization:
          • Penalize sum of abs(weights)
          • Convex problem
          • Encourage sparsity unlike L2
    • L1 Regularization
      • Sparse vectors often contain many dimensions. Creating a feature cross results in even more dimensions. Given such high-dimensional feature vectors, model size may become huge and require huge amounts of RAM.
      • In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.
      • L2 regularization encourages weights to be small, but doesn't force them to exactly 0.0.
      • An alternative idea would be to create a regularization term that penalizes the count of non-zero coefficient values in a model. Increasing this count would only be justified if there was a sufficient gain in the model's ability to fit the data. Unfortunately, while this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem that's NP-hard. (If you squint, you can see a connection to the knapsack problem.) So this idea, known as L0 regularization, isn't something we can use effectively in practice.
      • However, there is a regularization term called L1 regularization that serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. So we can use L1 regularization to encourage many of the uninformative coefficients in our model to be exactly 0, and thus reap RAM savings at inference time.
      • L1 vs L2 regularization.
        • L2 and L1 penalize weights differently:
          • L2 penalizes weight^2.
          • L1 penalizes |weight|.
        • Consequently, L2 and L1 have different derivatives:
          • The derivative of L2 is 2 * weight.
          • The derivative of L1 is k (a constant, whose value is independent of weight).
        • You can think of the derivative of L2 as a force that removes x% of the weight every time. As Zeno knew, even if you remove x percent of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.
        • You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For example, if subtraction would have forced a weight from +0.1 to −0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight. (A numeric sketch of both update rules follows this list.)
        • L1 regularization—penalizing the absolute value of all the weights—turns out to be quite efficient for wide models.
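        • A toy numeric sketch (mine, not from the course) of one regularization update step for each penalty, showing why L2 shrinks weights proportionally while L1 subtracts a constant and clips at zero:

          def l2_step(weight, lam, lr):
              # d/dw (lam * w^2) = 2 * lam * w  -> removes a fixed *fraction*.
              return weight - lr * 2 * lam * weight

          def l1_step(weight, lam, lr):
              # d/dw (lam * |w|) = lam * sign(w) -> subtracts a fixed *amount*;
              # if the step would cross zero, clip the weight to exactly 0.
              step = lr * lam * (1 if weight > 0 else -1)
              return 0.0 if abs(step) >= abs(weight) else weight - step

          print(l2_step(0.1, lam=1.0, lr=0.1))  # 0.08 -- smaller, but never exactly 0
          print(l1_step(0.1, lam=1.0, lr=0.2))  # 0.0  -- crossed zero, so zeroed out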
    • L1 regularization may cause informative features to get a weight of exactly 0.0.
      • L1 regularization may cause the following kinds of features to be given weights of exactly 0:
      • Weakly informative features.
      • Strongly informative features on different scales.
      • Informative features strongly correlated with other similarly informative features.
    • L1 regularization will encourage most of the non-informative weights to be exactly 0.0.
      • In general, L1 regularization with a sufficiently large lambda tends to drive the weights of non-informative features to exactly 0.0. Unlike L2 regularization, L1 regularization "pushes" just as hard toward 0.0 no matter how far the weight is from 0.0.
    • Which type of regularization will produce the smaller model?
      • L1 regularization tends to reduce the number of features. In other words, L1 regularization often reduces the model size.
      • L2 regularization rarely reduces the number of features. In other words, L2 regularization rarely reduces the model size.
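    • A quick empirical sketch of that size difference (scikit-learn is my assumption here; the course itself uses TensorFlow). Lasso applies an L1 penalty and Ridge an L2 penalty, on synthetic data with only 10 truly informative features out of 100:

      import numpy as np
      from sklearn.datasets import make_regression
      from sklearn.linear_model import Lasso, Ridge

      X, y = make_regression(n_samples=200, n_features=100,
                             n_informative=10, noise=5.0, random_state=0)

      lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
      ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

      print("L1 nonzero weights:", np.sum(lasso.coef_ != 0))  # few (roughly the informative ones)
      print("L2 nonzero weights:", np.sum(ridge.coef_ != 0))  # all 100 remain nonzero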

    Neural Networks

    • Neural networks are a more sophisticated version of feature crosses. In essence, neural networks learn the appropriate feature crosses for you.
      • A Linear Model

    Three blue circles in a row connected by arrows to a green circle above them

      • Add Complexity: Non-Linear?

    Three blue circles in a row labeled "Input" connected by arrows to a row of yellow circles labeled "Hidden Layer" above them, which are in turn connected to a green circle labeled "Output" at the top.

      • More Complex: Non-Linear?

    Three blue circles in a row labeled "Input" connected by arrows to a row of yellow circles labeled "Hidden Layer" above them, which are connected by arrows to a second "Hidden Layer" row of yellow circles, which are in turn connected to a green circle labeled "Output" at the top.

      • Adding a Non-Linearity

    The same as the previous figure, except that a row of pink circles labeled 'Non-Linear Transformation Layer' has been added in between the two hidden layers.

      • Our Favorite Non-Linearity

    A graph that has a slope of 0 for negative x and then becomes linear with slope 1 once x passes 0 (the ReLU function)

      • Neural Nets Can Be Arbitrarily Complex

    A complex neural network

    • Structure 
      • "Nonlinear" means that you can't accurately predict a label with a model of the form b+w1x1+w2x2 In other words, the "decision surface" is not a line.
      • Hidden Layers
        • In the model represented by the following graph, we've added a "hidden layer" of intermediary values. Each yellow node in the hidden layer is a weighted sum of the blue input node values. The output is a weighted sum of the yellow nodes.
      • Activation Functions
        • To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe each hidden layer node through a nonlinear function.
        • In the model represented by the following graph, the value of each node in Hidden Layer 1 is transformed by a nonlinear function before being passed on to the weighted sums of the next layer. This nonlinear function is called the activation function.
      • Common Activation Functions
        • The following sigmoid activation function converts the weighted sum to a value between 0 and 1.
          • F(x) = 1 / (1 + e^(−x))

    Sigmoid function

    Figure 7. Sigmoid activation function.

        • The following rectified linear unit activation function (or ReLU, for short) often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute.
          • F(x) = max( 0, x )
        • The superiority of ReLU is based on empirical findings, probably driven by ReLU having a more useful range of responsiveness. A sigmoid's responsiveness falls off relatively quickly on both sides.

    ReLU activation function

    Figure 8. ReLU activation function.

        • In fact, any mathematical function can serve as an activation function. Suppose that σ represents our activation function (ReLU, sigmoid, or whatever). Consequently, the value of a node in the network is given by the following formula:
          • σ( w ⋅ x + b )
        • TensorFlow provides out-of-the-box support for a wide variety of activation functions. That said, we still recommend starting with ReLU.
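        • A minimal TensorFlow/Keras sketch of such a network: two hidden layers with ReLU activations and a sigmoid output. The layer sizes here are arbitrary illustrations, not course values.

          import tensorflow as tf

          model = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(3,)),               # three input features
              tf.keras.layers.Dense(4, activation="relu"),     # Hidden Layer 1
              tf.keras.layers.Dense(4, activation="relu"),     # Hidden Layer 2
              tf.keras.layers.Dense(1, activation="sigmoid"),  # output in (0, 1)
          ])
          model.compile(optimizer="adam", loss="binary_crossentropy")
          model.summary()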
      • Summary
        • Now our model has all the standard components of what people usually mean when they say "neural network":
          • A set of nodes, analogous to neurons, organized in layers.
          • A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
          • A set of biases, one for each node.
          • An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.
      • A caveat: neural networks aren't necessarily always better than feature crosses, but neural networks do offer a flexible alternative that works well in many cases.

    Training Neural Networks

    • Backpropagation is the most common training algorithm for neural networks. It makes gradient descent feasible for multi-layer neural networks. TensorFlow handles backpropagation automatically, so you don't need a deep understanding of the algorithm. To get a sense of how it works, walk through the following: Backpropagation algorithm visual explanation. As you scroll through the preceding explanation, note the following:
      • How data flows through the graph.
      • How dynamic programming lets us avoid computing exponentially many paths through the graph. Here "dynamic programming" just means recording intermediate results on the forward and backward passes.
    • Backprop: What You Need To Know
      • Gradients are important
        • If it's differentiable, we can probably learn on it
      • Gradients can vanish
        • Each additional layer can successively reduce signal vs. noise
        • ReLUs are useful here
      • Gradients can explode
        • Learning rates are important here
        • Batch normalization (useful knob) can help
      • ReLU layers can die
        • Keep calm and lower your learning rates
    • Normalizing Feature Values
      • We'd like our features to have reasonable scales
        • Roughly zero-centered, [-1, 1] range often works well
        • Helps gradient descent converge; avoids the NaN trap
        • Avoiding outlier values can also help
      • Can use a few standard methods (see the sketch after this list):
        • Linear scaling
        • Hard cap (clipping) to max, min
        • Log scaling
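      • A small sketch (mine) of the three methods applied to a hypothetical long-tailed feature:

        import numpy as np

        x = np.array([1.0, 3.0, 10.0, 250.0, 10000.0])

        # Linear scaling to roughly [-1, 1].
        linear = 2 * (x - x.min()) / (x.max() - x.min()) - 1

        # Hard cap (clipping) to a chosen [min, max] range.
        clipped = np.clip(x, 0.0, 100.0)

        # Log scaling, which compresses long-tailed distributions.
        logged = np.log1p(x)

        print(linear, clipped, logged, sep="\n")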
    • Dropout Regularization
      • Dropout: Another form of regularization, useful for NNs
      • Works by randomly "dropping out" units in a network for a single gradient step
        • There's a connection to ensemble models here
      • The more you drop out, the stronger the regularization
        • 0.0 = no dropout regularization
        • 1.0 = drop everything out! The model learns nothing
        • Intermediate values more useful
    • Best Practices
      • Failure Cases
        • Vanishing Gradients
          • The gradients for the lower layers (closer to the input) can become very small. In deep networks, computing these gradients can involve taking the product of many small terms.
          • When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all.
          • The ReLU activation function can help prevent vanishing gradients.
        • Exploding Gradients
          • If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large to converge.
          • Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
        • Dead ReLU Units
          • Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0.
          • Lowering the learning rate can help keep ReLU units from dying.
      • Dropout Regularization
        • Yet another form of regularization, called Dropout, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization (a Keras sketch follows this list):
          • 0.0 = No dropout regularization.
          • 1.0 = Drop out everything. The model learns nothing.
          • Values between 0.0 and 1.0 = More useful.
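        • A hedged Keras sketch of dropout between layers; the rate of 0.2 (drop 20% of unit activations per gradient step) is an illustrative value, not a course recommendation:

          import tensorflow as tf

          model = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(20,)),
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dropout(0.2),  # stronger regularization at higher rates
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dropout(0.2),
              tf.keras.layers.Dense(1, activation="sigmoid"),
          ])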

    Multi-Class Neural Networks

    • In this module, we'll investigate multi-class classification, which can pick from multiple possibilities.
    • Some real-world multi-class problems entail choosing from millions of separate classes.
    • More than two classes?
      • Logistic regression gives useful probabilities for binary-class problems.
        • spam / not-spam
        • click / not-click
      • What about multi-class problems?
        • apple, banana, car, cardiologist, ..., walk sign, zebra, zoo
        • red, orange, yellow, green, blue, indigo, violet
        • animal, vegetable, mineral
    • One-Vs-All Multi-Class
      • Create a unique output for each possible class
      • Train that on a signal of "my class" vs "all other classes"
      • Can do in a deep network, or with separate models

    A neural network with five hidden layers and five output layers.

    • SoftMax Multi-Class
      • Add an additional constraint: Require output of all one-vs-all nodes to sum to 1.0
      • The additional constraint helps training converge quickly
      • Plus, allows outputs to be interpreted as probabilities

    A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.

    • What to use When?
      • Multi-Class, Single-Label Classification:
        • An example may be a member of only one class.
        • Constraint that classes are mutually exclusive is helpful structure.
        • Useful to encode this in the loss.
        • Use one softmax loss for all possible classes.
      • Multi-Class, Multi-Label Classification:
        • An example may be a member of more than one class.
        • No additional constraints on class membership to exploit.
        • One logistic regression loss for each possible class.
    • SoftMax Options
      • Full SoftMax
        • Brute force; calculates for all classes.
      • Candidate Sampling
        • Calculates for all the positive labels, but only for a random sample of negatives.
    • One vs. All
      • One vs. all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question.
      • This approach is fairly reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises.
      • We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following figure suggests this approach; a code sketch follows the figure:

    A neural network with five hidden layers and five output layers.

    Figure 1. A one-vs.-all neural network.
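      • A sketch of the "separate models" variant using scikit-learn's LogisticRegression as the base binary classifier (my choice of library, not the course's; the data and class labels are synthetic):

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=500, n_features=10,
                                   n_informative=6, n_classes=4, random_state=0)

        # Train one "my class vs. all other classes" model per class.
        models = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
                  for c in np.unique(y)}

        # Predict the class whose binary model is most confident.
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
        predicted = np.array(list(models))[scores.argmax(axis=1)]
        print("training accuracy:", (predicted == y).mean())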

    • Softmax
      • Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.
      • Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.

    A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.

    Figure 2. A Softmax layer within a neural network.

      • The Softmax equation is as follows:
        • p(y = j | x) = e^(wjᵀx + bj) / ∑k∈K e^(wkᵀx + bk)
        • Note that this formula basically extends the formula for logistic regression into multiple classes.
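        • A direct NumPy sketch of that equation, with the standard max-subtraction trick for numerical stability (the logits here stand in for wkᵀx + bk):

          import numpy as np

          def softmax(logits):
              shifted = logits - np.max(logits)  # avoid overflow in exp
              exps = np.exp(shifted)
              return exps / exps.sum()

          print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1.0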
      • Softmax Options
        • Consider the following variants of Softmax:
          • Full Softmax is the Softmax we've been discussing; that is, Softmax calculates a probability for every possible class.
          • Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.
        • Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
      • One Label vs. Many Labels
        • Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes.
        • For example, suppose your examples are images containing exactly one item—a piece of fruit. Softmax can determine the likelihood of that one item being a pear, an orange, an apple, and so on. If your examples are images containing all sorts of things—bowls of different kinds of fruit—then you'll have to use multiple logistic regressions instead.
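        • A hedged Keras sketch of the two output heads: softmax with a single cross-entropy loss for single-label problems, versus per-class sigmoids with binary cross-entropy for multi-label problems. NUM_CLASSES and the layer sizes are placeholders.

          import tensorflow as tf

          NUM_CLASSES = 5

          # Multi-class, single-label: exactly one class per example.
          single_label = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(32,)),
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
          ])
          single_label.compile(optimizer="adam", loss="categorical_crossentropy")

          # Multi-class, multi-label: each class is an independent yes/no.
          multi_label = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(32,)),
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
          ])
          multi_label.compile(optimizer="adam", loss="binary_crossentropy")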

    Embeddings

    • An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models. 
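    • A minimal Keras sketch of learning an embedding: sparse word IDs are mapped into a small dense space that later layers consume. VOCAB_SIZE and EMBED_DIM are illustrative placeholders.

      import tensorflow as tf

      VOCAB_SIZE = 10_000  # number of distinct input IDs (e.g., words)
      EMBED_DIM = 16       # dimensionality of the learned embedding space

      embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
      model = tf.keras.Sequential([
          tf.keras.layers.Input(shape=(None,), dtype="int32"),  # sequence of IDs
          embedding,
          tf.keras.layers.GlobalAveragePooling1D(),
          tf.keras.layers.Dense(1, activation="sigmoid"),
      ])
      # After training, embedding.get_weights()[0] is a (VOCAB_SIZE, EMBED_DIM)
      # matrix holding one learned vector per ID, reusable in other models.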

    ML Engineering


    Fairness

    • Evaluating a machine learning model responsibly requires doing more than just calculating loss metrics. Before putting a model into production, it's critical to audit training data and evaluate predictions for bias.
    • This module looks at different types of human biases that can manifest in training data. It then provides strategies to identify them and evaluate their effects.
    • What do you see?
      • Bananas
      • Stickers
      • Bananas on shelves
    A bunch of bananas
      • Green Bananas
      • Unripe Bananas
    A bunch of green bananas
      • Overripe Bananas
      • Good for Banana Bread
    A bunch of brown bananas
      • Yellow Bananas
      • Yellow is prototypical for bananas

    A diagram illustrating a typical machine learning workflow: collect data, then train a model, and then generate output

    Diagram illustrating two types of biases in data: human biases that manifest in data (such as out-group homogeneity bias), and human biases that affect data collection and annotation (such as confirmation bias)

    • Designing for Fairness
    1. Consider the problem
    2. Ask experts
    3. Train the models to account for bias
    4. Interpret outcomes
    5. Publish with context
    • Types of Bias
      • Machine learning models are not inherently objective. Engineers train models by feeding them a data set of training examples, and human involvement in the provision and curation of this data can make a model's predictions susceptible to bias.
      • When building models, it's important to be aware of common human biases that can manifest in your data, so you can take proactive steps to mitigate their effects.
      • WARNING: The following inventory of biases provides just a small selection of biases that are often uncovered in machine learning data sets; this list is not intended to be exhaustive. Wikipedia's catalog of cognitive biases enumerates over 100 different types of human bias that can affect our judgment. When auditing your data, you should be on the lookout for any and all potential sources of bias that might skew your model's predictions.
      • Reporting Bias
        • Reporting bias occurs when the frequency of events, properties, and/or outcomes captured in a data set does not accurately reflect their real-world frequency. This bias can arise because people tend to focus on documenting circumstances that are unusual or especially memorable, assuming that the ordinary can "go without saying."
          • EXAMPLE: A sentiment-analysis model is trained to predict whether book reviews are positive or negative based on a corpus of user submissions to a popular website. The majority of reviews in the training data set reflect extreme opinions (reviewers who either loved or hated a book), because people were less likely to submit a review of a book if they did not respond to it strongly. As a result, the model is less able to correctly predict sentiment of reviews that use more subtle language to describe a book.
      • Automation Bias
        • Automation bias is a tendency to favor results generated by automated systems over those generated by non-automated systems, irrespective of the error rates of each.
          • EXAMPLE: Software engineers working for a sprocket manufacturer were eager to deploy the new "groundbreaking" model they trained to identify tooth defects, until the factory supervisor pointed out that the model's precision and recall rates were both 15% lower than those of human inspectors.
      • Selection Bias
        • Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their real-world distribution. Selection bias can take many different forms:
        • Coverage bias: Data is not selected in a representative fashion.
          • EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product. Consumers who instead opted to buy a competing product were not surveyed, and as a result, this group of people was not represented in the training data.
        • Non-response bias (or participation bias): Data ends up being unrepresentative due to participation gaps in the data-collection process.
          • EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Consumers who bought the competing product were 80% more likely to refuse to complete the survey, and their data was underrepresented in the sample.
        • Sampling bias: Proper randomization is not used during data collection.
          • EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Instead of randomly targeting consumers, the surveyor chose the first 200 consumers that responded to an email, who might have been more enthusiastic about the product than average purchasers.
      • Group Attribution Bias
        • Group attribution bias is a tendency to generalize what is true of individuals to an entire group to which they belong. Two key manifestations of this bias are:
        • In-group bias: A preference for members of a group to which you also belong, or for characteristics that you also share.
          • EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that applicants who attended the same computer-science academy as they both did are more qualified for the role.
        • Out-group homogeneity bias: A tendency to stereotype individual members of a group to which you do not belong, or to see their characteristics as more uniform.
          • EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that all applicants who did not attend a computer-science academy do not have sufficient expertise for the role.
      • Implicit Bias
        • Implicit bias occurs when assumptions are made based on one's own mental models and personal experiences that do not necessarily apply more generally.
          • EXAMPLE: An engineer training a gesture-recognition model uses a head shake as a feature to indicate a person is communicating the word "no." However, in some regions of the world, a head shake actually signifies "yes."
        • A common form of implicit bias is confirmation bias, where model builders unconsciously process data in ways that affirm preexisting beliefs and hypotheses. In some cases, a model builder may actually keep training a model until it produces a result that aligns with their original hypothesis; this is called experimenter's bias.
          • EXAMPLE: An engineer is building a model that predicts aggressiveness in dogs based on a variety of features (height, weight, breed, environment). The engineer had an unpleasant encounter with a hyperactive toy poodle as a child, and ever since has associated the breed with aggression. When the trained model predicted most toy poodles to be relatively docile, the engineer retrained the model several more times until it produced a result showing smaller poodles to be more violent.
    • Identifying Bias
      • As you explore your data to determine how best to represent it in your model, it's important to also keep issues of fairness in mind and proactively audit for potential sources of bias.
      • Missing Feature Values
        • If your data set has one or more features that have missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.
        • Before training a model on this data, it would be prudent to investigate the cause of these missing values, to ensure that no latent biases are responsible for them (a quick audit sketch follows).
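        • A quick pandas sketch of that audit; the feature names and values here are hypothetical:

          import numpy as np
          import pandas as pd

          df = pd.DataFrame({
              "median_income": [3.2, np.nan, 5.1, np.nan, 2.8],
              "population":    [1200, 900, np.nan, 1500, 1100],
              "rooms":         [5, 6, 4, 7, 5],
          })
          # Fraction of examples missing each feature, largest first.
          print(df.isnull().mean().sort_values(ascending=False))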
      • Unexpected Feature Values
        • When exploring data, you should also look for examples that contain feature values that stand out as especially uncharacteristic or unusual. These unexpected feature values could indicate problems that occurred during data collection or other inaccuracies that could introduce bias.
      • Data Skew
        • Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.
    • Evaluating for Bias
      • When evaluating a model, metrics calculated against an entire test or validation set don't always give an accurate picture of how fair the model is.
      • Consider a new model developed to predict the presence of tumors that is evaluated against a validation set of 1,000 patients' medical records. 500 records are from female patients, and 500 records are from male patients.
      • When we calculate metrics separately for female and male patients (as in the sketch at the end of this section), we see stark differences in model performance for each group.
      • We now have a much better understanding of the biases inherent in the model's predictions, as well as the risks to each subgroup if the model were to be released for medical use in the general population.
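      • A sketch of that per-subgroup evaluation (toy data and hypothetical column names, not the course's actual records):

        import pandas as pd

        df = pd.DataFrame({
            "sex":   ["F"] * 4 + ["M"] * 4,
            "label": [1, 0, 1, 0, 1, 0, 1, 0],  # 1 = tumor present
            "pred":  [1, 0, 0, 0, 1, 1, 1, 0],  # model's binary prediction
        })

        for group, sub in df.groupby("sex"):
            tp = ((sub.pred == 1) & (sub.label == 1)).sum()
            fp = ((sub.pred == 1) & (sub.label == 0)).sum()
            fn = ((sub.pred == 0) & (sub.label == 1)).sum()
            precision = tp / (tp + fp) if tp + fp else float("nan")
            recall = tp / (tp + fn) if tp + fn else float("nan")
            print(f"{group}: precision={precision:.2f} recall={recall:.2f}")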
      • Additional Fairness Resources
        • Fairness is a relatively new subfield within the discipline of machine learning. To learn more about research and initiatives devoted to developing new tools and techniques for identifying and mitigating bias in machine learning models, check out Google's Machine Learning Fairness resources page.