  • Study notes: Machine Learning Crash Course | Google Developers

    Machine Learning Crash Course  |  Google Developers

    • https://developers.google.com/machine-learning/crash-course/
    • Google's fast-paced, practical introduction to machine learning

    ML Concepts


    Introduction to Machine Learning

    • As you'll discover, machine learning requires a different mindset than other programming problems.
      • For example, real-world machine learning focuses far more on data analysis than on coding.

    Framing

    • Key ML Terminology
      • Example is a particular instance of data, x
      • Labeled example has {features, label}: (x, y)
        • Used to train the model
      • Unlabeled example has {features, ?}: (x, ?)
        • Used for making predictions on new data 
      • Model maps examples to predicted labels: y'
        • Defined by internal parameters, which are learned
      • A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:
        • What is the value of a house in California?
        • What is the probability that a user will click on this ad?
      • A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:
        • Is a given email message spam or not spam?
        • Is this an image of a dog, a cat, or a hamster?
      • The labels applied to some examples might be unreliable.
        • Definitely. It's important to check how reliable your data is. The labels for this dataset probably come from email users who mark particular email messages as spam. Since most users do not mark every suspicious email message as spam, we may have trouble knowing whether an email is spam. Furthermore, spammers could intentionally poison our model by providing faulty labels.
      • "Shoe beauty" is not a useful feature.
        • Good features are concrete and quantifiable. Beauty is too vague a concept to serve as a useful feature. Beauty is probably a blend of certain concrete features, such as style and color. Style and color would each be better features than beauty.

    Descending into ML

    • Linear regression is a method for finding the straight line or hyperplane that best fits a set of points. This module explores linear regression intuitively before laying the groundwork for a machine learning approach to linear regression.
    • Linear Regression
      • By convention in machine learning, you'll write the equation for a model slightly differently:
        • y′ = b + w1x1
        • where:
          • y′ is the predicted label (a desired output).
          • b is the bias (the y-intercept), sometimes referred to as w0.
          • w1 is the weight of feature 1. Weight is the same concept as the "slope" m in the traditional equation of a line.
          • x1 is a feature (a known input).
      • To infer (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the x1 value into this model.
      • Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might look as follows (see the code sketch below):
        • y′ = b + w1x1 + w2x2 + w3x3
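      • A minimal sketch of the three-feature model in plain Python; the weights, bias, and feature values below are made up for illustration:

        def predict(features, weights, bias):
            """Return y' = b + w1*x1 + w2*x2 + ... for one example."""
            return bias + sum(w * x for w, x in zip(weights, features))

        # Hypothetical learned parameters and one example's feature values.
        print(predict(features=[1.0, 2.0, 3.0], weights=[0.5, -0.2, 0.1], bias=4.0))
        # 4.0 + 0.5*1.0 - 0.2*2.0 + 0.1*3.0 = 4.4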
    • Training and Loss
      • Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.
      • Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
      • The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:
        • = the square of the difference between the label and the prediction
        • = (observation − prediction(x))²
        • = (y − y')²
      • Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
        • MSE = (1/N) ∑_{(x,y)∈D} (y − prediction(x))²
        • where:
          • (x,y) is an example in which
          • x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
          • y is the example's label (for example, temperature).
          • prediction(x) is a function of the weights and bias in combination with the set of features x.
          • D is a data set containing many labeled examples, which are (x,y) pairs.
          • N is the number of examples in D.
      • Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
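      • A short sketch of the MSE computation above, on made-up labels and predictions:

        import numpy as np

        labels = np.array([1.0, 2.0, 3.0])
        predictions = np.array([0.5, 2.0, 3.5])

        # Average of the squared differences over all N examples.
        mse = np.mean((labels - predictions) ** 2)
        print(mse)  # (0.25 + 0.0 + 0.25) / 3 ≈ 0.1667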

    Reducing Loss

    • To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.
    • How do we reduce loss?
      • Hyperparameters are the configuration settings used to tune how the model is trained.
      • Derivative of (y − y')² with respect to the weights and biases tells us how loss changes for a given example

        • Simple to compute and convex

      • So we repeatedly take small steps in the direction that minimizes loss

        • We call these Gradient Steps (But they're really negative Gradient Steps)

        • This strategy is called Gradient Descent

    • Weight Initialization
      • For convex problems, weights can start anywhere (say, all 0s)
        • Convex: think of a bowl shape
        • Just one minimum
      • Foreshadowing: not true for neural nets
        • Non-convex: think of an egg crate
        • More than one minimum
        • Strong dependency on initial values
    • SGD & Mini-Batch Gradient Descent
      • Could compute gradient over entire data set on each step, but this turns out to be unnecessary
      • Computing gradient on small data samples works well
        • On every step, get a new random sample
      • Stochastic Gradient Descent: one example at a time
      • Mini-Batch Gradient Descent: batches of 10-1000
        • Loss & gradients are averaged over the batch
    • An Iterative Approach
      • Key Point: A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.
    • Gradient Descent
      • Note: When performing gradient descent, we generalize the above process to tune all the model parameters simultaneously. For example, to find the optimal values of both w1 and the bias b, we calculate the gradients with respect to both w1 and b. Next, we modify the values of w1 and b based on their respective gradients. Then we repeat these steps until we reach minimum loss.
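      • A sketch of this loop for the one-feature model y′ = b + w1x1 with squared loss; the data and learning rate are made up, and the underlying relationship here is y = 1 + 2x:

        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0])
        y = np.array([3.0, 5.0, 7.0, 9.0])

        w1, b = 0.0, 0.0          # initial guesses
        learning_rate = 0.05

        for step in range(1000):
            error = (b + w1 * x) - y
            grad_w1 = 2 * np.mean(error * x)   # d(MSE)/d(w1)
            grad_b = 2 * np.mean(error)        # d(MSE)/d(b)
            w1 -= learning_rate * grad_w1      # step along the negative gradient
            b -= learning_rate * grad_b

        print(w1, b)  # approaches 2.0 and 1.0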
    • Learning Rate
      • As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
      • Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long.
      • Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong.
      • There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
    • Optimizing Learning Rate
      • NOTE: In practice, finding a "perfect" (or near-perfect) learning rate is not essential for successful model training. The goal is to find a learning rate large enough that gradient descent converges efficiently, but not so large that it never converges.
    • Stochastic Gradient Descent
      • In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.
      • By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme--it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
      • Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
      • When performing gradient descent on a large data set, which of the following batch sizes will likely be more efficient?
      • A small batch or even a batch of one example (SGD).
      • Amazingly enough, performing gradient descent on a small batch or even a batch of one example is usually more efficient than the full batch. After all, finding the gradient of one example is far cheaper than finding the gradient of millions of examples. To ensure a good representative sample, the algorithm scoops up another random small batch (or batch of one) on every iteration.
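      • A mini-batch variant of the earlier gradient-descent sketch; the batch size and data are illustrative. Note the fresh random sample on every step:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.uniform(0, 10, size=10_000)
        y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)  # noisy y = 1 + 2x

        w1, b = 0.0, 0.0
        learning_rate, batch_size = 0.01, 32

        for step in range(2000):
            idx = rng.integers(0, x.size, size=batch_size)  # new random mini-batch
            error = (b + w1 * x[idx]) - y[idx]
            w1 -= learning_rate * 2 * np.mean(error * x[idx])
            b -= learning_rate * 2 * np.mean(error)

        print(w1, b)  # noisy, but close to 2.0 and 1.0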

    First Steps with TensorFlow

    • TensorFlow API Hierarchy
    Hierarchy of TensorFlow toolkits. Estimator API is at the top.
    • A Quick Look at the tf.estimator API
     import tensorflow as tf

     # Set up a linear classifier.
     classifier = tf.estimator.LinearClassifier(feature_columns)

     # Train the model on some example data.
     classifier.train(input_fn=train_input_fn, steps=2000)

     # Use it to predict.
     predictions = classifier.predict(input_fn=predict_input_fn)
    • Toolkit
      • Tensorflow is a computational framework for building machine learning models. TensorFlow provides a variety of different toolkits that allow you to construct models at your preferred level of abstraction. You can use lower-level APIs to build models by defining a series of mathematical operations. Alternatively, you can use higher-level APIs (like tf.estimator) to specify predefined architectures, such as linear regressors or neural networks.
      • The following table summarizes the purposes of the different layers:
    Toolkit(s)                        Description
    Estimator (tf.estimator)          High-level, OOP API.
    tf.layers/tf.losses/tf.metrics    Libraries for common model components.
    TensorFlow                        Lower-level APIs
      • TensorFlow consists of the following two components:
        • a graph protocol buffer
        • a runtime that executes the (distributed) graph
      • These two components are analogous to Python code and the Python interpreter. Just as the Python interpreter is implemented on multiple hardware platforms to run Python code, TensorFlow can run the graph on multiple hardware platforms, including CPU, GPU, and TPU.
      • Which API(s) should you use? You should use the highest level of abstraction that solves the problem. The higher levels of abstraction are easier to use, but are also (by design) less flexible. We recommend you start with the highest-level API first and get everything working. If you need additional flexibility for some special modeling concerns, move one level lower. Note that each level is built using the APIs in lower levels, so dropping down the hierarchy should be reasonably straightforward.
    • tf.estimator API
      • tf.estimator is compatible with the scikit-learn API. Scikit-learn is an extremely popular open-source ML library in Python, with over 100k users, including many at Google.

    Generalization

    • Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.
    • The Big Picture

    Cycle of model, prediction, sample, discovering true distribution, more sampling

      • Goal: predict well on new data drawn from (hidden) true distribution.
      • Problem: we don't see the truth.
        • We only get to sample from it.
      • If model h fits our current sample well, how can we trust it will predict well on other new samples?
    • How Do We Know If Our Model Is Good?
      • Theoretically:
        • Interesting field: generalization theory
        • Based on ideas of measuring model simplicity / complexity
      • Intuition: formalization of Ockham's Razor principle
        • The less complex a model is, the more likely that a good empirical result is not just due to the peculiarities of our sample
      • Empirically:
        • Asking: will our model do well on a new sample of data?
        • Evaluate: get a new sample of data; call it the test set
        • Good performance on the test set is a useful indicator of good performance on the new data in general:
          • If the test set is large enough
          • If we don't cheat by using the test set over and over
    • The ML Fine Print
      • Three basic assumptions in all of the above:
      1. We draw examples independently and identically (i.i.d.) at random from the distribution
      2. The distribution is stationary: it doesn't change over time
      3. We always pull from the same distribution: including training, validation, and test sets

    • Peril of Overfitting
      • An overfit model gets a low loss during training but does a poor job predicting new data.
      • As you'll see later on, overfitting is caused by making a model more complex than necessary. The fundamental tension of machine learning is between fitting our data well, but also fitting the data as simply as possible.
      • The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.
      • In modern times, we've formalized Ockham's razor into the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds--a statistical description of a model's ability to generalize to new data based on factors such as:
        • the complexity of the model
        • the model's performance on training data
      • A machine learning model aims to make good predictions on new, previously unseen data. But if you are building a model from your data set, how would you get the previously unseen data? Well, one way is to divide your data set into two subsets:
        • training set—a subset to train a model.
        • test set—a subset to test the model.
      • Good performance on the test set is a useful indicator of good performance on the new data in general, assuming that:
        • The test set is large enough.
        • You don't cheat by using the same test set over and over.
      • Summary
        • Overfitting occurs when a model tries to fit the training data so closely that it does not generalize well to new data.
        • If the key assumptions of supervised ML are not met, then we lose important theoretical guarantees on our ability to predict on new data.

    Training and Test Sets

    • A test set is a data set used to evaluate the model developed from a training set.
    • What If We Only Have One Data Set?
      • Divide into two sets:
        • training set
        • test set
      • Classic gotcha: do not train on test data
        • Getting surprisingly low loss?
        • Before celebrating, check if you're accidentally training on test data
    • Splitting Data
      • Make sure that your test set meets the following two conditions:
        • Is large enough to yield statistically meaningful results.
        • Is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.
      • Never train on test data.

    Validation Set

    • Check Your Intuition
      • We looked at a process of using a test set and a training set to drive iterations of model development. On each iteration, we'd train on the training data and evaluate on the test data, using the evaluation results on test data to guide choices of and changes to various model hyperparameters like learning rate and features. Is there anything wrong with this approach?
      • Doing many rounds of this procedure might cause us to implicitly fit to the peculiarities of our specific test set.
      • Yes indeed! The more often we evaluate on a given test set, the more we are at risk for implicitly overfitting to that one test set. We'll look at a better protocol next.
    • Partitioning a data set into a training set and test set lets you judge whether a given model will generalize well to new data. However, using only two partitions may be insufficient when doing many rounds of hyperparameter tuning.
    • Another Partition
      • Dividing the data set into two sets is a good idea, but not a panacea. You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure:

    A horizontal bar divided into three pieces: 70% of which is the training set, 15% the validation set, and 15% the test set

    Figure 2. Slicing a single data set into three subsets.

      • Use the validation set to evaluate results from the training set. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. The following figure shows this new workflow:

    Similar workflow to Figure 1, except that instead of evaluating the model against the test set, the workflow evaluates the model against the validation set. Then, once the training set and validation set more-or-less agree, confirm the model against the test set.

    Figure 3. A better workflow.

      • In this improved workflow:
        • Pick the model that does best on the validation set.
        • Double-check that model against the test set.
      • This is a better workflow because it creates fewer exposures to the test set.
      • Tip
        • Test sets and validation sets "wear out" with repeated use. That is, the more you use the same data to make decisions about hyperparameter settings or other model improvements, the less confidence you'll have that these results actually generalize to new, unseen data. Note that validation sets typically wear out more slowly than test sets.
        • If possible, it's a good idea to collect more data to "refresh" the test set and validation set. Starting anew is a great reset.

    Representation

    • A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.
    • Properties of a Good Feature
      • Feature values should appear with non-zero value more than a small handful of times in the dataset.
      • Features should have a clear, obvious meaning.
      • Features shouldn't take on "magic" values
      • The definition of a feature shouldn't change over time.
      • Distribution should not have crazy outliers
    • The Binning Trick
      • Create several boolean bins, each mapping to a new unique feature
      • Allows model to fit a different value for each bin
    • Good Habits
      • KNOW YOUR DATA
        • Visualize: Plot histograms, rank most to least common.
        • Debug: Duplicate examples? Missing values? Outliers? Data agrees with dashboards? Training and Validation data similar?
        • Monitor: Feature quantiles, number of examples over time?
    • Feature Engineering
      • In traditional programming, the focus is on code. In machine learning projects, the focus shifts to representation. That is, one way developers hone a model is by adding and improving its features.
      • Mapping Raw Data to Features
        • Feature engineering means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.
        • Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.

    Raw data is mapped to a feature vector through a process called feature engineering.

      • Mapping numeric values

    An example of a feature that can be copied directly from the raw data

      • Mapping categorical values
        • Categorical features have a discrete set of possible values.
        • Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.
        • We can accomplish this by defining a mapping from the feature values, which we'll refer to as the vocabulary of possible values, to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all "other" category, known as an OOV (out-of-vocabulary) bucket.
        • Mapping each category to a single integer forces the model to learn one weight that applies to every category and imposes an artificial ordering on the values. To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:
          • For values that apply to the example, set corresponding vector elements to 1.
          • Set all other elements to 0.
        • The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.

    Mapping a string value ("Shorebird Way") to a sparse vector, via one-hot encoding.

        • One-hot encoding extends to numeric data that you do not want to directly multiply by a weight, such as a postal code.
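        • A sketch of the one-hot mapping with an OOV bucket; the street names and vocabulary here are made up:

          vocabulary = ["Shorebird Way", "Main Street", "Ocean Avenue"]

          def one_hot(value, vocab):
              """Map a categorical value to a binary vector; unknown
              values fall into an extra OOV bucket at the end."""
              vector = [0] * (len(vocab) + 1)  # +1 for the OOV bucket
              index = vocab.index(value) if value in vocab else len(vocab)
              vector[index] = 1
              return vector

          print(one_hot("Shorebird Way", vocabulary))  # [1, 0, 0, 0]
          print(one_hot("Elm Street", vocabulary))     # [0, 0, 0, 1]  (OOV)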
      • Sparse Representation
        • Suppose that you had 1,000,000 different street names in your data set that you wanted to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements where only 1 or 2 elements are true is a very inefficient representation in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a sparse representation in which only nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.
    • Qualities of Good Features
      • Avoid rarely used discrete feature values
        • Good feature values should appear more than 5 or so times in a data set. Doing so enables a model to learn how this feature value relates to the label. That is, having many examples with the same discrete value gives the model a chance to see the feature in different settings, and in turn, determine when it's a good predictor for the label.
        • Conversely, if a feature's value appears only once or very rarely, the model can't make predictions based on that feature. 
      • Prefer clear and obvious meanings
        • Each feature should have a clear and obvious meaning to anyone on the project.
        • Conversely, the meaning of the following feature value is pretty much indecipherable to anyone but the engineer who created it:
          • house_age: 851472000
        • In some cases, noisy data (rather than bad engineering choices) causes unclear values.
      • Don't mix "magic" values with actual data
        • Good floating-point features don't contain peculiar out-of-range discontinuities or "magic" values.
        • However, if a user didn't enter a quality_rating, perhaps the data set represented its absence with a magic value like the following:
          • quality_rating: -1
        • To explicitly mark magic values, create a Boolean feature that indicates whether or not a quality_rating was supplied. Give this Boolean feature a name like is_quality_rating_defined.
        • In the original feature, replace the magic values as follows:
          • For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing.
          • For continuous variables, ensure missing values do not affect the model by using the mean value of the feature's data.
      • Account for upstream instability
        • The definition of a feature shouldn't change over time.
        • But gathering a value inferred by another model carries additional costs. Perhaps the value "219" currently represents Sao Paulo, but that representation could easily change on a future run of the other model:
          • inferred_city_cluster: "219"
    • Cleaning Data
      • As an ML engineer, you'll spend enormous amounts of your time tossing out bad examples and cleaning up the salvageable ones. Even a few "bad apples" can spoil a large data set.  
      • Scaling feature values 
        • Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:
          • Helps gradient descent converge more quickly.
          • Helps avoid the "NaN trap," in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and—due to math operations—every other number in the model also eventually becomes a NaN.
          • Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.
        • You don't have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
        • One obvious way to scale numerical data is to linearly map [min value, max value] to a small scale, such as [-1, +1].
        • Another popular scaling tactic is to calculate the Z score of each value. The Z score relates the number of standard deviations away from the mean. In other words:
          • scaled_value = (value − mean) / stddev
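        • A sketch of both scaling tactics on made-up values:

          import numpy as np

          values = np.array([100.0, 250.0, 400.0, 900.0])

          # Linear scaling: map [min value, max value] to [-1, +1].
          linear = 2 * (values - values.min()) / (values.max() - values.min()) - 1

          # Z-score scaling: (value - mean) / stddev.
          z_scores = (values - values.mean()) / values.std()

          print(linear)    # [-1.    -0.625 -0.25   1.   ]
          print(z_scores)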
      • Handling extreme outliers
        • The plot shows that the vast majority of areas in California have one or two rooms per person. But take a look along the x-axis.

    A plot of roomsPerPerson in which nearly all the values are clustered between 0 and 4, but there's a verrrrry long tail reaching all the way out to 55 rooms per person

    Figure 4. A verrrrry lonnnnnnng tail.

        • How could we minimize the influence of those extreme outliers? Well, one way would be to take the log of every value:

    A plot of log(roomsPerPerson) in which 99% of values cluster between about 0.4 and 1.8, but there's still a longish tail that goes out to 4.2 or so.

    Figure 5. Logarithmic scaling still leaves a tail.

        • Log scaling does a slightly better job, but there's still a significant tail of outlier values. Let's pick yet another approach. What if we simply "cap" or "clip" the maximum value of roomsPerPerson at an arbitrary value, say 4.0?

    A plot of roomsPerPerson in which all values lie between -0.3 and 4.0. The plot is bell-shaped, but there's an anomalous hill at 4.0

    Figure 6. Clipping feature values at 4.0

        • Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
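        • Both tactics are easy to sketch; the feature values below are made up:

          import numpy as np

          rooms_per_person = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 55.0])

          log_scaled = np.log(rooms_per_person)           # compresses the long tail
          clipped = np.clip(rooms_per_person, None, 4.0)  # values above 4.0 become 4.0

          print(clipped)  # [0.5 1.  1.5 2.  3.  4. ]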
      • Binning
        • The following plot shows the relative prevalence of houses at different latitudes in California. Notice the clustering—Los Angeles is about at latitude 34 and San Francisco is roughly at latitude 38.

    A plot of houses per latitude. The plot is highly irregular, containing doldrums around latitude 36 and huge spikes around latitudes 34 and 38.

    Figure 7. Houses per latitude.

        • In the data set, latitude is a floating-point value. However, it doesn't make sense to represent latitude as a floating-point feature in our model. That's because no linear relationship exists between latitude and housing values. For example, houses in latitude 35 are not 35 / 34 more expensive (or less expensive) than houses at latitude 34. And yet, individual latitudes probably are a pretty good predictor of house values.
        • To make latitude a helpful predictor, let's divide latitudes into "bins" as suggested by the following figure:

    A plot of houses per latitude. The plot is divided into "bins" between whole number latitudes.

    Figure 8. Binning values.

        • Instead of having one floating-point feature, we now have 11 distinct boolean features (LatitudeBin1, LatitudeBin2, ..., LatitudeBin11). Having 11 separate features is somewhat inelegant, so let's unite them into a single 11-element vector. Doing so will enable us to represent latitude 37.4 as follows:
        • [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
        • Thanks to binning, our model can now learn completely different weights for each latitude.
        • For simplicity's sake in the latitude example, we used whole numbers as bin boundaries. Had we wanted finer-grain resolution, we could have split bin boundaries at, say, every tenth of a degree. Adding more bins enables the model to learn different behaviors from latitude 37.4 than latitude 37.5, but only if there are sufficient examples at each tenth of a latitude.
        • Another approach is to bin by quantile, which ensures that the number of examples in each bucket is equal. Binning by quantile completely removes the need to worry about outliers.
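        • A sketch of whole-degree binning that reproduces the vector above; the bin edges 32 through 43 are illustrative California latitudes:

          import numpy as np

          edges = np.arange(32, 44)                     # 12 edges -> 11 bins
          latitude = 37.4

          bin_index = np.digitize(latitude, edges) - 1  # 37.4 falls in bin [37, 38)
          one_hot = np.zeros(len(edges) - 1, dtype=int)
          one_hot[bin_index] = 1
          print(one_hot)  # [0 0 0 0 0 1 0 0 0 0 0]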
      • Scrubbing
        • In real-life, many examples in data sets are unreliable due to one or more of the following:
          • Omitted values. For instance, a person forgot to enter a value for a house's age.
          • Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
          • Bad labels. For instance, a person mislabeled a picture of an oak tree as a maple.
          • Bad feature values. For example, someone typed in an extra digit, or a thermometer was left out in the sun.
        • Once detected, you typically "fix" bad examples by removing them from the data set. To detect omitted values or duplicated examples, you can write a simple program. Detecting bad feature values or labels can be far trickier.
        • In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate. In addition, getting statistics like the following can help:
          • Maximum and minimum
          • Mean and median
          • Standard deviation
        • Consider generating lists of the most common values for discrete features.
      • Know your data
        • Follow these rules:
          • Keep in mind what you think your data should look like.
          • Verify that the data meets these expectations (or that you can explain why it doesn’t).
          • Double-check that the training data agrees with other sources (for example, dashboards).
        • Treat your data with all the care that you would treat any mission-critical code. Good ML relies on good data.  

    Feature Crosses

    • A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
    • Feature Crosses
      • Feature crosses is the name of this approach
      • Define templates of the form [A x B]
      • Can be complex: [A x B x C x D x E]
      • When A and B represent boolean features, such as bins, the resulting crosses can be extremely sparse
    • Feature Crosses: Why would we do this?
      • Linear learners use linear models
      • Such learners scale well to massive data, e.g., Vowpal Wabbit, sofia-ml
      • But without feature crosses, the expressivity of these models would be limited
      • Using feature crosses + massive data is one efficient strategy for learning highly complex models
        • Foreshadowing: neural nets provide another
    • Encoding Nonlinearity
      • A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together. (The term cross comes from cross product.) Let's create a feature cross named x3 by crossing x1 and x2:
        • x3 = x1x2
      • Thanks to stochastic gradient descent, linear models can be trained efficiently. Consequently, supplementing scaled linear models with feature crosses has traditionally been an efficient way to train on massive-scale data sets.  
    • Crossing One-Hot Vectors
      • So far, we've focused on feature-crossing two individual floating-point features. In practice, machine learning models seldom cross continuous features. However, machine learning models do frequently cross one-hot feature vectors. Think of feature crosses of one-hot feature vectors as logical conjunctions.
      • Linear learners scale well to massive data. Using feature crosses on massive data sets is one efficient strategy for learning highly complex models. Neural networks provide another strategy.
      • Different cities in California have markedly different housing prices. Suppose you must create a model to predict housing prices. Which of the following sets of features or feature crosses could learn city-specific relationships between roomsPerPerson and housing price?
        • One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]
        • Crossing binned latitude with binned longitude enables the model to learn city-specific effects of roomsPerPerson. Binning prevents a change in latitude producing the same result as a change in longitude. Depending on the granularity of the bins, this feature cross could learn city-specific or neighborhood-specific or even block-specific effects.
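      • A sketch of crossing two one-hot (binned) features: the outer product yields one element per (latitude bin, longitude bin) combination, acting as a logical AND. The bin counts here are made up:

        import numpy as np

        binned_latitude = np.array([0, 1, 0])      # one-hot: latitude bin 1
        binned_longitude = np.array([0, 0, 1, 0])  # one-hot: longitude bin 2

        cross = np.outer(binned_latitude, binned_longitude).flatten()
        print(cross)  # 12-element vector; only the (lat 1, lon 2) element is 1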

    Regularization for Simplicity

    • Regularization means penalizing the complexity of a model to reduce overfitting.
    • Penalizing Model Complexity
      • We want to avoid model complexity where possible.
      • We can bake this idea into the optimization we do at training time.
      • Empirical Risk Minimization:
        • aims for low training error
          • minimize: Loss(Data|Model)
        • while balancing against complexity
          • minimize: Loss(Data|Model)+complexity(Model)
    • Regularization
      • How to define complexity(Model)?
      • Prefer smaller weights
      • Diverging from this should incur a cost
      • Can encode this idea via L2 regularization (a.k.a. ridge)
        • complexity(model) = sum of the squares of the weights
        • Penalizes really big weights
        • For linear models: prefers flatter slopes
        • Bayesian prior:
          • weights should be centered around zero
          • weights should be normally distributed
    • A Loss Function with L2 Regularization
      • Loss(Data|Model) + λ(w1^2+…+wn^2)
      • Where:
        • Loss: Aims for low training error
        • λ: Scalar value that controls how weights are balanced
        • w1^2+…+wn^2: Square of L2 norm 
    • L₂ Regularization
      • In other words, this generalization curve shows that the model is overfitting to the data in the training set. Channeling our inner Ockham, perhaps we could prevent overfitting by penalizing complex models, a principle called regularization.
      • In other words, instead of simply aiming to minimize loss (empirical risk minimization):
        • minimize(Loss(Data|Model))
      • we'll now minimize loss+complexity, which is called structural risk minimization:
        • minimize(Loss(Data|Model) + complexity(Model))
      • Our training optimization algorithm is now a function of two terms: the loss term, which measures how well the model fits the data, and the regularization term, which measures model complexity.
      • Machine Learning Crash Course focuses on two common (and somewhat related) ways to think of model complexity:
        • Model complexity as a function of the weights of all the features in the model.
        • Model complexity as a function of the total number of features with nonzero weights. (A later module covers this approach.)
      • If model complexity is a function of weights, a feature weight with a high absolute value is more complex than a feature weight with a low absolute value.
      • We can quantify complexity using the L2 regularization formula, which defines the regularization term as the sum of the squares of all the feature weights:
        • L2 regularization term = ||w||2^2 = w1^2 + w2^2 + ... + wn^2
      • In this formula, weights close to zero have little effect on model complexity, while outlier weights can have a huge impact.
    • Lambda
      • Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda (also called the regularization rate). That is, model developers aim to do the following:
        • minimize(Loss(Data|Model)+λ complexity(Model))
      • Performing L2 regularization has the following effect on a model:
        • Encourages weight values toward 0 (but not exactly 0)
        • Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.
      • Increasing the lambda value strengthens the regularization effect.
      • When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:
        • If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.
        • If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won't be able to generalize to new data.
      • Note: Setting lambda to zero removes regularization completely. In this case, training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.
      • The ideal value of lambda produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value of lambda is data-dependent, so you'll need to do some tuning.

      • There's a close connection between learning rate and lambda. Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect because the steps away from 0 aren't as large. Consequently, tweaking learning rate and lambda simultaneously may have confounding effects.
      • Early stopping means ending training before the model fully reaches convergence. In practice, we often end up with some amount of implicit early stopping when training in an online (continuous) fashion. That is, some new trends just haven't had enough data yet to converge.
      • As noted, the effects from changes to regularization parameters can be confounded with the effects from changes in learning rate or number of iterations. One useful practice (when training across a fixed batch of data) is to give yourself a high enough number of iterations that early stopping doesn't play into things.
      • L2 regularization may cause the model to learn a moderate weight for some non-informative features.
        • Surprisingly, this can happen when a non-informative feature happens to be correlated with the label. In this case, the model incorrectly gives such non-informative features some of the "credit" that should have gone to informative features.  
      • L2 regularization will encourage many of the non-informative weights to be nearly (but not exactly) 0.0.
      • L2 regularization will force the features towards roughly equivalent weights that are approximately half of what they would have been had only one of the two features been in the model.
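      • A sketch of the regularized objective minimize(Loss(Data|Model) + λ complexity(Model)) with squared loss; the weights, predictions, and labels are made up:

        import numpy as np

        def l2_regularized_loss(weights, predictions, labels, lam):
            data_loss = np.mean((labels - predictions) ** 2)  # loss term (MSE)
            complexity = np.sum(weights ** 2)                 # squared L2 norm
            return data_loss + lam * complexity

        weights = np.array([0.2, 0.5, 5.0, 1.0, 0.25, 0.3])
        predictions = np.array([1.0, 0.0])
        labels = np.array([0.9, 0.2])
        # The w = 5.0 outlier dominates the complexity term (sum ≈ 26.44).
        print(l2_regularized_loss(weights, predictions, labels, lam=0.1))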

    Logistic Regression

    • Instead of predicting exactly 0 or 1, logistic regression generates a probability—a value between 0 and 1, exclusive.
    • Many problems require a probability estimate as output
    • Enter Logistic Regression
    • Handy because the probability estimates are calibrated
      • for example, p(house will sell) * price = expected outcome
    • Also useful for when we need a binary classification
      • spam or not spam? → p(Spam)
    • Predictions
      • y′ = 1 / (1 + e^(−(wᵀx + b)))
      • Where:
        • wᵀx + b: provides the familiar linear model
        • 1 / (1 + e^(−...)): squishes the output through a sigmoid
    Graph of logistic-regression equation
    • LogLoss Defined
      • LogLoss = ∑_{(x,y)∈D} −y log(y′) − (1 − y) log(1 − y′)

    Two graphs of Log Loss vs. predicted value: one for a target value of 0.0 (which arcs up and to the right) and one for a target value of 1.0 (which arcs down and to the left) 

    • Logistic Regression and Regularization
      • Regularization is super important for logistic regression.
        • Remember the asymptotes
        • It'll keep trying to drive loss to 0 in high dimensions
      • Two strategies are especially useful:
        • L2 regularization (aka L2 weight decay) - penalizes huge weights.
        • Early stopping - limiting training steps or learning rate.
    • Linear Logistic Regression    
      • Linear logistic regression is extremely efficient.
        • Very fast training and prediction times.
        • Short / wide models use a lot of RAM.
    • Calculating a Probability    
      • Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:
        • "As is"
        • Converted to a binary category.
      • In many cases, you'll map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels (e.g., "spam" or "not spam"). A later module focuses on that.  
      • If z represents the output of the linear layer of a model trained with logistic regression, then sigmoid(z) will yield a value (a probability) between 0 and 1. In mathematical terms:
        • y′ = 1 / (1 + e^(−z))
      • where:
        • y' is the output of the logistic regression model for a particular example.
        • z is b + w1x1 + w2x2 + ... + wNxN
          • The w values are the model's learned weights, and b is the bias.
          • The x values are the feature values for a particular example.
        • Note that z is also referred to as the log-odds because the inverse of the sigmoid states that z can be defined as the log of the probability of the "1" label (e.g., "dog barks") divided by the probability of the "0" label (e.g., "dog doesn't bark"):
          • z = log( y / (1−y) )
          • Here is the sigmoid function with ML labels:

    The Sigmoid function with the x-axis labeled as the sum of all the weights and features (plus the bias); the y-axis is labeled Probability Output.

    Figure 2: Logistic regression output.
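
      • A sketch of the computation above; the weights, bias, and features are made up:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        b, w, x = 1.0, np.array([2.0, -1.0]), np.array([0.5, 3.0])
        z = b + np.dot(w, x)  # log-odds: 1 + 2*0.5 - 1*3 = -1.0
        print(sigmoid(z))     # ≈ 0.269, i.e., a 26.9% probability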

    • Loss and Regularization
      • Loss function for Logistic Regression
        • The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:
          • Log Loss = ∑_{(x,y)∈D} −y log(y′) − (1 − y) log(1 − y′)
        • where:
          • (x,y)∈D is the data set containing many labeled examples, which are (x,y) pairs.
          • y is the label in a labeled example. Since this is logistic regression, every value of y must either be 0 or 1.
          • y′ is the predicted value (somewhere between 0 and 1), given the set of features in x.
        • The equation for Log Loss is closely related to Shannon's Entropy measure from Information Theory. It is also the negative logarithm of the likelihood function, assuming a Bernoulli distribution of y. Indeed, minimizing the loss function yields a maximum likelihood estimate.
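        • A sketch of the Log Loss sum on made-up labels and predictions:

          import numpy as np

          y = np.array([1, 0, 1, 1])               # labels, each exactly 0 or 1
          y_pred = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities

          log_loss = np.sum(-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred))
          print(log_loss)  # ≈ 1.20; perfectly confident correct predictions give 0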
      • Regularization in Logistic Regression
        • Regularization is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions. Consequently, most logistic regression models use one of the following two strategies to dampen model complexity:
          • L2 regularization.
          • Early stopping, that is, limiting the number of training steps or the learning rate.
        • (We'll discuss a third strategy—L1 regularization—in a later module.)
      • Summary
        • Logistic regression models generate probabilities.
        • Log Loss is the loss function for logistic regression.
        • Logistic regression is widely used by many practitioners.

    Classification

    Thresholding

    • Logistic regression returns a probability. You can use the returned probability "as is" (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).
    • In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.
    • Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters such as learning rate. Part of choosing a threshold is assessing how much you'll suffer for making a mistake. For example, mistakenly labeling a non-spam message as spam is very bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly the end of your job.

    True vs. False and Positive vs. Negative

    • A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
    • A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.

    Accuracy

    • Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:
      • Accuracy = Number of correct predictions / Total number of predictions
    • For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:
      • Accuracy = (TP+TN) / (TP+TN+FP+FN)
      • Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
    • Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one, where there is a significant disparity between the number of positive and negative labels.

    Precision and Recall

    • Precision attempts to answer the following question:
      • What proportion of positive identifications was actually correct?
    • Precision is defined as follows:
      • Precision = TP / (TP + FP)
    • Note: A model that produces no false positives has a precision of 1.0.
    • Recall attempts to answer the following question:
      • What proportion of actual positives was identified correctly?
    • Mathematically, recall is defined as follows:
      • Recall = TP / (TP + FN)
    • Note: A model that produces no false negatives has a recall of 1.0.
    • To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa.
    • Various metrics have been developed that rely on both precision and recall. For example, see F1 score
    • In general, raising the classification threshold reduces false positives, thus raising precision.
    • Raising our classification threshold will cause the number of true positives to decrease or stay the same and will cause the number of false negatives to increase or stay the same. Thus, recall will either stay constant or decrease.
    • In general, a model that outperforms another model on both precision and recall is likely the better model. Obviously, we'll need to make sure that comparison is being done at a precision / recall point that is useful in practice for this to be meaningful. For example, suppose our spam detection model needs to have at least 90% precision to be useful and avoid unnecessary false alarms. In this case, comparing one model at {20% precision, 99% recall} to another at {15% precision, 98% recall} is not particularly instructive, as neither model meets the 90% precision requirement. But with that caveat in mind, this is a good way to think about comparing models when using precision and recall.
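    • A sketch computing these metrics from made-up confusion-matrix counts:

      tp, fp, tn, fn = 8, 2, 85, 5  # hypothetical counts

      accuracy = (tp + tn) / (tp + tn + fp + fn)  # 93 / 100 = 0.93
      precision = tp / (tp + fp)                  # 8 / 10 = 0.8
      recall = tp / (tp + fn)                     # 8 / 13 ≈ 0.615
      print(accuracy, precision, recall)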

    ROC Curve and AUC

    • An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
      • True Positive Rate
      • False Positive Rate
    • True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
      • TPR = TP / (TP + FN)
    • False Positive Rate (FPR) is defined as follows:
      • FPR = FP / (FP + TN)
    • An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.

    ROC Curve showing TP Rate vs. FP Rate at different classification thresholds.

    Figure 4. TP vs. FP rate at different classification thresholds.

    • To compute the points in an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, there's an efficient, sorting-based algorithm that can provide this information for us, called AUC.
    • AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

    AUC (Area under the ROC Curve).

    Figure 5. AUC (Area under the ROC Curve).

    • AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
    • AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
    • AUC is desirable for the following two reasons:
      • AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
      • AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
    • However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:
      • Scale invariance is not always desirable. For example, sometimes we really do need well calibrated probability outputs, and AUC won’t tell us about that.
      • Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.
    • In practice, if you have a "perfect" classifier with an AUC of 1.0, you should be suspicious, as it likely indicates a bug in your model. For example, you may have overfit to your training data, or the label data may be replicated in one of your features.
    • AUC is based on the relative predictions, so any transformation of the predictions that preserves the relative ranking has no effect on AUC. This is clearly not the case for other metrics such as squared error, log loss, or prediction bias (discussed later).
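    • A brute-force sketch of the ranking interpretation on made-up scores; real implementations use the efficient sort-based algorithm mentioned above:

      import numpy as np

      labels = np.array([0, 0, 1, 1, 0, 1])
      scores = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9])

      pos, neg = scores[labels == 1], scores[labels == 0]
      wins = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
      ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
      auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
      print(auc)  # 7 / 9 ≈ 0.778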

    Prediction Bias

    • Logistic regression predictions should be unbiased. That is:
    • "average of predictions" should ≈ "average of observations"
    • Prediction bias is a quantity that measures how far apart those two averages are. That is:
    • prediction bias = average of predictions − average of labels in data set
    • Note: "Prediction bias" is a different quantity than bias (the b in wx + b).
    • A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.

    • Possible root causes of prediction bias are:
      • Incomplete feature set
      • Noisy data set
      • Buggy pipeline
      • Biased training sample
      • Overly strong regularization
    • You might be tempted to correct prediction bias by post-processing the learned model—that is, by adding a calibration layer that adjusts your model's output to reduce the prediction bias.
    • However, adding a calibration layer is a bad idea for the following reasons:
      • You're fixing the symptom rather than the cause.
      • You've built a more brittle system that you must now keep up to date.
    • If possible, avoid calibration layers. Projects that use calibration layers tend to become reliant on them—using calibration layers to fix all their model's sins. Ultimately, maintaining the calibration layers can become a nightmare.
    • Note: A good model will usually have near-zero bias. That said, a low prediction bias does not prove that your model is good. A really terrible model could have a zero prediction bias. For example, a model that just predicts the mean value for all examples would be a bad model, despite having zero bias.
    • Logistic regression predicts a value between 0 and 1. However, all labeled examples are either exactly 0 (meaning, for example, "not spam") or exactly 1 (meaning, for example, "spam"). Therefore, when examining prediction bias, you cannot accurately determine the prediction bias based on only one example; you must examine the prediction bias on a "bucket" of examples. That is, prediction bias for logistic regression only makes sense when grouping enough examples together to be able to compare a predicted value (for example, 0.392) to observed values (for example, 0.394).
    • You can form buckets in the following ways:
      • Linearly breaking up the target predictions.
      • Forming quantiles.
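    • A sketch of measuring prediction bias overall and per quantile bucket; the predictions and labels are made up:

      import numpy as np

      preds = np.array([0.9, 0.8, 0.3, 0.4, 0.1, 0.7, 0.2, 0.6])
      labels = np.array([1, 1, 0, 0, 0, 1, 0, 0])

      print(preds.mean() - labels.mean())  # overall bias: 0.5 - 0.375 = 0.125

      # Bucket by quantiles of the predictions, then compare within buckets.
      order = np.argsort(preds)
      for bucket in np.array_split(order, 2):
          print(preds[bucket].mean() - labels[bucket].mean())
      # low bucket: 0.25 - 0.0 = 0.25 (miscalibrated); high bucket: 0.0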

    Regularization for Sparsity

    • This module focuses on the special requirements for models learned on feature vectors that have many dimensions.
      • Let's Go Back to Feature Crosses
        • Caveat: Sparse feature crosses may significantly increase feature space
        • Possible issues:
          • Model size (RAM) may become huge
          • "Noise" coefficients (causes overfitting)
      • L1 Regularization
        • Would like to penalize L0 norm of weights
          • Non-convex optimization; NP hard
        • Relax to L1 regularization:
          • Penalize sum of abs(weights)
          • Convex problem
          • Encourage sparsity unlike L2
    • L1 Regularization
      • Sparse vectors often contain many dimensions. Creating a feature cross results in even more dimensions. Given such high-dimensional feature vectors, model size may become huge and require huge amounts of RAM.
      • In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.
      • L2 regularization encourages weights to be small, but doesn't force them to exactly 0.0.
      • An alternative idea would be to create a regularization term that penalizes the count of non-zero coefficient values in a model. Increasing this count would only be justified if there was a sufficient gain in the model's ability to fit the data. Unfortunately, while this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem that's NP-hard. (If you squint, you can see a connection to the knapsack problem.) So this idea, known as L0 regularization, isn't something we can use effectively in practice.
      • However, there is a regularization term called L1 regularization that serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. So we can use L1 regularization to encourage many of the uninformative coefficients in our model to be exactly 0, and thus reap RAM savings at inference time.
      • L1 vs L2 regularization.
        • L2 and L1 penalize weights differently:
          • L2 penalizes weight^2.
          • L1 penalizes |weight|.
        • Consequently, L2 and L1 have different derivatives:
          • The derivative of L2 is 2 * weight.
          • The derivative of L1 is k (a constant, whose value is independent of weight).
        • You can think of the derivative of L2 as a force that removes x% of the weight every time. As Zeno knew, even if you remove x percent of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.
        • You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For example, if subtraction would have forced a weight from +0.1 to −0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight. (A numeric sketch of both update rules follows this list.)
        • L1 regularization—penalizing the absolute value of all the weights—turns out to be quite efficient for wide models.
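        • A toy numeric sketch (mine, not from the course) of one regularization update step for each penalty, showing why L2 shrinks weights proportionally while L1 subtracts a constant and clips at zero:

          def l2_step(weight, lam, lr):
              # d/dw (lam * w^2) = 2 * lam * w  -> removes a fixed *fraction*.
              return weight - lr * 2 * lam * weight

          def l1_step(weight, lam, lr):
              # d/dw (lam * |w|) = lam * sign(w) -> subtracts a fixed *amount*;
              # if the step would cross zero, clip the weight to exactly 0.
              step = lr * lam * (1 if weight > 0 else -1)
              return 0.0 if abs(step) >= abs(weight) else weight - step

          print(l2_step(0.1, lam=1.0, lr=0.1))  # 0.08 -- smaller, but never exactly 0
          print(l1_step(0.1, lam=1.0, lr=0.2))  # 0.0  -- crossed zero, so zeroed out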
    • L1 regularization may cause informative features to get a weight of exactly 0.0.
      • L1 regularization may cause the following kinds of features to be given weights of exactly 0:
      • Weakly informative features.
      • Strongly informative features on different scales.
      • Informative features strongly correlated with other similarly informative features.
    • L1 regularization will encourage most of the non-informative weights to be exactly 0.0.
      • In general, L1 regularization with a sufficiently large lambda tends to drive the weights of non-informative features to exactly 0.0. Unlike L2 regularization, L1 regularization "pushes" just as hard toward 0.0 no matter how far the weight is from 0.0.
    • Which type of regularization will produce the smaller model?
      • L1 regularization tends to reduce the number of features. In other words, L1 regularization often reduces the model size.
      • L2 regularization rarely reduces the number of features. In other words, L2 regularization rarely reduces the model size.
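    • A quick empirical sketch of that size difference (scikit-learn is my assumption here; the course itself uses TensorFlow). Lasso applies an L1 penalty and Ridge an L2 penalty, on synthetic data with only 10 truly informative features out of 100:

      import numpy as np
      from sklearn.datasets import make_regression
      from sklearn.linear_model import Lasso, Ridge

      X, y = make_regression(n_samples=200, n_features=100,
                             n_informative=10, noise=5.0, random_state=0)

      lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
      ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

      print("L1 nonzero weights:", np.sum(lasso.coef_ != 0))  # few (roughly the informative ones)
      print("L2 nonzero weights:", np.sum(ridge.coef_ != 0))  # all 100 remain nonzero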

    Neural Networks

    • Neural networks are a more sophisticated version of feature crosses. In essence, neural networks learn the appropriate feature crosses for you.
      • A Linear Model

    Three blue circles in a row connected by arrows to a green circle above them

      • Add Complexity: Non-Linear?

    Three blue circles in a row labeled "Input" connected by arrows to a row of yellow circles labeled "Hidden Layer" above them, which are in turn connected to a green circle labeled "Output" at the top.

      • More Complex: Non-Linear?

    Three blue circles in a row labeled "Input" connected by arrows to a row of yellow circles labeled "Hidden Layer" above them, which are connected by arrows to a second "Hidden Layer" row of yellow circles, which are in turn connected to a green circle labeled "Output" at the top.

      • Adding a Non-Linearity

    The same as the previous figure, except that a row of pink circles labeled 'Non-Linear Transformation Layer' has been added in between the two hidden layers.

      • Our Favorite Non-Linearity

    A graph that has a slope of 0 for negative x and then becomes linear with slope 1 once x passes 0 (the ReLU function)

      • Neural Nets Can Be Arbitrarily Complex

    A complex neural network

    • Structure 
      • "Nonlinear" means that you can't accurately predict a label with a model of the form b+w1x1+w2x2 In other words, the "decision surface" is not a line.
      • Hidden Layers
        • In the model represented by the following graph, we've added a "hidden layer" of intermediary values. Each yellow node in the hidden layer is a weighted sum of the blue input node values. The output is a weighted sum of the yellow nodes.
      • Activation Functions
        • To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe each hidden layer node through a nonlinear function.
        • In the model represented by the following graph, the value of each node in Hidden Layer 1 is transformed by a nonlinear function before being passed on to the weighted sums of the next layer. This nonlinear function is called the activation function.
      • Common Activation Functions
        • The following sigmoid activation function converts the weighted sum to a value between 0 and 1.
          • F(x) = 1 / (1 + e^(−x))

    Sigmoid function

    Figure 7. Sigmoid activation function.

        • The following rectified linear unit activation function (or ReLU, for short) often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute.
          • F(x) = max( 0, x )
        • The superiority of ReLU is based on empirical findings, probably driven by ReLU having a more useful range of responsiveness. A sigmoid's responsiveness falls off relatively quickly on both sides.

    ReLU activation function

    Figure 8. ReLU activation function.

        • In fact, any mathematical function can serve as an activation function. Suppose that σ represents our activation function (ReLU, sigmoid, or whatever). Consequently, the value of a node in the network is given by the following formula:
          • σ( w ⋅ x + b )
        • TensorFlow provides out-of-the-box support for a wide variety of activation functions. That said, we still recommend starting with ReLU.
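        • A minimal TensorFlow/Keras sketch of such a network: two hidden layers with ReLU activations and a sigmoid output. The layer sizes here are arbitrary illustrations, not course values.

          import tensorflow as tf

          model = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(3,)),               # three input features
              tf.keras.layers.Dense(4, activation="relu"),     # Hidden Layer 1
              tf.keras.layers.Dense(4, activation="relu"),     # Hidden Layer 2
              tf.keras.layers.Dense(1, activation="sigmoid"),  # output in (0, 1)
          ])
          model.compile(optimizer="adam", loss="binary_crossentropy")
          model.summary()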
      • Summary
        • Now our model has all the standard components of what people usually mean when they say "neural network":
          • A set of nodes, analogous to neurons, organized in layers.
          • A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
          • A set of biases, one for each node.
          • An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.
      • A caveat: neural networks aren't necessarily always better than feature crosses, but neural networks do offer a flexible alternative that works well in many cases.

    Training Neural Networks

    • Backpropagation is the most common training algorithm for neural networks. It makes gradient descent feasible for multi-layer neural networks. TensorFlow handles backpropagation automatically, so you don't need a deep understanding of the algorithm. To get a sense of how it works, walk through the following: Backpropagation algorithm visual explanation. As you scroll through the preceding explanation, note the following:
      • How data flows through the graph.
      • How dynamic programming lets us avoid computing exponentially many paths through the graph. Here "dynamic programming" just means recording intermediate results on the forward and backward passes.
    • Backprop: What You Need To Know
      • Gradients are important
        • If it's differentiable, we can probably learn on it
      • Gradients can vanish
        • Each additional layer can successively reduce signal vs. noise
        • ReLUs are useful here
      • Gradients can explode
        • Learning rates are important here
        • Batch normalization (useful knob) can help
      • ReLU layers can die
        • Keep calm and lower your learning rates
    • Normalizing Feature Values
      • We'd like our features to have reasonable scales
        • Roughly zero-centered, [-1, 1] range often works well
        • Helps gradient descent converge; avoids the NaN trap
        • Avoiding outlier values can also help
      • Can use a few standard methods (see the sketch after this list):
        • Linear scaling
        • Hard cap (clipping) to max, min
        • Log scaling
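      • A small sketch (mine) of the three methods applied to a hypothetical long-tailed feature:

        import numpy as np

        x = np.array([1.0, 3.0, 10.0, 250.0, 10000.0])

        # Linear scaling to roughly [-1, 1].
        linear = 2 * (x - x.min()) / (x.max() - x.min()) - 1

        # Hard cap (clipping) to a chosen [min, max] range.
        clipped = np.clip(x, 0.0, 100.0)

        # Log scaling, which compresses long-tailed distributions.
        logged = np.log1p(x)

        print(linear, clipped, logged, sep="\n")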
    • Dropout Regularization
      • Dropout: Another form of regularization, useful for NNs
      • Works by randomly "dropping out" units in a network for a single gradient step
        • There's a connection to ensemble models here
      • The more you drop out, the stronger the regularization
        • 0.0 = no dropout regularization
        • 1.0 = drop everything out! The model learns nothing
        • Intermediate values more useful
    • Best Practices
      • Failure Cases
        • Vanishing Gradients
          • The gradients for the lower layers (closer to the input) can become very small. In deep networks, computing these gradients can involve taking the product of many small terms.
          • When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all.
          • The ReLU activation function can help prevent vanishing gradients.
        • Exploding Gradients
          • If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that get too large to converge.
          • Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
        • Dead ReLU Units
          • Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0.
          • Lowering the learning rate can help keep ReLU units from dying.
      • Dropout Regularization
        • Yet another form of regularization, called Dropout, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization (a Keras sketch follows this list):
          • 0.0 = No dropout regularization.
          • 1.0 = Drop out everything. The model learns nothing.
          • Values between 0.0 and 1.0 = More useful.
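        • A hedged Keras sketch of dropout between layers; the rate of 0.2 (drop 20% of unit activations per gradient step) is an illustrative value, not a course recommendation:

          import tensorflow as tf

          model = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(20,)),
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dropout(0.2),  # stronger regularization at higher rates
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dropout(0.2),
              tf.keras.layers.Dense(1, activation="sigmoid"),
          ])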

    Multi-Class Neural Networks

    • In this module, we'll investigate multi-class classification, which can pick from multiple possibilities.
    • Some real-world multi-class problems entail choosing from millions of separate classes.
    • More than two classes?
      • Logistic regression gives useful probabilities for binary-class problems.
        • spam / not-spam
        • click / not-click
      • What about multi-class problems?
        • apple, banana, car, cardiologist, ..., walk sign, zebra, zoo
        • red, orange, yellow, green, blue, indigo, violet
        • animal, vegetable, mineral
    • One-Vs-All Multi-Class
      • Create a unique output for each possible class
      • Train that on a signal of "my class" vs "all other classes"
      • Can do in a deep network, or with separate models

    A neural network with five hidden layers and five output layers.

    • SoftMax Multi-Class
      • Add an additional constraint: Require output of all one-vs-all nodes to sum to 1.0
      • The additional constraint helps training converge quickly
      • Plus, allows outputs to be interpreted as probabilities

    A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.

    • What to use When?
      • Multi-Class, Single-Label Classification:
        • An example may be a member of only one class.
        • Constraint that classes are mutually exclusive is helpful structure.
        • Useful to encode this in the loss.
        • Use one softmax loss for all possible classes.
      • Multi-Class, Multi-Label Classification:
        • An example may be a member of more than one class.
        • No additional constraints on class membership to exploit.
        • One logistic regression loss for each possible class.
    • SoftMax Options
      • Full SoftMax
        • Brute force; calculates for all classes.
      • Candidate Sampling
        • Calculates for all the positive labels, but only for a random sample of negatives.
    • One vs. All
      • One vs. all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question.
      • This approach is fairly reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises.
      • We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following figure suggests this approach; a code sketch follows the figure:

    A neural network with five hidden layers and five output layers.

    Figure 1. A one-vs.-all neural network.
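      • A sketch of the "separate models" variant using scikit-learn's LogisticRegression as the base binary classifier (my choice of library, not the course's; the data and class labels are synthetic):

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=500, n_features=10,
                                   n_informative=6, n_classes=4, random_state=0)

        # Train one "my class vs. all other classes" model per class.
        models = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
                  for c in np.unique(y)}

        # Predict the class whose binary model is most confident.
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
        predicted = np.array(list(models))[scores.argmax(axis=1)]
        print("training accuracy:", (predicted == y).mean())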

    • Softmax
      • Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.
      • Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.

    A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.

    Figure 2. A Softmax layer within a neural network.

      • The Softmax equation is as follows:
        • p(y = j | x) = e^(wjᵀx + bj) / ∑k∈K e^(wkᵀx + bk)
        • Note that this formula basically extends the formula for logistic regression into multiple classes.
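        • A direct NumPy sketch of that equation, with the standard max-subtraction trick for numerical stability (the logits here stand in for wkᵀx + bk):

          import numpy as np

          def softmax(logits):
              shifted = logits - np.max(logits)  # avoid overflow in exp
              exps = np.exp(shifted)
              return exps / exps.sum()

          print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1.0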
      • Softmax Options
        • Consider the following variants of Softmax:
          • Full Softmax is the Softmax we've been discussing; that is, Softmax calculates a probability for every possible class.
          • Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.
        • Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
      • One Label vs. Many Labels
        • Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes.
        • For example, suppose your examples are images containing exactly one item—a piece of fruit. Softmax can determine the likelihood of that one item being a pear, an orange, an apple, and so on. If your examples are images containing all sorts of things—bowls of different kinds of fruit—then you'll have to use multiple logistic regressions instead.
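        • A hedged Keras sketch of the two output heads: softmax with a single cross-entropy loss for single-label problems, versus per-class sigmoids with binary cross-entropy for multi-label problems. NUM_CLASSES and the layer sizes are placeholders.

          import tensorflow as tf

          NUM_CLASSES = 5

          # Multi-class, single-label: exactly one class per example.
          single_label = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(32,)),
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
          ])
          single_label.compile(optimizer="adam", loss="categorical_crossentropy")

          # Multi-class, multi-label: each class is an independent yes/no.
          multi_label = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(32,)),
              tf.keras.layers.Dense(64, activation="relu"),
              tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
          ])
          multi_label.compile(optimizer="adam", loss="binary_crossentropy")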

    Embeddings

    • An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models. 
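    • A minimal Keras sketch of learning an embedding: sparse word IDs are mapped into a small dense space that later layers consume. VOCAB_SIZE and EMBED_DIM are illustrative placeholders.

      import tensorflow as tf

      VOCAB_SIZE = 10_000  # number of distinct input IDs (e.g., words)
      EMBED_DIM = 16       # dimensionality of the learned embedding space

      embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
      model = tf.keras.Sequential([
          tf.keras.layers.Input(shape=(None,), dtype="int32"),  # sequence of IDs
          embedding,
          tf.keras.layers.GlobalAveragePooling1D(),
          tf.keras.layers.Dense(1, activation="sigmoid"),
      ])
      # After training, embedding.get_weights()[0] is a (VOCAB_SIZE, EMBED_DIM)
      # matrix holding one learned vector per ID, reusable in other models.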

    ML Engineering


    Fairness

    • Evaluating a machine learning model responsibly requires doing more than just calculating loss metrics. Before putting a model into production, it's critical to audit training data and evaluate predictions for bias.
    • This module looks at different types of human biases that can manifest in training data. It then provides strategies to identify them and evaluate their effects.
    • What do you see?
      • Bananas
      • Stickers
      • Bananas on shelves
    A bunch of bananas
      • Green Bananas
      • Unripe Bananas
    A bunch of green bananas
      • Overripe Bananas
      • Good for Banana Bread
    A bunch of brown bananas
      • Yellow Bananas
      • Yellow is prototypical for bananas

    A diagram illustrating a typical machine learning workflow: collect data, then train a model, and then generate output

    Diagram illustrating two types of biases in data: human biases that manifest in data (such as out-group homogeneity bias), and human biases that affect data collection and annotation (such as confirmation bias)

    • Designing for Fairness
    1. Consider the problem
    2. Ask experts
    3. Train the models to account for bias
    4. Interpret outcomes
    5. Publish with context
    • Types of Bias
      • Machine learning models are not inherently objective. Engineers train models by feeding them a data set of training examples, and human involvement in the provision and curation of this data can make a model's predictions susceptible to bias.
      • When building models, it's important to be aware of common human biases that can manifest in your data, so you can take proactive steps to mitigate their effects.
      • WARNING: The following inventory of biases provides just a small selection of biases that are often uncovered in machine learning data sets; this list is not intended to be exhaustive. Wikipedia's catalog of cognitive biases enumerates over 100 different types of human bias that can affect our judgment. When auditing your data, you should be on the lookout for any and all potential sources of bias that might skew your model's predictions.
      • Reporting Bias
        • Reporting bias occurs when the frequency of events, properties, and/or outcomes captured in a data set does not accurately reflect their real-world frequency. This bias can arise because people tend to focus on documenting circumstances that are unusual or especially memorable, assuming that the ordinary can "go without saying."
          • EXAMPLE: A sentiment-analysis model is trained to predict whether book reviews are positive or negative based on a corpus of user submissions to a popular website. The majority of reviews in the training data set reflect extreme opinions (reviewers who either loved or hated a book), because people were less likely to submit a review of a book if they did not respond to it strongly. As a result, the model is less able to correctly predict sentiment of reviews that use more subtle language to describe a book.
      • Automation Bias
        • Automation bias is a tendency to favor results generated by automated systems over those generated by non-automated systems, irrespective of the error rates of each.
          • EXAMPLE: Software engineers working for a sprocket manufacturer were eager to deploy the new "groundbreaking" model they trained to identify tooth defects, until the factory supervisor pointed out that the model's precision and recall rates were both 15% lower than those of human inspectors.
      • Selection Bias
        • Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their real-world distribution. Selection bias can take many different forms:
        • Coverage bias: Data is not selected in a representative fashion.
          • EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product. Consumers who instead opted to buy a competing product were not surveyed, and as a result, this group of people was not represented in the training data.
        • Non-response bias (or participation bias): Data ends up being unrepresentative due to participation gaps in the data-collection process.
          • EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Consumers who bought the competing product were 80% more likely to refuse to complete the survey, and their data was underrepresented in the sample.
        • Sampling bias: Proper randomization is not used during data collection.
          • EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Instead of randomly targeting consumers, the surveyor chose the first 200 consumers that responded to an email, who might have been more enthusiastic about the product than average purchasers.
      • Group Attribution Bias
        • Group attribution bias is a tendency to generalize what is true of individuals to an entire group to which they belong. Two key manifestations of this bias are:
        • In-group bias: A preference for members of a group to which you also belong, or for characteristics that you also share.
          • EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that applicants who attended the same computer-science academy as they both did are more qualified for the role.
        • Out-group homogeneity bias: A tendency to stereotype individual members of a group to which you do not belong, or to see their characteristics as more uniform.
          • EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that all applicants who did not attend a computer-science academy do not have sufficient expertise for the role.
      • Implicit Bias
        • Implicit bias occurs when assumptions are made based on one's own mental models and personal experiences that do not necessarily apply more generally.
          • EXAMPLE: An engineer training a gesture-recognition model uses a head shake as a feature to indicate a person is communicating the word "no." However, in some regions of the world, a head shake actually signifies "yes."
        • A common form of implicit bias is confirmation bias, where model builders unconsciously process data in ways that affirm preexisting beliefs and hypotheses. In some cases, a model builder may actually keep training a model until it produces a result that aligns with their original hypothesis; this is called experimenter's bias.
          • EXAMPLE: An engineer is building a model that predicts aggressiveness in dogs based on a variety of features (height, weight, breed, environment). The engineer had an unpleasant encounter with a hyperactive toy poodle as a child, and ever since has associated the breed with aggression. When the trained model predicted most toy poodles to be relatively docile, the engineer retrained the model several more times until it produced a result showing smaller poodles to be more violent.
    • Identifying Bias
      • As you explore your data to determine how best to represent it in your model, it's important to also keep issues of fairness in mind and proactively audit for potential sources of bias.
      • Missing Feature Values
        • If your data set has one or more features that have missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.
        • Before training a model on this data, it would be prudent to investigate the cause of these missing values, to ensure that no latent biases are responsible for them (a quick audit sketch follows).
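        • A quick pandas sketch of that audit; the feature names and values here are hypothetical:

          import numpy as np
          import pandas as pd

          df = pd.DataFrame({
              "median_income": [3.2, np.nan, 5.1, np.nan, 2.8],
              "population":    [1200, 900, np.nan, 1500, 1100],
              "rooms":         [5, 6, 4, 7, 5],
          })
          # Fraction of examples missing each feature, largest first.
          print(df.isnull().mean().sort_values(ascending=False))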
      • Unexpected Feature Values
        • When exploring data, you should also look for examples that contain feature values that stand out as especially uncharacteristic or unusual. These unexpected feature values could indicate problems that occurred during data collection or other inaccuracies that could introduce bias.
      • Data Skew
        • Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.
    • Evaluating for Bias
      • When evaluating a model, metrics calculated against an entire test or validation set don't always give an accurate picture of how fair the model is.
      • Consider a new model developed to predict the presence of tumors that is evaluated against a validation set of 1,000 patients' medical records. 500 records are from female patients, and 500 records are from male patients.
      • When we calculate metrics separately for female and male patients (as in the sketch at the end of this section), we see stark differences in model performance for each group.
      • We now have a much better understanding of the biases inherent in the model's predictions, as well as the risks to each subgroup if the model were to be released for medical use in the general population.
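      • A sketch of that per-subgroup evaluation (toy data and hypothetical column names, not the course's actual records):

        import pandas as pd

        df = pd.DataFrame({
            "sex":   ["F"] * 4 + ["M"] * 4,
            "label": [1, 0, 1, 0, 1, 0, 1, 0],  # 1 = tumor present
            "pred":  [1, 0, 0, 0, 1, 1, 1, 0],  # model's binary prediction
        })

        for group, sub in df.groupby("sex"):
            tp = ((sub.pred == 1) & (sub.label == 1)).sum()
            fp = ((sub.pred == 1) & (sub.label == 0)).sum()
            fn = ((sub.pred == 0) & (sub.label == 1)).sum()
            precision = tp / (tp + fp) if tp + fp else float("nan")
            recall = tp / (tp + fn) if tp + fn else float("nan")
            print(f"{group}: precision={precision:.2f} recall={recall:.2f}")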
      • Additional Fairness Resources
        • Fairness is a relatively new subfield within the discipline of machine learning. To learn more about research and initiatives devoted to developing new tools and techniques for identifying and mitigating bias in machine learning models, check out Google's Machine Learning Fairness resources page.