A Brief Introduction to Neural Networks
Motivation
Artificial Neurons
Issues
- Architecture: How are neurons connected together?
How do we pick the activation function? ...
- Training: How do we modify the connection
weights so that the network can learn a task?
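As a concrete sketch of what a single artificial neuron computes (a weighted sum of its inputs passed through an activation function; the sigmoid, specific weights, and inputs here are illustrative choices, not anything prescribed by these notes):

```python
import math

def neuron(x, w, b):
    """A single artificial neuron: weighted sum of inputs plus a bias,
    passed through a sigmoid activation function."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b  # pre-activation
    return 1.0 / (1.0 + math.exp(-a))             # sigmoid activation

# Example inputs and weights (arbitrary values for illustration)
out = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1)
```

A network is then a collection of such units wired together; the two issues above are how to wire them (architecture) and how to set w and b (training).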
Goal
- Suppose you are given a task/behavior to perform.
- To learn this task, you are presented with a set of examples
(x1, t1), ..., (xn, tn), where x is the stimulus and t is the response.
- From these examples you somehow figure out the relationship (i.e., determine
the mapping from x to t).
- Generalization: given an input that you have never seen, you are asked to predict
the appropriate behavior (i.e., given a new x, what is t?).
- Examples:
- Neural networks are unique because they are a method for doing nonparametric
estimation. Given a task (e.g., classifying irises):
- We do not need to know rules underlying the task.
- We do not have to assume a particular functional form for input/output relationship.
- Instead, we "present" the network with a representative set of examples of the
task, and through training the network "learns" the appropriate relationship.
Architecture
How are neurons connected together? The most common approaches are feedforward (layered) and recurrent networks.
Applications
- Regression: linear, nonlinear (Example: Auto Data)
- Classification: linear, nonlinear
(Example: Iris types)
- Unsupervised learning
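A minimal sketch of the first application: linear regression is a one-layer network with an identity activation, and its least-squares weights even have a closed-form solution. The data below is synthetic, standing in for a real dataset like the Auto Data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: t is a noisy linear function of x.
x = rng.uniform(-1, 1, size=(100, 1))
t = 3.0 * x[:, 0] + 0.5 + 0.1 * rng.standard_normal(100)

# Linear regression = one-layer network with identity activation.
# Append a bias column and solve the least-squares problem directly.
X = np.hstack([x, np.ones((100, 1))])
w, *_ = np.linalg.lstsq(X, t, rcond=None)  # w = [slope, intercept]
```

Nonlinear regression and classification replace the identity output with a hidden layer and a nonlinear activation, and then no closed form exists, which is why training becomes an optimization problem.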
Learning Algorithms
- Details depend on the application.
- Early training algorithms were for one-layer networks.
- Most learning algorithms, though, are a form of nonlinear recursive optimization:
- Define a loss (aka cost, objective, energy, error) function that quantifies
how well the network is performing.
- Perform gradient descent to minimize loss function.
- Backpropagation (really just a form of gradient descent)
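The loss-plus-gradient-descent recipe above can be sketched for a one-layer classifier, where backpropagation reduces to a single gradient expression. The two-cluster data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data: two well-separated Gaussian clusters.
x0 = rng.standard_normal((50, 2)) + [-2, -2]   # class 0
x1 = rng.standard_normal((50, 2)) + [2, 2]     # class 1
X = np.vstack([x0, x1])
t = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)
b = 0.0
lr = 0.1

for _ in range(200):
    y = 1 / (1 + np.exp(-(X @ w + b)))  # forward pass: sigmoid output
    # Gradient of the cross-entropy loss w.r.t. w and b; for a
    # one-layer network this is all backpropagation computes.
    grad_w = X.T @ (y - t) / len(t)
    grad_b = (y - t).mean()
    w -= lr * grad_w                    # gradient descent step
    b -= lr * grad_b

acc = (((X @ w + b) > 0) == t).mean()   # training accuracy
```

For multilayer networks the only change is that backpropagation applies the chain rule layer by layer to get the same kind of gradients.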
Loss (aka cost, objective, energy, error) Functions
- The loss function is typically defined as the negative log-likelihood
E = - ln P(t|x,w)
- Regression: P is Gaussian, resulting in the squared-error loss.
- Classification results in E = cross entropy. Two cases:
- P is a binomial distribution; the activation function is the sigmoid.
- P corresponds to one-of-many (multiclass) targets; the activation is the softmax function.
- For details of these see Table.
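The three negative log-likelihood cases above can be written out directly (unit-variance Gaussian assumed in the regression case; the function names are mine, not from the notes):

```python
import math

# Loss functions of the form E = -ln P(t | x, w).

def squared_error(t, y):
    # Gaussian P with unit variance -> E reduces to squared error.
    return 0.5 * (t - y) ** 2

def binary_cross_entropy(t, y):
    # Binomial P: sigmoid output y in (0, 1), target t in {0, 1}.
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def softmax_cross_entropy(t, a):
    # One-of-many targets: t is the correct class index, a the activations.
    z = [math.exp(ai - max(a)) for ai in a]  # numerically stable softmax
    p = [zi / sum(z) for zi in z]
    return -math.log(p[t])
```

Each is just -ln P for the corresponding distribution, which is why minimizing these losses is maximum-likelihood estimation of the weights.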
Image Compression