Learning Choices: Deep Learning For Dummies

Deep learning (also known as deep structured learning or hierarchical learning) is the application to learning tasks of artificial neural networks (ANNs) that contain more than one hidden layer. Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised or unsupervised.

Some representations are loosely based on interpretation of information processing and communication patterns in a biological nervous system, such as neural coding that attempts to define a relationship between various stimuli and associated neuronal responses in the brain. Research attempts to create efficient systems to learn these representations from large-scale, unlabeled data sets.

Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation and bioinformatics where they produced results comparable to and in some cases superior to human experts.

Definitions

Deep learning is a class of machine learning algorithms that:

use a cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. The algorithms may be supervised or unsupervised and applications include pattern analysis (unsupervised) and classification (supervised).
are based on the (unsupervised) learning of multiple levels of features or representations of the data. Higher level features are derived from lower level features to form a hierarchical representation.
are part of the broader machine learning field of learning representations of data.
learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.

These definitions have in common (1) multiple layers of nonlinear processing units and (2) the supervised or unsupervised learning of feature representations in each layer, with the layers forming a hierarchy from low-level to high-level features. The composition of a layer of nonlinear processing units used in a deep learning algorithm depends on the problem to be solved. Layers that have been used in deep learning include hidden layers of an artificial neural network and sets of complicated propositional formulas. They may also include latent variables organized layer-wise in deep generative models such as the nodes in Deep Belief Networks and Deep Boltzmann Machines.

Deep learning algorithms transform their inputs through more layers than shallow learning algorithms. At each layer, the signal is transformed by a processing unit, like an artificial neuron, whose parameters are iteratively adjusted through training.

Credit assignment path (CAP) - A chain of transformations from input to output. CAPs describe potentially causal connections between input and output.
CAP depth - For a feedforward neural network, the depth of the CAPs (and thus of the network) is the number of hidden layers plus one, because the output layer is also parameterized. For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.
Deep/shallow - No universally agreed upon threshold of depth divides shallow learning from deep learning, but most researchers in the field agree that deep learning has multiple nonlinear layers (CAP > 2). Schmidhuber considers CAP > 10 to be "very deep" learning.
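The depth rule above can be sketched in a few lines. This is only an illustration of the definitions in the text; the function names are hypothetical, and the thresholds (CAP > 2 for "deep", CAP > 10 for "very deep") are the rules of thumb quoted above rather than formal limits.

```python
def cap_depth_feedforward(num_hidden_layers: int) -> int:
    """CAP depth of a feedforward network: hidden layers plus the output layer."""
    if num_hidden_layers < 0:
        raise ValueError("number of hidden layers must be non-negative")
    return num_hidden_layers + 1


def exceeds_depth(cap_depth: int, threshold: int = 2) -> bool:
    """True when the CAP depth is above the given rule-of-thumb threshold."""
    return cap_depth > threshold


for hidden in (1, 2, 9, 10):
    depth = cap_depth_feedforward(hidden)
    # Columns: hidden layers, CAP depth, "deep" (CAP > 2), "very deep" (CAP > 10)
    print(hidden, depth, exceeds_depth(depth), exceeds_depth(depth, threshold=10))
```

Under this rule a network with a single hidden layer has CAP depth 2 and is not counted as deep, while two hidden layers already give CAP depth 3.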

Fundamental concepts

The assumption underlying distributed representations is that observed data are generated by the interactions of layered factors.

Deep learning adds the assumption that these layers of factors correspond to levels of abstraction or composition. Varying numbers of layers and layer sizes can provide different amounts of abstraction.

Deep learning exploits this idea of hierarchical explanatory factors where higher level, more abstract concepts are learned from the lower level ones.

Deep learning architectures are often constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features are useful for improving performance.

For supervised learning tasks, deep learning methods obviate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation.

Deep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks.

Interpretations

Deep neural networks are generally interpreted in terms of the universal approximation theorem or probabilistic inference.

The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.

In 1989, the first proof was published by Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Hornik.

The probabilistic interpretation derives from the field of machine learning. It features inference, as well as the optimization concepts of training and testing, related to fitting and generalization, respectively. More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function. The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks.

The probabilistic interpretation was introduced and popularized by Hinton, Bengio, LeCun and Schmidhuber.

History

The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Ivakhnenko and Lapa in 1965. A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm. These ideas were implemented in a computer identification system called "Alpha", which demonstrated the learning process.

Other working deep learning architectures, specifically those built from artificial neural networks (ANNs), began with the Neocognitron introduced by Fukushima in 1980. ANNs themselves date back even further. The challenge was how to train networks with multiple layers. In 1989, LeCun et al. applied the standard backpropagation algorithm, which had been around as the reverse mode of automatic differentiation since 1970, to a deep neural network with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training took an impractical 3 days.

In 1993, Schmidhuber's neural history compressor implemented as an unsupervised stack of recurrent neural networks (RNNs) solved a "Very Deep Learning" task that required more than 1,000 layers in an RNN unfolded in time.

In 1994, André C. P. L. F. de Carvalho, together with Fairhurst and Bisset, published experimental results of a multi-layer boolean neural network, also known as a weightless neural network, composed of a self-organising feature extraction neural network module followed by a classification neural network module, which were independently trained.

In 1995, Frey demonstrated that it was possible to train a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Dayan and Hinton. However, training took two days. Many factors contribute to the slow speed, including the vanishing gradient problem analyzed in 1991 by Hochreiter.

By 1991 such systems were used for recognizing isolated 2-D hand-written digits, while recognizing 3-D objects was done by matching 2-D images with a handcrafted 3-D object model. Weng et al. suggested that a human brain does not use a monolithic 3-D object model and in 1992 they published Cresceptron, a method for performing 3-D object recognition directly from cluttered scenes. Cresceptron is a cascade of layers similar to Neocognitron. But while Neocognitron required a human programmer to hand-merge features, Cresceptron automatically learned an open number of unsupervised features in each layer, where each feature is represented by a convolution kernel. Cresceptron segmented each learned object from a cluttered scene through back-analysis through the network. Max pooling, now often adopted by deep neural networks (e.g. ImageNet tests), was first used in Cresceptron to reduce the position resolution by a factor of (2x2) to 1 through the cascade for better generalization.

Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.

Both shallow and deep learning (e.g., recurrent nets) of ANNs have been explored for many years. These methods never outperformed non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively. Key difficulties have been methodologically analyzed, including gradient diminishing and weak temporal correlation structure in neural predictive models. Additional difficulties were the lack of big training data and weaker computing power.

Most speech recognition researchers moved away from neural nets to pursue generative modeling. An exception was at SRI International in the late 1990s. Funded by the US government's NSA and DARPA, SRI conducted research on deep neural networks in speech and speaker recognition. Heck's speaker recognition team achieved the first significant success with deep neural networks in speech processing, as demonstrated in the 1998 National Institute of Standards and Technology Speaker Recognition evaluation and later published in the journal Speech Communication. While SRI experienced success with deep neural networks in speaker recognition, they were unsuccessful in demonstrating similar success in speech recognition. One decade later, Hinton and Deng collaborated with each other and then with colleagues across groups at University of Toronto, Microsoft, Google and IBM, igniting a renaissance of deep feedforward neural networks in speech recognition.

Many aspects of speech recognition were taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Hochreiter and Schmidhuber in 1997. LSTM RNNs avoid the vanishing gradient problem and can learn "Very Deep Learning" tasks that require memories of events that happened thousands of discrete time steps ago, which is important for speech. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks. Later it was combined with connectionist temporal classification (CTC) in stacks of LSTM RNNs. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.

The use of the expression "Deep Learning" in the context of ANNs was introduced by Aizenberg and colleagues in 2000. In 2006, Hinton and Salakhutdinov showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation. In 1992, Schmidhuber implemented a similar idea for the more general case of unsupervised deep hierarchies of recurrent neural networks and showed its benefits for accelerating supervised learning.

Deep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and automatic speech recognition (ASR). Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image classification), as well as a range of large-vocabulary speech recognition tasks, have steadily improved. Convolutional neural networks have been superseded for ASR by CTC-trained LSTM, but are more successful in computer vision.

The impact of deep learning in industry began in the early 2000s, when CNNs already processed an estimated 10% to 20% of all the checks written in the US. Industrial applications of deep learning to large-scale speech recognition started around 2010.

In late 2009, Li Deng invited Geoff Hinton to work with him and colleagues at Microsoft to apply deep learning to speech recognition. They co-organized the 2009 NIPS Workshop on Deep Learning for Speech Recognition. The workshop was motivated by the limitations of deep generative models of speech, and the possibility that the big-compute, big-data era warranted a serious try of deep neural nets (DNNs). It was believed that pre-training DNNs using generative models of deep belief nets (DBNs) would overcome the main difficulties of neural nets. However, Deng et al. at Microsoft soon discovered that replacing pre-training with large amounts of training data for straightforward backpropagation, when using DNNs with large, context-dependent output layers, produced error rates dramatically lower than the then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) systems and also lower than more advanced generative model-based speech recognition systems. This finding was subsequently verified by other research groups. Further, the nature of recognition errors produced by the two types of systems was found to be characteristically different, offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition players.

Advances in hardware enabled the renewed interest in deep learning. In 2009, Nvidia was involved in what was called the "big bang" of deep learning, "as deep-learning neural networks were combined with Nvidia graphics processing units (GPUs)." That year, Google Brain used Nvidia GPUs to create Deep Neural Networks capable of machine learning. While there, Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times. In particular, GPUs are well suited to the matrix/vector math involved in machine learning. GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days. Specialized hardware and algorithm optimizations can be used for efficient processing.

Artificial neural networks

Some of the most successful deep learning methods involve artificial neural networks (ANNs). They were inspired by the 1959 biological model proposed by Nobel laureates Hubel and Wiesel, who found two types of cells in the primary visual cortex: simple cells and complex cells. Many ANNs can be viewed as cascading models of cell types inspired by these biological observations.

Fukushima's Neocognitron introduced convolutional neural networks partially trained by unsupervised learning with human-directed features in the neural plane. LeCun et al. (1989) applied supervised backpropagation to such architectures. Weng et al. (1992) published convolutional neural networks Cresceptron for 3-D object recognition from images of cluttered scenes and segmentation of such objects from images.

Recognizing general 3-D objects requires at least shift invariance and tolerance to deformation. Max-pooling appears to have been first proposed in Cresceptron to enable the network to tolerate small-to-large deformation in a hierarchical way, while using convolution. Max-pooling helps, but does not guarantee, shift-invariance at the pixel level.

With the advent of the back-propagation algorithm based on automatic differentiation, many researchers tried to train supervised deep ANNs from scratch. Hochreiter's diploma thesis of 1991 formally identified the reason for this failure as the vanishing gradient problem, which affects many-layered feedforward networks and recurrent neural networks. These are trained by unfolding them into deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers, impeding the tuning of neuron weights that is based on those errors.
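The exponential shrinkage described above can be seen in a toy setting. The sketch below assumes a chain of one-unit sigmoid layers that all share a single hypothetical weight; it is an illustration of the effect, not a model of any particular network. Because the sigmoid derivative is at most 0.25, each layer multiplies the backpropagated signal by a factor of at most 0.25 when the weight is 1.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(num_layers: int, weight: float = 1.0, x: float = 0.5) -> float:
    """Derivative of the output with respect to the input for a chain of
    one-unit sigmoid layers that all share the same weight."""
    activation = x
    grad = 1.0
    for _ in range(num_layers):
        activation = sigmoid(weight * activation)
        # Chain rule: d sigmoid(w * a) / da = w * s * (1 - s)
        grad *= weight * activation * (1.0 - activation)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, chain_gradient(depth))
```

With larger weights the per-layer factor can exceed one, which is the opposite failure mode (exploding gradients).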

To overcome this problem, several methods were proposed. One is Schmidhuber's multi-level hierarchy of networks (1992) pre-trained one level at a time by unsupervised learning, fine-tuned by backpropagation. Each level learns a compressed representation of the observations that is fed to the next level.

Another method is the long short-term memory (LSTM) network of Hochreiter and Schmidhuber (1997). In 2009, deep multidimensional LSTM networks won three ICDAR 2009 competitions in connected handwriting recognition, without prior knowledge about the three languages to be learned.

Behnke in 2003 relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid to solve problems such as image reconstruction and face localization.

Other methods also use unsupervised pre-training to structure a neural network, first learning generally useful feature detectors. The network is then trained further by supervised back-propagation to classify labeled data. Hinton et al. (2006) proposed learning a high-level representation using successive layers of binary or real-valued latent variables, with a restricted Boltzmann machine modeling each layer. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations. Hinton reported that his models are effective feature extractors over high-dimensional, structured data.

In 2012, the Google Brain team led by Ng and Dean created a neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.

Other methods rely on sheer processing power. In 2010, Ciresan and colleagues in Schmidhuber's group showed that despite the above-mentioned "vanishing gradient problem," the superior processing power of GPUs makes plain back-propagation feasible for many-layered feedforward neural networks. The method outperformed all other machine learning techniques on the old, famous MNIST handwritten digits problem.

In 2007, LSTM trained by CTC started to get much-improved results. This method is now widely used, for example, in Google's greatly improved speech recognition for all smartphone users.

In late 2009, deep learning feedforward networks made inroads into speech recognition, as marked by the NIPS Workshop on Deep Learning for Speech Recognition. Microsoft Research and University of Toronto researchers demonstrated by mid-2010 that deep neural networks interfaced with a hidden Markov model with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search. The same deep neural net model was scaled up to switchboard tasks at Microsoft Research Asia.

As of 2011, the state of the art in deep learning feedforward networks alternated convolutional layers and max-pooling layers, topped by several fully connected or sparsely connected layers followed by a final classification layer. Training is usually done without unsupervised pre-training. Since 2011, GPU-based implementations of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition, the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge, the ImageNet Competition and others.

Such supervised deep learning methods also were the first artificial pattern recognizers to achieve human-competitive performance on certain tasks.

Deep learning is insufficient, because biological brains use both shallow and deep circuits as reported by brain anatomy, displaying a wide variety of invariance. Weng argued that the brain self-wires largely according to signal statistics and therefore, a serial cascade cannot catch all major statistical dependencies. ANNs were able to guarantee shift invariance to deal with small and large natural objects in large cluttered scenes, only when invariance extended beyond shift, to all ANN-learned concepts, such as location, type (object class label), scale, lighting and others. This was realized in Developmental Networks (DNs) whose embodiments are Where-What Networks, WWN-1 (2008) through WWN-7 (2013).

Deep neural networks

A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers. Similar to shallow ANNs, DNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of image primitives. The extra layers enable composition of features from lower layers, giving the potential of modeling complex data with fewer units than a similarly performing shallow network.
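As a minimal sketch of how the layers compose, the following forward pass feeds each layer's output into the next. The layer sizes, the weights and the choice of tanh as the hidden nonlinearity are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights, biases)


def dense(weights: Matrix, biases: Vector, inputs: Vector) -> Vector:
    """Affine map: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(layers: Sequence[Layer], inputs: Vector) -> Vector:
    """Feed inputs through every layer: tanh on hidden layers, linear output."""
    activation = inputs
    for weights, biases in layers[:-1]:
        activation = [math.tanh(v) for v in dense(weights, biases, activation)]
    out_weights, out_biases = layers[-1]
    return dense(out_weights, out_biases, activation)


# Hypothetical 2-3-2-1 network (two hidden layers) with hand-picked weights.
network = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2], [0.3, 0.9, -0.4]], [0.05, -0.05]),
    ([[1.0, -1.0]], [0.0]),
]
print(forward(network, [1.0, 2.0]))
```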

Deep architectures include many variants of a few basic approaches. Each architecture has found success in specific domains. It is not always possible to compare the performance of multiple architectures, because they have not all been evaluated on the same data sets.

DNNs are typically feedforward networks, but include recurrent neural networks, especially LSTM, for applications such as language modeling. Convolutional deep neural networks (CNNs) are used in computer vision. CNNs also have been applied to acoustic modeling for automatic speech recognition (ASR).

Backpropagation

A DNN can be discriminatively trained with the standard backpropagation algorithm. Backpropagation is a method to calculate the gradient of the loss function (produces the cost associated with a given state) with respect to the weights in an ANN.
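To make the gradient computation concrete, here is a small sketch for a hypothetical two-weight network: a single tanh hidden unit followed by a linear output, with a squared-error loss. The architecture and the loss are assumptions chosen to keep the chain rule visible. The analytic gradient is compared with a central finite-difference estimate as a sanity check.

```python
import math
from typing import Tuple


def loss(w1: float, w2: float, x: float, target: float) -> float:
    """Squared error of a network with one tanh hidden unit and a linear output."""
    hidden = math.tanh(w1 * x)
    output = w2 * hidden
    return 0.5 * (output - target) ** 2


def backprop(w1: float, w2: float, x: float, target: float) -> Tuple[float, float]:
    """Return (dC/dw1, dC/dw2) by applying the chain rule from output to input."""
    # Forward pass, keeping the intermediate values.
    hidden = math.tanh(w1 * x)
    output = w2 * hidden
    # Backward pass.
    d_output = output - target                  # dC/d(output)
    d_w2 = d_output * hidden                    # dC/dw2
    d_hidden = d_output * w2                    # dC/d(hidden)
    d_w1 = d_hidden * (1.0 - hidden ** 2) * x   # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


def numeric_gradient(w1: float, w2: float, x: float, target: float,
                     eps: float = 1e-6) -> Tuple[float, float]:
    """Central finite differences, used only to check the analytic result."""
    g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2


print(backprop(0.3, -0.8, 1.5, 0.2))
print(numeric_gradient(0.3, -0.8, 1.5, 0.2))
```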

The basics of continuous backpropagation were derived in the context of control theory by Kelley in 1960 and by Bryson in 1961, using principles of dynamic programming. In 1962, Dreyfus published a simpler derivation based only on the chain rule. Bryson and Ho described it as a multi-stage dynamic system optimization method in 1969. In 1970, Linnainmaa published the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions. This corresponds to the modern version of backpropagation, which is efficient even when the networks are sparse. In 1973, Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients. In 1974, Werbos mentioned the possibility of applying this principle to ANNs, and in 1982, he applied Linnainmaa's AD method to neural networks in the way that is widely used today. In 1986, Rumelhart, Hinton and Williams showed that this method can generate useful internal representations of incoming data in hidden layers of neural networks. In 1993, Wan was the first to win an international pattern recognition contest through backpropagation.

The weight updates of backpropagation can be done via stochastic gradient descent using the following equation:

$w_{ij}(t+1)=w_{ij}(t)-\eta {\frac {\partial C}{\partial w_{ij}}}+\xi (t)$

where $w_{ij}$ is the weight of the connection between units $i$ and $j$, $\eta$ is the learning rate, $C$ is the cost (loss) function and $\xi (t)$ a stochastic term. The choice of the cost function depends on factors such as the learning type (supervised, unsupervised, reinforcement, etc.) and the activation function. For example, when performing supervised learning on a multiclass classification problem, common choices for the activation function and cost function are the softmax function and cross entropy function, respectively. The softmax function is defined as $p_{j}={\frac {\exp(x_{j})}{\sum _{k}\exp(x_{k})}}$ where $p_{j}$ represents the class probability (output of the unit $j$ ) and $x_{j}$ and $x_{k}$ represent the total input to units $j$ and $k$ of the same level respectively. Cross entropy is defined as $C=-\sum _{j}d_{j}\log(p_{j})$ where $d_{j}$ represents the target probability for output unit $j$ and $p_{j}$ is the probability output for $j$ after applying the activation function.
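The softmax and cross-entropy definitions above translate directly into code. The sketch below is illustrative: the function names are hypothetical, the subtraction of the maximum input is a standard numerical-stability step that does not change the result, and the gradient helper relies on the well-known identity that, for targets summing to one, the derivative of the cross entropy with respect to the softmax inputs is $p_{j}-d_{j}$.

```python
import math
from typing import List


def softmax(inputs: List[float]) -> List[float]:
    """p_j = exp(x_j) / sum_k exp(x_k), shifted by max(x) for stability."""
    shift = max(inputs)
    exps = [math.exp(x - shift) for x in inputs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(targets: List[float], probs: List[float]) -> float:
    """C = -sum_j d_j * log(p_j); terms with d_j = 0 contribute nothing."""
    return -sum(d * math.log(p) for d, p in zip(targets, probs) if d > 0)


def softmax_cross_entropy_grad(targets: List[float], inputs: List[float]) -> List[float]:
    """dC/dx_j = p_j - d_j, valid when the targets sum to one."""
    return [p - d for p, d in zip(softmax(inputs), targets)]


def sgd_step(weights: List[float], grads: List[float],
             learning_rate: float, noise: float = 0.0) -> List[float]:
    """One gradient descent update; `noise` stands in for an optional stochastic term."""
    return [w - learning_rate * g + noise for w, g in zip(weights, grads)]


logits = [2.0, 1.0, 0.1]
target = [1.0, 0.0, 0.0]
probs = softmax(logits)
print(probs, cross_entropy(target, probs))
print(sgd_step(logits, softmax_cross_entropy_grad(target, logits), learning_rate=0.5))
```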

DNNs can also be used to output object bounding boxes in the form of a binary mask. They are also used for multi-scale regression to increase localization precision. DNN-based regression can learn features that capture geometric information in addition to serving as a good classifier. This removes the requirement to explicitly model parts and their relations, which helps to broaden the variety of objects that can be learned. The model consists of multiple layers, each of which has a rectified linear unit as its activation function for non-linear transformation. Some layers are convolutional, while others are fully connected. Every convolutional layer has an additional max pooling. The network is trained to minimize L2 error for predicting the mask ranging over the entire training set containing bounding boxes represented as masks.

Challenges

As with ANNs, many issues can arise with naively trained DNNs. Two common issues are overfitting and computation time.

DNNs are prone to overfitting because of the added layers of abstraction, which allow them to model rare dependencies in the training data. Regularization methods such as Ivakhnenko's unit pruning or weight decay ( $\ell _{2}$ -regularization) or sparsity ( $\ell _{1}$ -regularization) can be applied during training to combat overfitting. Alternatively dropout regularization randomly omits units from the hidden layers during training. This helps to exclude rare dependencies.
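Two of these regularizers are easy to sketch. The code below is a hedged illustration: the dropout function uses the "inverted" variant, which rescales the surviving units so their expected value is unchanged (an assumption, since the text does not specify a variant), and the weight decay step assumes a penalty of the form $\frac{\lambda}{2}\sum w^{2}$, whose gradient is $\lambda w$. All function and parameter names are hypothetical.

```python
import random
from typing import List


def dropout(activations: List[float], drop_prob: float,
            rng: random.Random) -> List[float]:
    """Zero each unit with probability drop_prob and rescale the survivors."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


def weight_decay_step(weights: List[float], grads: List[float],
                      learning_rate: float, decay: float) -> List[float]:
    """Gradient step on the loss plus an L2 penalty (decay / 2) * sum(w^2)."""
    return [w - learning_rate * (g + decay * w)
            for w, g in zip(weights, grads)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(dropout([0.2, 0.9, 0.5, 0.7], drop_prob=0.5, rng=rng))
print(weight_decay_step([1.0, -2.0], [0.0, 0.0], learning_rate=0.1, decay=0.5))
```

Dropout is applied only during training; with the inverted variant no rescaling is needed at test time.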

DNNs must consider many training parameters, such as the size (number of layers and number of units per layer), the learning rate and initial weights. Sweeping through the parameter space for optimal parameters may not be feasible due to the cost in time and computational resources. Various tricks such as batching (computing the gradient on several training examples at once rather than individual examples) speed up computation. The large processing throughput of GPUs has produced significant speedups in training, because the matrix and vector computations required are well-suited for GPUs.
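Batching can be sketched with a deliberately tiny model. The example below assumes a one-parameter linear model $y=wx$ with squared-error loss, synthetic data generated from $w=2$, and hand-picked values for the batch size, learning rate and number of epochs; none of these come from the text. The gradient is averaged over each batch before the weight is updated.

```python
from typing import Iterator, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def minibatches(examples: Sequence[Example],
                batch_size: int) -> Iterator[Sequence[Example]]:
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def batch_gradient(w: float, batch: Sequence[Example]) -> float:
    """Mean gradient of 0.5 * (w * x - y)^2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train(examples: Sequence[Example], batch_size: int = 4,
          learning_rate: float = 0.1, epochs: int = 100) -> float:
    w = 0.0
    for _ in range(epochs):
        for batch in minibatches(examples, batch_size):
            w -= learning_rate * batch_gradient(w, batch)
    return w


# Synthetic data from y = 2x, so the fitted weight should approach 2.
data = [(0.1 * i, 0.2 * i) for i in range(1, 21)]
print(train(data))
```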

Alternatives to backpropagation include Extreme Learning Machines, "No-prop" networks, training without backtracking, "weightless" networks and non-connectionist neural networks.

Group method of data handling

According to a historic survey, the first functional Deep Learning networks with many layers were published by Ivakhnenko and Lapa in 1965. Their Group Method of Data Handling (GMDH) features fully automatic structural and parametric model optimization. The activation functions of the network nodes are Kolmogorov-Gabor polynomials that permit additions and multiplications. It used a deep feedforward multilayer perceptron with eight layers, much deeper than many later networks. The supervised learning network is grown layer by layer, where each layer is trained by regression analysis. From time to time useless items are detected using a validation set, and pruned through regularization. The size and depth of the resulting network depends on the problem.

Convolutional neural networks

CNNs are the method of choice for processing visual and other two-dimensional data. A CNN is composed of one or more convolutional layers with fully connected layers (matching those in typical artificial neural networks) on top. It uses tied weights and pooling layers. In particular, max-pooling is often used in Fukushima's convolutional architecture. This architecture allows CNNs to take advantage of the 2D structure of input data.
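The two building blocks named above, a convolutional layer with tied weights and max-pooling, can be sketched for a single channel as follows. The sketch assumes "valid" borders (no padding), stride 1 for the convolution and non-overlapping 2x2 pooling windows. As in most deep learning libraries, the kernel is applied without flipping, which is strictly a cross-correlation.

```python
from typing import List

Grid = List[List[float]]


def conv2d_valid(image: Grid, kernel: Grid) -> Grid:
    """Slide one shared kernel over the image (tied weights, no padding)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k_rows) for j in range(k_cols))
             for c in range(out_cols)]
            for r in range(out_rows)]


def max_pool_2x2(feature_map: Grid) -> Grid:
    """Keep the maximum of each non-overlapping 2x2 window.
    A trailing odd row or column is dropped."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]


image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
edge_kernel = [[1.0, -1.0], [1.0, -1.0]]  # hypothetical horizontal-difference filter
features = conv2d_valid(image, edge_kernel)  # 5x5 input, 2x2 kernel -> 4x4 map
pooled = max_pool_2x2(features)              # 4x4 map -> 2x2 map
print(len(features), len(features[0]), len(pooled), len(pooled[0]))
```

The same four kernel weights are reused at every image position, which is why a convolutional layer has far fewer parameters than a fully connected layer of the same output size.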

CNNs have shown superior results in both image and speech applications. They can be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate. Examples of applications in computer vision include DeepDream.

Neural history compressor

The vanishing gradient problem of automatic differentiation or backpropagation in neural networks was partially overcome in 1992 by an early generative model called the neural history compressor, implemented as an unsupervised stack of RNNs. The RNN at the input level learns to predict its next input from the previous input history. Only unpredictable inputs become inputs to the next higher level RNN, which therefore recomputes its internal state less often. Each higher level RNN thus studies a compressed representation of the information in the RNN below. The input sequence can still be precisely reconstructed from the sequence representation at the highest level. The system effectively minimises the description length or the negative logarithm of the probability of the data. If the data features learnable predictability, the highest level RNN can use supervised learning to easily classify even deep sequences with long time intervals between important events. In 1993, such a system already solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time.
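The compression principle, passing only unpredicted inputs upward while keeping the sequence exactly reconstructible, can be illustrated with a toy stand-in. The sketch below is not the RNN-based system described above: it replaces the learned RNN predictor with a simple lookup table that remembers the last observed successor of each symbol, and it records the time step of each surprise so that the original sequence can be rebuilt. Symbols are assumed to be non-`None` values.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Surprise = Tuple[int, Hashable]  # (time step, unpredicted symbol)


def compress(sequence: List[Hashable]) -> List[Surprise]:
    """Return only the inputs that the table-based predictor failed to predict."""
    successor: Dict[Optional[Hashable], Hashable] = {}
    previous: Optional[Hashable] = None
    surprises: List[Surprise] = []
    for step, symbol in enumerate(sequence):
        if successor.get(previous) != symbol:
            surprises.append((step, symbol))  # would be passed to the next level
            successor[previous] = symbol      # the predictor learns from its error
        previous = symbol
    return surprises


def decompress(surprises: List[Surprise], length: int) -> List[Hashable]:
    """Rebuild the sequence by replaying the same predictor and its corrections."""
    successor: Dict[Optional[Hashable], Hashable] = {}
    corrections = dict(surprises)
    previous: Optional[Hashable] = None
    output: List[Hashable] = []
    for step in range(length):
        if step in corrections:
            symbol = corrections[step]
            successor[previous] = symbol
        else:
            symbol = successor[previous]
        output.append(symbol)
        previous = symbol
    return output


sequence = list("abababababcabab")
surprises = compress(sequence)
print(len(sequence), len(surprises))
print(decompress(surprises, len(sequence)) == sequence)
```

The more predictable the sequence, the fewer items reach the higher level, which is the sense in which each level sees a compressed version of the level below.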

It is possible to distill the entire RNN hierarchy into two RNNs, the "conscious" chunker (higher level) and the "subconscious" automatizer (lower level). Once the chunker has learned to predict and compress inputs that are unpredictable by the automatizer, the automatizer is forced in the next learning phase to predict or imitate through special additional units the hidden units of the more slowly changing chunker. This makes it easy for the automatizer to form stable memories across long time intervals. This in turn helps the automatizer to make many of its once unpredictable inputs predictable, such that the chunker can focus on the remaining still unpredictable events, further compressing the data.

Recursive neural networks

A recursive neural network is created by applying the same set of weights recursively over a differentiable graph-like structure, by traversing the structure in topological order. Such networks are typically trained by the reverse mode of automatic differentiation. They were introduced to learn distributed representations of structure, such as logical terms. A special case of recursive neural networks is the recurrent neural network (RNN), whose structure corresponds to a linear chain. Recursive neural networks have been applied to natural language processing. The Recursive Neural Tensor Network uses a tensor-based composition function for all nodes in the tree.
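A minimal sketch of the idea, assuming a binary tree whose leaves are fixed-size vectors and a single tanh composition layer shared by every internal node. The weights and the tree are hypothetical, and the plain matrix composition used here is simpler than the tensor-based function of the Recursive Neural Tensor Network. Children are encoded before their parent, which is a topological order of the tree.

```python
import math
from typing import List, Tuple, Union

Vector = List[float]
Tree = Union[Vector, Tuple["Tree", "Tree"]]  # leaf vector or (left, right) pair


def compose(left: Vector, right: Vector,
            weights: List[Vector], bias: Vector) -> Vector:
    """Parent vector = tanh(W [left; right] + b), with W shared by all nodes."""
    children = left + right  # concatenation of the two child vectors
    return [math.tanh(sum(w * x for w, x in zip(row, children)) + b)
            for row, b in zip(weights, bias)]


def encode(tree: Tree, weights: List[Vector], bias: Vector) -> Vector:
    """Encode children first, then combine them with the shared weights."""
    if isinstance(tree, tuple):
        left = encode(tree[0], weights, bias)
        right = encode(tree[1], weights, bias)
        return compose(left, right, weights, bias)
    return tree  # a leaf is already a vector


# Hypothetical 2-dimensional embeddings and a 2x4 shared weight matrix.
shared_weights = [[0.5, -0.3, 0.8, 0.1], [-0.2, 0.7, 0.4, -0.6]]
shared_bias = [0.0, 0.1]
a, b, c = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]
print(encode(((a, b), c), shared_weights, shared_bias))
```

A left-nested tree such as `((a, b), c)` processes its leaves one after another, which corresponds to the linear-chain special case mentioned above.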

Long short-term memory

Long short-term memory (LSTM) networks are RNNs that avoid the vanishing gradient problem. LSTM is normally augmented by recurrent gates called forget gates. LSTM networks prevent backpropagated errors from vanishing or exploding. Instead errors can flow backwards through unlimited numbers of virtual layers in space-unfolded LSTM. That is, LSTM can learn "very deep learning" tasks that require memories of events that happened thousands or even millions of discrete time steps ago. Problem-specific LSTM-like topologies can be evolved. LSTM can handle long delays and signals that have a mix of low and high frequency components.
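One step of an LSTM cell with a forget gate can be sketched with scalars in place of vectors. The equations are the commonly used formulation (input, forget and output gates plus a tanh candidate); the parameter names and values are hypothetical. The demonstration sets the forget gate close to one and the input gate close to zero, so the cell state is carried almost unchanged across many steps.

```python
import math
from typing import Dict, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float,
              p: Dict[str, float]) -> Tuple[float, float]:
    """One scalar LSTM step; returns the new hidden state and cell state."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g  # additive cell update carries memory forward
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: forget gate saturated open, input gate nearly closed.
params = {"wi": 0.0, "ui": 0.0, "bi": -20.0,
          "wf": 0.0, "uf": 0.0, "bf": 20.0,
          "wo": 0.0, "uo": 0.0, "bo": 0.0,
          "wg": 0.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0
for _ in range(1000):
    h, c = lstm_step(0.0, h, c, params)
print(c)  # stays close to the initial cell state of 1.0
```

Because the cell state is updated additively and scaled by a gate that can stay near one, the stored value (and the error signal flowing back through it) does not have to shrink at every step.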

Stacks of LSTM RNNs trained by Connectionist Temporal Classification (CTC) can find an RNN weight matrix that maximizes the probability of the label sequences in a training set, given the corresponding input sequences. CTC achieves both alignment and recognition.
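
The way CTC ties alignment to recognition can be illustrated with a brute-force sketch. It assumes the usual CTC convention that a frame-level path is mapped to a label sequence by merging repeated symbols and then deleting a blank symbol; the probability of a label sequence is the sum over all paths that map to it. The names `collapse`, `label_probability` and the blank symbol `"-"` are choices made for this example, and real implementations use dynamic programming rather than enumeration.

```python
from itertools import groupby, product

BLANK = "-"

def collapse(path):
    """Merge repeated symbols, then drop blanks (e.g. 'aa-a' becomes 'aa')."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

def label_probability(frame_probs, label):
    """Sum the probabilities of all frame-level paths that collapse to label.

    frame_probs is a list with one dict per time step, mapping each symbol
    (including BLANK) to its output probability at that step.
    """
    symbols = sorted(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == label:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return total

# Three frames, two symbols, uniform outputs (hypothetical values).
uniform = [{"-": 0.5, "a": 0.5}] * 3
p_a = label_probability(uniform, "a")
```

Training adjusts the network weights so that this summed probability is high for the correct label sequence of each training input.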

In 2003, LSTM started to become competitive with traditional speech recognizers. In 2007, the combination with CTC achieved its first good results on speech data. In 2009, a CTC-trained LSTM was the first RNN to win pattern recognition contests, when it won several competitions in connected handwriting recognition. In 2014, Baidu used CTC-trained RNNs to break the Switchboard Hub5'00 speech recognition benchmark, without traditional speech processing methods. LSTM also improved large-vocabulary speech recognition, text-to-speech synthesis (for Google Android), and photo-real talking heads. In 2015, Google's speech recognition experienced a 49% improvement through CTC-trained LSTM.

LSTM became popular in Natural Language Processing. Unlike previous models based on HMMs and similar concepts, LSTM can learn to recognise context-sensitive languages. LSTM improved machine translation, language modeling and multilingual language processing. LSTM combined with CNNs improved automatic image captioning.

Deep reservoir computing

Deep reservoir computing offers efficiently trained models for hierarchical processing of temporal data (deepESN), while enabling the investigation of the inherent role of RNN layered composition.

Deep belief networks

A deep belief network (DBN) is a probabilistic, generative model made up of multiple layers of hidden units. It can be considered a composition of simple learning modules that make up each layer.

A DBN can be used to generatively pre-train a DNN by using the learned DBN weights as the initial DNN weights. Backpropagation or other discriminative algorithms can then tune these weights. This is particularly helpful when training data are limited, because poorly initialized weights can significantly hinder model performance. The pre-trained weights lie in a region of the weight space that is closer to the optimal weights than randomly chosen weights would be. This allows for both improved modeling and faster convergence of the fine-tuning phase.

A DBN can be efficiently trained in an unsupervised, layer-by-layer manner, where the layers are typically made of restricted Boltzmann machines (RBM). An RBM is an undirected, generative energy-based model with a "visible" input layer and a hidden layer, and connections between but not within layers. The training method for RBMs proposed by Hinton for use with training "Product of Expert" models is called contrastive divergence (CD). CD provides an approximation to the maximum likelihood method that would ideally be applied for learning the weights of the RBM. In training a single RBM, weight updates are performed with gradient ascent via the following equation: $w_{ij}(t+1)=w_{ij}(t)+\eta {\frac {\partial \log(p(v))}{\partial w_{ij}}}$

where $p(v)$ is the probability of a visible vector, which is given by $p(v)={\frac {1}{Z}}\sum _{h}e^{-E(v,h)}$. $Z$ is the partition function (used for normalizing) and $E(v,h)$ is the energy function assigned to the state of the network. A lower energy indicates the network is in a more "desirable" configuration. The gradient ${\frac {\partial \log(p(v))}{\partial w_{ij}}}$ has the simple form $\langle v_{i}h_{j}\rangle _{\text{data}}-\langle v_{i}h_{j}\rangle _{\text{model}}$ where $\langle \cdots \rangle _{p}$ represents an average with respect to distribution $p$. The issue arises in sampling $\langle v_{i}h_{j}\rangle _{\text{model}}$ because this requires running alternating Gibbs sampling for a long time. CD replaces this step by running alternating Gibbs sampling for $n$ steps (values of $n=1$ have empirically been shown to perform well). After $n$ steps, the data are sampled and that sample is used in place of $\langle v_{i}h_{j}\rangle _{\text{model}}$. The CD procedure works as follows:

Initialize the visible units to a training vector.
Update the hidden units in parallel given the visible units: $p(h_{j}=1\mid {\textbf {V}})=\sigma (b_{j}+\sum _{i}v_{i}w_{ij})$ . $\sigma$ is the sigmoid function and $b_{j}$ is the bias of $h_{j}$ .
Update the visible units in parallel given the hidden units: $p(v_{i}=1\mid {\textbf {H}})=\sigma (a_{i}+\sum _{j}h_{j}w_{ij})$ . $a_{i}$ is the bias of $v_{i}$ . This is called the "reconstruction" step.
Re-update the hidden units in parallel given the reconstructed visible units using the same equation as in step 2.
Perform the weight update: $\Delta w_{ij}\propto \langle v_{i}h_{j}\rangle _{\text{data}}-\langle v_{i}h_{j}\rangle _{\text{reconstruction}}$ .
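
The five steps above can be sketched for a single training vector with $n=1$. This is a minimal illustration in plain Python, not a tuned implementation. Two choices go beyond the text and are assumptions: hidden probabilities (rather than sampled states) are used in the correlation statistics, and the biases $a_{i}$ and $b_{j}$ are updated with the analogous difference rule. The function names and the learning rate `lr` are also made up for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Draw independent binary states from a list of probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, W, b):
    # p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij)
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    # p(v_i = 1 | h) = sigma(a_i + sum_j h_j w_ij)
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]

def cd1_update(v0, W, a, b, lr, rng):
    """Apply one CD-1 update in place for the training vector v0."""
    ph0 = hidden_probs(v0, W, b)             # step 2: hidden given data
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, a), rng)  # step 3: reconstruction
    ph1 = hidden_probs(v1, W, b)             # step 4: hidden given reconstruction
    for i in range(len(v0)):                 # step 5: data minus reconstruction
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(a)):                  # bias updates (assumed rule)
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])

# Hypothetical RBM with 4 visible and 3 hidden units, trained on one pattern.
rng = random.Random(0)
weights = [[0.0] * 3 for _ in range(4)]
vis_bias, hid_bias = [0.0] * 4, [0.0] * 3
for _ in range(200):
    cd1_update([1, 1, 0, 0], weights, vis_bias, hid_bias, 0.1, rng)
```

In practice the update is averaged over mini-batches of training vectors rather than applied one vector at a time.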

Once an RBM is trained, another RBM is "stacked" atop it, taking its input from the final trained layer. The new visible layer is initialized to a training vector, and values for the units in the already-trained layers are assigned using the current weights and biases. The new RBM is then trained with the procedure above. This whole process is repeated until some desired stopping criterion is met.

Although CD is only a crude approximation to maximum likelihood (it has been shown not to follow the gradient of any function), it is empirically effective in training deep architectures.

Convolutional deep belief networks

Convolutional deep belief networks (CDBN) have structure very similar to convolutional neural networks and are trained similarly to deep belief networks. Therefore, they exploit the 2D structure of images, like CNNs do, and make use of pre-training like deep belief networks. They provide a generic structure that can be used in many image and signal processing tasks. Benchmark results on standard image datasets like CIFAR have been obtained using CDBNs.

Large memory storage and retrieval neural networks

Large memory storage and retrieval neural networks (LAMSTAR) are fast deep learning neural networks of many layers that can use many filters simultaneously. These filters may be nonlinear, stochastic, logic, non-stationary, or even non-analytical. They are biologically motivated and learn continuously.

A LAMSTAR neural network may serve as a dynamic neural network in spatial or time domains or both. Its speed is provided by Hebbian link-weights that integrate the various and usually different filters (preprocessing functions) into its many layers and dynamically rank the significance of the various layers and functions relative to a given learning task. This grossly imitates biological learning, which integrates various preprocessors (cochlea, retina, etc.) and cortexes (auditory, visual, etc.) and their various regions. Its deep learning capability is further enhanced by using inhibition and correlation, and by its ability to cope with incomplete data or with "lost" neurons or layers, even in the middle of a task. It is fully transparent due to its link weights. The link-weights allow dynamic determination of innovation and redundancy, and facilitate the ranking of layers, of filters or of individual neurons relative to a task.

LAMSTAR has been applied to many domains, including medical and financial predictions, adaptive filtering of noisy speech in unknown noise, still-image recognition, video image recognition, software security and adaptive control of non-linear systems. LAMSTAR had a much faster learning speed and somewhat lower error rate than a CNN based on ReLU-function filters and max pooling, in 20 comparative studies.

These applications show that LAMSTAR can delve into aspects of the data that are hidden from shallow learning networks and the human senses, for example when predicting the onset of sleep apnea events, analyzing the electrocardiogram of a fetus as recorded from skin-surface electrodes placed on the mother's abdomen early in pregnancy, making financial predictions, or blindly filtering noisy speech.

LAMSTAR was proposed in 1996 (U.S. Patent 5,920,852 A) and was further developed by Graupe and Kordylewski from 1997 to 2002. A modified version, known as LAMSTAR 2, was developed by Schneider and Graupe in 2008.

Deep Boltzmann machines

A deep Boltzmann machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units ${\boldsymbol {\nu }}\in \{0,1\}^{D}$ and a series of layers of hidden units ${\boldsymbol {h}}^{(1)}\in \{0,1\}^{F_{1}},{\boldsymbol {h}}^{(2)}\in \{0,1\}^{F_{2}},\ldots ,{\boldsymbol {h}}^{(L)}\in \{0,1\}^{F_{L}}$. There is no connection between units of the same layer (as in an RBM). For a DBM with three hidden layers, the probability assigned to vector ${\boldsymbol {\nu }}$ is

$p({\boldsymbol {\nu }})={\frac {1}{Z}}\sum _{h}e^{\sum _{ij}W_{ij}^{(1)}\nu _{i}h_{j}^{(1)}+\sum _{jl}W_{jl}^{(2)}h_{j}^{(1)}h_{l}^{(2)}+\sum _{lm}W_{lm}^{(3)}h_{l}^{(2)}h_{m}^{(3)}},$

where ${\boldsymbol {h}}=\{{\boldsymbol {h}}^{(1)},{\boldsymbol {h}}^{(2)},{\boldsymbol {h}}^{(3)}\}$ are the set of hidden units, and $\theta =\{{\boldsymbol {W}}^{(1)},{\boldsymbol {W}}^{(2)},{\boldsymbol {W}}^{(3)}\}$ are the model parameters, representing visible-hidden and hidden-hidden interactions. If ${\boldsymbol {W}}^{(2)}=0$ and ${\boldsymbol {W}}^{(3)}=0$ the network is the restricted Boltzmann machine. Interactions are symmetric because links are undirected. By contrast, in DBN only the top two layers form a restricted Boltzmann machine (which is an undirected graphical model), but lower layers form a directed generative model.

Like DBNs, DBMs can learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using limited, labeled data to fine-tune the representations built using a large supply of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, they run inference and training in both directions, with a bottom-up and a top-down pass, which allows DBMs to better unveil the representations of the input structures.

However, the slow speed of DBMs limits their performance and functionality. Because exact maximum likelihood learning is intractable for DBMs, only approximate maximum likelihood learning is possible. Another option is to use mean-field inference to estimate data-dependent expectations, and approximate the expected sufficient statistics by using Markov chain Monte Carlo (MCMC). This approximate inference, which must be done for each test input, is about 25 to 50 times slower than a single bottom-up pass in DBMs. This makes joint optimization impractical for large data sets, and restricts the use of DBMs for tasks such as feature representation.

Stacked (de-noising) auto-encoders

The auto encoder idea is motivated by the concept of a good representation. For example, for a classifier, a good representation can be defined as one that yields a better-performing classifier.

An encoder is a deterministic mapping $f_{\theta }$ that transforms an input vector x into hidden representation y, where $\theta =\{{\boldsymbol {W}},b\}$ , ${\boldsymbol {W}}$ is the weight matrix and b is an offset vector (bias). A decoder maps back the hidden representation y to the reconstructed input z via $g_{\theta }$ . The whole process of auto encoding is to compare this reconstructed input to the original and try to minimize the error to make the reconstructed value as close as possible to the original.

In stacked denoising auto encoders, the partially corrupted output is cleaned (de-noised). This idea was introduced in 2010 by Vincent et al. with a specific approach to good representation: a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. Implicit in this definition are the following ideas:

The higher level representations are relatively stable and robust to input corruption;
It is necessary to extract features that are useful for representation of the input distribution.

The algorithm starts with a stochastic mapping of ${\boldsymbol {x}}$ to ${\tilde {\boldsymbol {x}}}$ through $q_{D}({\tilde {\boldsymbol {x}}}|{\boldsymbol {x}})$; this is the corrupting step. Then the corrupted input ${\tilde {\boldsymbol {x}}}$ passes through a basic auto-encoder process and is mapped to a hidden representation ${\boldsymbol {y}}=f_{\theta }({\tilde {\boldsymbol {x}}})=s({\boldsymbol {W}}{\tilde {\boldsymbol {x}}}+b)$. From this hidden representation, we can reconstruct ${\boldsymbol {z}}=g_{\theta }({\boldsymbol {y}})$. In the last stage, a minimization algorithm runs in order to bring ${\boldsymbol {z}}$ as close as possible to the uncorrupted input ${\boldsymbol {x}}$. The reconstruction error $L_{H}({\boldsymbol {x}},{\boldsymbol {z}})$ might be either the cross-entropy loss with an affine-sigmoid decoder, or the squared error loss with an affine decoder.
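
The forward pass of one denoising auto-encoder can be sketched as follows. The sketch assumes masking noise for $q_{D}$ (each component is set to zero with some probability), a sigmoid for $s$, an affine-sigmoid decoder with its own weights, and the cross-entropy loss. The names `corrupt`, `affine_sigmoid`, `W_dec`, `b_dec` and `drop_prob` are illustrative, and the gradient step that would minimize the loss is omitted.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def corrupt(x, drop_prob, rng):
    """Masking noise: zero each component independently with drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def affine_sigmoid(W, bias, v):
    """Compute s(W v + bias) with W given as a list of rows."""
    return [sigmoid(sum(w * vi for w, vi in zip(row, v)) + b_k)
            for row, b_k in zip(W, bias)]

def cross_entropy(x, z):
    """Cross-entropy between a target x in [0, 1] and reconstruction z."""
    eps = 1e-12  # guards against log(0)
    return -sum(xi * math.log(zi + eps) + (1 - xi) * math.log(1 - zi + eps)
                for xi, zi in zip(x, z))

def denoising_loss(x, W, b, W_dec, b_dec, drop_prob, rng):
    x_tilde = corrupt(x, drop_prob, rng)   # corrupting step
    y = affine_sigmoid(W, b, x_tilde)      # encoder on the corrupted input
    z = affine_sigmoid(W_dec, b_dec, y)    # decoder
    return cross_entropy(x, z), x_tilde, y, z  # loss is against the clean x

# Hypothetical sizes: 4 inputs, 2 hidden units, small random weights.
rng = random.Random(0)
W = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(2)]
W_dec = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]
loss, x_tilde, y, z = denoising_loss(
    [1.0, 0.0, 1.0, 1.0], W, [0.0] * 2, W_dec, [0.0] * 4, 0.3, rng)
```

The key design point is that the encoder sees only the corrupted vector while the loss compares the reconstruction with the clean one.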

To make a deep architecture, auto encoders are stacked. Once the encoding function $f_{\theta }$ of the first denoising auto encoder has been learned and used to clean the corrupted input, the second level can be trained.

Once the stacked auto encoder is trained, its output can be used as the input to a supervised learning algorithm such as support vector machine classifier or a multi-class logistic regression.

Deep stacking networks

A deep stacking network (DSN) (deep convex network) is based on a hierarchy of blocks of simplified neural network modules. It was introduced in 2011 by Deng and Yu. It formulates the weight-learning problem as a convex optimization problem with a closed-form solution, emphasizing the mechanism's similarity to stacked generalization. Each DSN block is a simple module that is easy to train by itself in a supervised fashion, without backpropagation across the entire stack of blocks.

Each block consists of a simplified multi-layer perceptron (MLP) with a single hidden layer. The hidden layer h has logistic sigmoidal units, and the output layer has linear units. Connections between these layers are represented by weight matrix U; input-to-hidden-layer connections have weight matrix W. Target vectors t form the columns of matrix T, and the input data vectors x form the columns of matrix X. The matrix of hidden units is ${\boldsymbol {H}}=\sigma ({\boldsymbol {W}}^{T}{\boldsymbol {X}})$. Modules are trained in order, so lower-layer weights W are known at each stage. The function $\sigma$ performs the element-wise logistic sigmoid operation. Each block estimates the same final label class y, and its estimate is concatenated with original input X to form the expanded input for the next block. Thus, the input to the first block contains the original data only, while downstream blocks' input adds the output of preceding blocks. Then learning the upper-layer weight matrix U given other weights in the network can be formulated as a convex optimization problem:

$\min _{{\boldsymbol {U}}^{T}}f=\|{\boldsymbol {U}}^{T}{\boldsymbol {H}}-{\boldsymbol {T}}\|_{F}^{2},$

which has a closed-form solution.
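
As a short worked step, and assuming the objective is the squared Frobenius error between the linear outputs ${\boldsymbol {U}}^{T}{\boldsymbol {H}}$ and the targets ${\boldsymbol {T}}$, the closed form follows from setting the gradient with respect to ${\boldsymbol {U}}$ to zero:

$f({\boldsymbol {U}})=\|{\boldsymbol {U}}^{T}{\boldsymbol {H}}-{\boldsymbol {T}}\|_{F}^{2},\qquad {\frac {\partial f}{\partial {\boldsymbol {U}}}}=2{\boldsymbol {H}}({\boldsymbol {H}}^{T}{\boldsymbol {U}}-{\boldsymbol {T}}^{T})=0$

$\Rightarrow \;{\boldsymbol {H}}{\boldsymbol {H}}^{T}{\boldsymbol {U}}={\boldsymbol {H}}{\boldsymbol {T}}^{T}\;\Rightarrow \;{\boldsymbol {U}}=({\boldsymbol {H}}{\boldsymbol {H}}^{T})^{-1}{\boldsymbol {H}}{\boldsymbol {T}}^{T}$

This holds when ${\boldsymbol {H}}{\boldsymbol {H}}^{T}$ is invertible; a small ridge term is a common remedy otherwise, though that is an addition not described in the text.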

Unlike other deep architectures, such as DBNs, the goal is not to discover the transformed feature representation. The structure of the hierarchy of this kind of architecture makes parallel learning straightforward, as a batch-mode optimization problem. In purely discriminative tasks, DSNs perform better than conventional DBNs.

Tensor deep stacking networks

This architecture is a DSN extension. It offers two important improvements: it uses higher-order information from covariance statistics, and it transforms the non-convex problem of a lower-layer to a convex sub-problem of an upper-layer. TDSNs use covariance statistics in a bilinear mapping from each of two distinct sets of hidden units in the same layer to predictions, via a third-order tensor.

While parallelization and scalability are not considered seriously in conventional DNNs, all learning for DSNs and TDSNs is done in batch mode, to allow parallelization. Parallelization allows scaling the design to larger (deeper) architectures and data sets.

The basic architecture is suitable for diverse tasks such as classification and regression.

Spike-and-slab RBMs

The need for deep learning with real-valued inputs, as in Gaussian restricted Boltzmann machines (GRBMs), led to the spike-and-slab RBM (ssRBM), which models continuous-valued inputs with strictly binary latent variables. Like basic RBMs and their variants, a spike-and-slab RBM is a bipartite graph, and, as in GRBMs, the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete probability mass at zero, while a slab is a density over a continuous domain; their mixture forms a prior.

An extension of ssRBM called µ-ssRBM provides extra modeling capacity using additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation.

Compound hierarchical-deep models

Compound hierarchical-deep models compose deep networks with non-parametric Bayesian models. Features can be learned using deep architectures such as DBNs, DBMs, deep auto encoders, convolutional variants, ssRBMs, deep coding networks, DBNs with sparse feature learning, RNNs, conditional DBNs, de-noising auto encoders. This provides a better representation, allowing faster learning and more accurate classification with high-dimensional data. However, these architectures are poor at learning novel classes with few examples, because all network units are involved in representing the input (a distributed representation) and must be adjusted together (high degree of freedom). Limiting the degree of freedom reduces the number of parameters to learn, facilitating learning of new classes from few examples. Hierarchical Bayesian (HB) models allow learning from few examples, for example for computer vision, statistics and cognitive science.

Compound HD architectures aim to integrate characteristics of both HB and deep networks. The compound HDP-DBM architecture is a hierarchical Dirichlet process (HDP) as a hierarchical model, incorporated with DBM architecture. It is a full generative model, generalized from abstract concepts flowing through the layers of the model, which is able to synthesize new examples in novel classes that look "reasonably" natural. All the levels are learned jointly by maximizing a joint log-probability score.

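In a DBM with three hidden layers, the probability of a visible input ${\boldsymbol {\nu }}$ is:

$p({\boldsymbol {\nu }};\psi )={\frac {1}{Z}}\sum _{h}e^{\sum _{ij}W_{ij}^{(1)}\nu _{i}h_{j}^{(1)}+\sum _{jl}W_{jl}^{(2)}h_{j}^{(1)}h_{l}^{(2)}+\sum _{lm}W_{lm}^{(3)}h_{l}^{(2)}h_{m}^{(3)}},$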

where ${\boldsymbol {h}}=\{{\boldsymbol {h}}^{(1)},{\boldsymbol {h}}^{(2)},{\boldsymbol {h}}^{(3)}\}$ is the set of hidden units, and $\psi =\{{\boldsymbol {W}}^{(1)},{\boldsymbol {W}}^{(2)},{\boldsymbol {W}}^{(3)}\}$ are the model parameters, representing visible-hidden and hidden-hidden symmetric interaction terms.

A learned DBM model is an undirected model that defines the joint distribution $P(\nu ,h^{1},h^{2},h^{3})$ . One way to express what has been learned is the conditional model $P(\nu ,h^{1},h^{2}|h^{3})$ and a prior term $P(h^{3})$ .

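Here $P(\nu ,h^{1},h^{2}|h^{3})$ represents a conditional DBM model, which can be viewed as a two-layer DBM but with bias terms given by the states of $h^{3}$:

$P(\nu ,h^{1},h^{2}|h^{3})={\frac {1}{Z(\psi ,h^{3})}}e^{\sum _{ij}W_{ij}^{(1)}\nu _{i}h_{j}^{1}+\sum _{jl}W_{jl}^{(2)}h_{j}^{1}h_{l}^{2}+\sum _{lm}W_{lm}^{(3)}h_{l}^{2}h_{m}^{3}}.$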

Deep coding networks

A model that can actively update itself from the context in data has advantages. A deep predictive coding network (DPCN) is a predictive coding scheme where top-down information is used to empirically adjust the priors needed for a bottom-up inference procedure by means of a deep locally connected generative model. This works by extracting sparse features from time-varying observations using a linear dynamical model. Then, a pooling strategy is used to learn invariant feature representations. These units compose to form a deep architecture and are trained by greedy layer-wise unsupervised learning. The layers constitute a kind of Markov chain such that the states at any layer depend only on the preceding and succeeding layers.

A DPCN predicts the representation of a layer top-down, using the information in the layer above together with temporal dependencies from previous states.

DPCNs can be extended to form a convolutional network.

Deep Q-networks

A deep Q-network (DQN) is a type of deep learning model that combines a deep CNN with Q-learning, a form of reinforcement learning. Unlike earlier reinforcement learning agents, DQNs can learn directly from high-dimensional sensory inputs.
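
The Q-learning half of this combination can be sketched with a lookup table standing in for the network. The table is an assumption made to keep the example small: in a DQN the values `Q[state][action]` would instead be produced by the CNN, and the same target would serve as its regression target. The names `q_update`, `alpha` (learning rate) and `gamma` (discount factor) are illustrative.

```python
def q_update(Q, state, action, reward, next_state, alpha, gamma, terminal=False):
    """Apply one Q-learning update in place and return the target used.

    Q maps each state to a dict of action values.
    """
    # Value of the best action available in the next state (0 at episode end).
    best_next = 0.0 if terminal else max(Q[next_state].values())
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    Q[state][action] += alpha * (target - Q[state][action])
    return target

# Hypothetical two-state problem.
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
target = q_update(Q, "s0", "right", reward=1.0, next_state="s1",
                  alpha=0.5, gamma=0.9)
```

Replacing the table with a function approximator is what allows learning from inputs, such as raw screen pixels, whose states are far too numerous to enumerate.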

Preliminary results were presented in 2014, with an accompanying paper in February 2015. The research described an application to Atari 2600 gaming. Other deep reinforcement learning models preceded it.

Networks with separate memory structures

Integrating external memory with ANNs dates to early research in distributed representations and Kohonen's self-organizing maps. For example, in sparse distributed memory or hierarchical temporal memory, the patterns encoded by neural networks are used as addresses for content-addressable memory, with "neurons" essentially serving as address encoders and decoders. However, the early controllers of such memories were not differentiable.

LSTM-related differentiable memory structures

Apart from long short-term memory (LSTM), other approaches also added differentiable memory to recurrent functions. For example:

Differentiable push and pop actions for alternative memory networks called neural stack machines
Memory networks where the control network's external differentiable storage is in the fast weights of another network
LSTM forget gates
Self-referential RNNs with special output units for addressing and rapidly manipulating the RNN's own weights in differentiable fashion (internal storage)
Learning to transduce with unbounded memory

Neural Turing machines

Neural Turing machines couple LSTM networks to external memory resources, with which they can interact by attentional processes. The combined system is analogous to a Turing machine but is differentiable end-to-end, allowing it to be efficiently trained by gradient descent. Preliminary results demonstrate that neural Turing machines can infer simple algorithms such as copying, sorting and associative recall from input and output examples.
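
One way to picture a differentiable, attention-based memory read is a softmax over similarities between a query key and each memory row. The sketch below assumes cosine similarity scaled by a sharpness factor; this particular form, and the names `soft_read`, `key` and `sharpness`, are assumptions for illustration rather than a description of the published architecture.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small term avoids division by zero

def soft_read(memory, key, sharpness):
    """Return attention weights over memory rows and the blended read vector."""
    scores = [sharpness * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    read = [sum(w * row[k] for w, row in zip(weights, memory))
            for k in range(len(key))]
    return weights, read

# Hypothetical memory with three 2-dimensional rows.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, read = soft_read(memory, key=[1.0, 0.0], sharpness=10.0)
```

Because every row contributes with a smooth weight instead of a hard index, the read is differentiable with respect to both the key and the memory contents, which is what makes gradient descent applicable.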

Semantic hashing

Approaches that represent previous experiences directly and use a similar experience to form a local model are often called nearest neighbour or k-nearest neighbors methods. Deep learning is useful in semantic hashing, where a deep graphical model models the word-count vectors obtained from a large set of documents. Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by accessing all the addresses that differ by only a few bits from the address of the query document. Unlike sparse distributed memory, which operates on 1000-bit addresses, semantic hashing works on the 32-bit or 64-bit addresses found in a conventional computer architecture.
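
The retrieval step, visiting every address within a few bit flips of the query address, can be sketched independently of how the addresses are learned. In the example below the 4-bit addresses and document names are made up; in semantic hashing they would come from the deep model.

```python
from itertools import combinations

def hamming_ball(address, n_bits, radius):
    """Yield every n_bits-bit address within the given Hamming distance."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = address
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped

def lookup(table, query_address, n_bits, radius):
    """Collect the documents stored at all addresses near the query address."""
    hits = []
    for address in hamming_ball(query_address, n_bits, radius):
        hits.extend(table.get(address, []))
    return hits

# Hypothetical table from address to document names.
table = {
    0b1010: ["doc_a"],
    0b1011: ["doc_b"],
    0b0101: ["doc_c"],
}
neighbours = lookup(table, 0b1010, n_bits=4, radius=1)
```

The cost depends on the number of addresses in the ball, not on the size of the collection, which is why short codes and small radii are used.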

Memory networks

Memory networks are another extension to neural networks incorporating long-term memory. The long-term memory can be read and written to, with the goal of using it for prediction. These models have been applied in the context of question answering (QA) where the long-term memory effectively acts as a (dynamic) knowledge base and the output is a textual response.

Pointer networks

Deep neural networks can be potentially improved by deepening and parameter reduction, while maintaining trainability. While training extremely deep (e.g., 1 million layers) neural networks might not be practical, CPU-like architectures such as pointer networks and neural random-access machines overcome this limitation by using external random-access memory and other components that typically belong to a computer architecture such as registers, ALU and pointers. Such systems operate on probability distribution vectors stored in memory cells and registers. Thus, the model is fully differentiable and trains end-to-end. The key characteristic of these models is that their depth, the size of their short-term memory, and the number of parameters can be altered independently -- unlike models like LSTM, whose number of parameters grows quadratically with memory size.

Encoder-decoder networks

Encoder-decoder frameworks are based on neural networks that map highly structured input to highly structured output. The approach arose in the context of machine translation, where the input and output are written sentences in two natural languages. In that work, an LSTM RNN or CNN was used as an encoder to summarize a source sentence, and the summary was decoded using a conditional RNN language model to produce the translation. These systems share building blocks: gated RNNs and CNNs and trained attention mechanisms.

Multilayer kernel machine

Multilayer kernel machines (MKM) are a way of learning highly nonlinear functions by iterative application of weakly nonlinear kernels. They use kernel principal component analysis (KPCA) as the method for the unsupervised greedy layer-wise pre-training step of the deep learning architecture.

Layer $l+1$ learns the representation of the previous layer $l$, extracting the $n_{l}$ principal components (PCs) of the projection layer $l$ output in the feature domain induced by the kernel. To reduce the dimensionality of the updated representation in each layer, a supervised strategy is proposed to select the most informative features among those extracted by KPCA. The process is:

rank the $n_{l}$ features according to their mutual information with the class labels;
for different values of K and $m_{l}\in \{1,\ldots ,n_{l}\}$ , compute the classification error rate of a K-nearest neighbor (K-NN) classifier using only the $m_{l}$ most informative features on a validation set;
the value of $m_{l}$ with which the classifier has reached the lowest error rate determines the number of features to retain.
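
The selection loop described in these three steps can be sketched as follows. The mutual-information scores and the validation error are passed in rather than computed, so `mi_scores` and the callable `validation_error(features, k)` are hypothetical stand-ins for the real KPCA features and K-NN classifier.

```python
def select_features(mi_scores, k_values, validation_error):
    """Pick the number of top-ranked features with the lowest validation error.

    mi_scores[i] is the mutual information of feature i with the class labels.
    validation_error(features, k) returns the K-NN error rate on a validation
    set when only the given feature indices are used.
    """
    # Step 1: rank features from most to least informative.
    ranked = sorted(range(len(mi_scores)),
                    key=lambda i: mi_scores[i], reverse=True)
    best = (None, None, float("inf"))
    # Step 2: try every K and every prefix of the ranking.
    for k in k_values:
        for m in range(1, len(ranked) + 1):
            err = validation_error(ranked[:m], k)
            if err < best[2]:
                best = (ranked[:m], k, err)
    # Step 3: the best prefix determines which features are retained.
    return best

# Toy stand-in for a validation run (hypothetical numbers).
def toy_error(features, k):
    return abs(len(features) - 2) * 0.1 + (0.05 if k == 1 else 0.0)

selected, best_k, best_err = select_features([0.1, 0.9, 0.5], [1, 3], toy_error)
```

Only the count $m_{l}$ is searched; the order of the features is fixed by the ranking, which keeps the search linear in $n_{l}$ for each value of K.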

Some drawbacks accompany the KPCA method as the building cells of an MKM.

A more straightforward way to use kernel machines for deep learning was developed for spoken language understanding. The main idea is to use a kernel machine to approximate a shallow neural net with an infinite number of hidden units, then use stacking to splice the output of the kernel machine and the raw input in building the next, higher level of the kernel machine. The number of levels in the deep convex network is a hyper-parameter of the overall system, to be determined by cross validation.

Applications

Automatic speech recognition

Speech recognition was revolutionised by deep learning, especially by long short-term memory RNNs. LSTM RNNs circumvent the vanishing gradient problem and can learn "Very Deep Learning" tasks that involve multi-second intervals containing speech events separated by thousands of discrete time steps, where one time step corresponds to about 10 ms. In 2003, LSTM with forget gates became competitive with traditional speech recognizers on certain tasks. In 2007, LSTM trained by CTC achieved excellent results on tasks such as discriminative keyword spotting. In 2015, Google's speech recognition almost doubled its performance through CTC-trained LSTM.

The initial success in speech recognition came on small-scale recognition tasks based on TIMIT, a data set commonly used for evaluations. The set contains 630 speakers from eight major dialects of American English, where each speaker reads 10 sentences. Its small size allows many configurations to be tried. More importantly, the TIMIT task concerns phone-sequence recognition, which, unlike word-sequence recognition, allows very weak "language models". This allows the weaknesses in acoustic modeling aspects of speech recognition to be more easily analyzed. Analysis on TIMIT by Li and collaborators around 2009-2010, contrasting the GMM (and other generative speech models) vs. DNN models, stimulated early industrial investment in deep learning for speech recognition, eventually leading to pervasive and dominant use in that industry. At the time of that analysis, discriminative DNNs and generative models performed comparably, differing by less than 1.5% in error rate. Error rates on TIMIT, including these early results and measured as percent phone error rate (PER), have been summarized over the past 20 years.

In 2010, researchers extended deep learning from TIMIT to large vocabulary speech recognition, by adopting large output layers of the DNN based on context-dependent HMM states constructed by decision trees.

The principle of elevating "raw" features over hand-crafted optimization was first explored successfully in the architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features in the late 1990s, showing its superiority over the Mel-Cepstral features that contain stages of fixed transformation from spectrograms. The raw features of speech, waveforms, later produced excellent larger-scale results.

The debut of DNNs for speaker recognition in the late 1990s and speech recognition around 2009-2011, and of LSTM around 2003-2007, accelerated progress in eight major areas:

Scale-up/out and accelerated DNN training and decoding
Sequence discriminative training
Feature processing by deep models with solid understanding of the underlying mechanisms
Adaptation of DNNs and related deep models
Multi-task and transfer learning by DNNs and related deep models
CNNs and how to design them to best exploit domain knowledge of speech
RNN and its rich LSTM variants
Other types of deep models including tensor-based models and integrated deep generative/discriminative models.

Large-scale automatic speech recognition is the first and most convincing successful case of deep learning. Between 2010 and 2014, the two major conferences on signal processing and speech recognition, IEEE-ICASSP and Interspeech, both saw a large increase in the number of accepted papers on deep learning for speech recognition. All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products) are based on deep learning.

Image recognition

A common evaluation set for image classification is the MNIST database. MNIST is composed of handwritten digits and includes 60,000 training examples and 10,000 test examples. As with TIMIT, its small size allows multiple configurations to be tested. A comprehensive list of results on this set is available. The current best result on MNIST is an error rate of 0.23%, achieved by Ciresan et al. in 2012.

According to LeCun, in the early 2000s, CNNs processed an estimated 10% to 20% of all the checks written in the US.

Significant additional impacts in image or object recognition were felt from 2011-2012. Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs for years, including CNNs, fast implementations of CNNs with max-pooling on GPUs in the style of Ciresan and colleagues were needed to progress on computer vision. In 2011, this approach achieved for the first time superhuman performance in a visual pattern recognition contest. Also in 2011, it won the ICDAR Chinese handwriting contest, and in May 2012, it won the ISBI image segmentation contest. Until 2011, CNNs did not play a major role at computer vision conferences, but in June 2012, a paper by Ciresan et al. at the leading conference CVPR showed how max-pooling CNNs on GPU can dramatically improve many vision benchmark records. In October 2012, a similar system by Krizhevsky and Hinton won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. In November 2012, Ciresan et al.'s system also won the ICPR contest on analysis of large medical images for cancer detection, and in the following year also the MICCAI Grand Challenge on the same topic. In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced, following a similar trend in large-scale speech recognition. The Wolfram Image Identification project publicized these improvements.

Image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs. Deep learning-trained vehicles now interpret 360° camera views. Another example is Facial Dysmorphology Novel Analysis (FDNA) used to analyze cases of human malformation connected to a large database of genetic syndromes.

Natural language processing

Neural networks have been used for implementing language models since the early 2000s. Recurrent neural networks, especially LSTM, are most appropriate for sequential data such as language. LSTM helped to improve machine translation and language modeling.

Other key techniques in this field are negative sampling and word embedding. Word embedding, such as word2vec, can be thought of as a representational layer in a deep learning architecture that transforms an atomic word into a positional representation of the word relative to other words in the dataset; the position is represented as a point in a vector space. Using word embedding as an RNN input layer allows the network to parse sentences and phrases using an effective compositional vector grammar. A compositional vector grammar can be thought of as probabilistic context free grammar (PCFG) implemented by an RNN. Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing. Deep neural architectures have achieved state-of-the-art results in natural language processing tasks such as constituency parsing, sentiment analysis, information retrieval, spoken language understanding, machine translation, contextual entity linking, writing style recognition and others.
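The idea of a word as a point in a vector space can be sketched in a few lines of Python. The three-dimensional vectors below are hand-picked, hypothetical values, not output from word2vec or any trained model; the helper names `cosine_similarity` and `nearest` are likewise assumptions made for this illustration. The sketch only shows how closeness in the space is commonly measured.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
# Real embeddings are learned from data and have many more dimensions.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


print(nearest("king"))  # queen
```

With these toy values, "king" lies much closer to "queen" than to "apple", which is the kind of relative position a trained embedding is meant to capture.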

Drug discovery and toxicology

A large percentage of candidate drugs fail to win regulatory approval. These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects. In 2012, a team led by Dahl won the "Merck Molecular Activity Challenge" using multi-task deep neural networks to predict the biomolecular target of one drug. In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the "Tox21 Data Challenge" of NIH, FDA and NCATS. Deep learning may outdo other virtual screening methods. Researchers enhanced deep learning for drug discovery by combining data from a variety of sources. In 2015, Atomwise introduced AtomNet, the first deep learning neural network for structure-based rational drug design. Subsequently, AtomNet was used to predict novel candidate biomolecules for several disease targets, most notably treatments for the Ebola virus and multiple sclerosis.

Customer relationship management

Deep reinforcement learning demonstrated a use in direct marketing settings, illustrating suitability for CRM automation. A neural network was used to approximate the value of possible direct marketing actions over the customer state space, defined in terms of RFM variables. The estimated value function was shown to have a natural interpretation as customer lifetime value.

Recommendation systems

Recommendation systems have used deep learning to extract meaningful features for a latent factor model for content-based music recommendations. Multiview deep learning has been applied for learning user preferences from multiple domains. The model uses a hybrid collaborative and content-based approach and enhances recommendations in multiple tasks.
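In a latent factor model, a user and an item are each described by a short vector, and the predicted preference is typically their dot product. The sketch below assumes the item vectors have already been produced by some feature extractor (in the music case, a network applied to audio); all values, track names and the helpers `predict_score` and `recommend` are hypothetical and serve only to show the scoring step.

```python
def predict_score(user_factors, item_factors):
    """Predicted preference: dot product of user and item factor vectors."""
    return sum(u * i for u, i in zip(user_factors, item_factors))


def recommend(user_factors, catalog, top_k=2):
    """Rank catalog items by predicted score and return the top_k names."""
    ranked = sorted(
        catalog,
        key=lambda name: predict_score(user_factors, catalog[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical factors; in practice the item vectors would come from a
# trained model rather than being written by hand.
user = [1.0, 0.0, 0.5]
catalog = {
    "track_a": [0.9, 0.1, 0.2],
    "track_b": [0.1, 0.9, 0.0],
    "track_c": [0.5, 0.2, 0.8],
}
print(recommend(user, catalog))  # ['track_a', 'track_c']
```

Because scoring needs only the item vector, a track with no listening history can still be ranked once its features have been extracted, which is the appeal of the content-based approach.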

Bioinformatics

An autoencoder ANN was used in bioinformatics to predict gene ontology annotations and gene-function relationships.

In medical informatics, deep learning was used to predict sleep quality based on data from wearables and to predict health complications from electronic health record data.

Relation to human development

Deep learning is closely related to a class of theories of brain development (specifically, neocortical development) proposed by cognitive neuroscientists in the early 1990s. An approachable summary of this work is Elman et al.'s 1996 Rethinking Innateness (see also: Shrager and Johnson; Quartz and Sejnowski). These developmental theories were instantiated in computational models, making them predecessors of purely computation-derived deep learning models. These developmental models share the property that various proposed learning dynamics in the brain (e.g., a wave of nerve growth factor) support a form of self-organization similar to that of the inter-related neural networks used in deep learning models. Such computational neural networks seem analogous to a view of the neocortex as a hierarchy of filters in which each layer captures some of the information in the operating environment and then passes the remainder, as well as a modified base signal, to other layers further up the hierarchy. This process yields a self-organizing stack of transducers, well-tuned to their operating environment. A 1995 description stated, "...the infant's brain seems to organize itself under the influence of waves of so-called trophic-factors ... different regions of the brain become connected sequentially, with one layer of tissue maturing before another and so on until the whole brain is mature."

Commercial activity

Many organizations have become interested in deep learning for particular applications. In 2013, Facebook hired Yann LeCun to head its new artificial intelligence (AI) lab. The lab was set up to help with tasks such as automatically tagging uploaded pictures with the names of the people in them. In 2014, Facebook hired Vladimir Vapnik, a main developer of the Vapnik-Chervonenkis theory of statistical learning and co-inventor of the support vector machine method.

In 2014, Google bought DeepMind Technologies, a British start-up that developed a system capable of learning how to play Atari video games using only pixels as data input. In 2015, DeepMind demonstrated its AlphaGo system, which achieved one of the long-standing "grand challenges" of AI by learning the game of Go well enough to beat a professional Go player.

In 2015, Blippar demonstrated a mobile augmented reality application that uses deep learning to recognize objects in real time.

Criticism and comment

Deep learning has attracted both criticism and comment, in some cases from outside the field of computer science.

A main criticism concerns the lack of theory surrounding the methods. Learning in the most common deep architectures is implemented using well-understood gradient descent. However, the theory surrounding other algorithms, such as contrastive divergence, is less clear (e.g., does it converge? If so, how fast? What is it approximating?). Deep learning methods are often looked at as a black box, with most confirmations done empirically rather than theoretically.
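For readers unfamiliar with gradient descent, the following is a minimal sketch on a one-parameter problem, not a description of how any particular deep network is trained. The function `gradient_descent`, the learning rate and the quadratic objective f(w) = (w - 3)^2 are all assumptions chosen so that the behavior is easy to check by hand.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_min, 6))  # 3.0
```

On this convex toy problem each step shrinks the distance to the minimum by a constant factor, so convergence is guaranteed. The objectives of deep networks are non-convex, which is part of why much of the evidence for these methods remains empirical.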

Others point out that deep learning should be looked at as a step towards realizing strong AI, not as an all-encompassing solution. Despite the power of deep learning methods, they still lack much of the functionality needed for realizing this goal entirely. Research psychologist Gary Marcus noted:

"Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. The most powerful A.I. systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning."

Alternatively, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between "old master" and amateur figure drawings; while another hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy. Another author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity.

In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks, which attempted to discern within essentially random data the images on which they were trained, demonstrated visual appeal. The original research notice received well over 1,000 comments and was for a time the most frequently accessed article on The Guardian's web site.

Some deep learning architectures display problematic behaviors, such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images and misclassifying minuscule perturbations of correctly classified images. Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component AGI architectures. These issues may be addressed by deep learning architectures that internally form states homologous to image-grammar decompositions of observed entities and events. Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition and AI.
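How a minuscule perturbation can flip a classification is easiest to see on a linear classifier, used here as a stand-in for a deep network. The sketch below nudges every input feature by a small amount in the direction that lowers the score, in the spirit of sign-based attacks such as the fast gradient sign method; the text above does not say which attack was used, so this choice, along with the weights, inputs and helper names, is an assumption made for illustration.

```python
def score(weights, bias, x):
    """Linear score; a positive value means class 1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def sign(v):
    return (v > 0) - (v < 0)


def perturb_down(weights, x, epsilon):
    """Shift each feature by at most epsilon so as to lower the score.
    For a linear model the gradient of the score w.r.t. x is `weights`."""
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical 100-dimensional classifier and input.
weights = [0.5, -0.5] * 50
bias = 0.0
x = [0.6, 0.5] * 50

x_adv = perturb_down(weights, x, epsilon=0.05)
print(classify(weights, bias, x), classify(weights, bias, x_adv))  # 1 0
```

No single feature moves by more than 0.05, yet the small shifts add up across all 100 dimensions and push the score across the decision boundary. High-dimensional inputs such as images offer far more dimensions over which such changes can accumulate.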

Software libraries

Deeplearning4j -- An open-source deep-learning library written for Java/C++ with LSTMs and convolutional networks. It provides parallelization with Spark on CPUs and GPUs.
Gensim -- A toolkit for natural language processing implemented in the Python programming language.
Keras -- An open-source deep learning framework for the Python programming language.
Microsoft CNTK (Computational Network Toolkit) -- Microsoft's open-source deep-learning toolkit for Windows and Linux. It provides parallelization with CPUs and GPUs across multiple servers.
MXNet -- An open source deep learning framework that allows you to define, train, and deploy deep neural networks.
OpenNN -- An open source C++ library which implements deep neural networks and provides parallelization with CPUs.
PaddlePaddle -- An open source C++/CUDA library with a Python API for scalable deep learning on CPUs and GPUs, originally developed by Baidu.
TensorFlow -- Google's open source machine learning library in C++ and Python with APIs for both. It provides parallelization with CPUs and GPUs.
Theano -- An open source machine learning library for Python supported by the University of Montreal and Yoshua Bengio's team.
Torch -- An open source software library for machine learning based on the Lua programming language and used by Facebook.
Caffe -- A deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
DIANNE -- A modular open-source deep learning framework in Java/OSGi developed at Ghent University, Belgium. It provides parallelization with CPUs and GPUs across multiple servers.

Source of the article: Wikipedia