I'm at my first NIPS conference this year in Montreal, the annual pow-wow for machine learning, and deep learning in particular.
Monday, the first day, had several multi-hour in-depth tutorials taught by the very folks who wrote the textbooks in these areas. I attended sessions on scaling deep learning via TensorFlow, presented by Google folks like Jeff Dean; a deep dive into probabilistic programming (being able to describe a statistical system and let an inference engine do the hard work of building a model from it); and an introduction to reinforcement learning (using a scalar reward signal to automatically discover the optimal policy for a behavior).
These are my notes from the tutorials; I've used my stretchtext library to make them easier to work with: simply click one to open it up and see my in-depth notes.
Started in the perceptual domain in 2011 and is moving towards more tasks, in particular language translation
Taking research systems into production systems
Can be used for more than just neural nets
Systems that are easy to express but also easy to scale
Universal machine learning: same basic learning algorithms work across many domains
DistBelief was good for production use but not for research - not flexible enough for things like reinforcement learning or some funky LSTM architecture, for example
Motivation: train large models with lots of data fast
General trend is towards more complex models: embedding of various kinds, attention, memory, generative models
Lots of frames from YouTube with autoencoder to recreate frames - 1000 machines on 10 million images, system automatically learned very high level features
Speech: feed forward acoustic models, taking acoustic data and matching it to speech
Speech models are getting more complicated, using deep LSTM layers and convolutional layers
Trend: LSTMs end to end!
Makes system simpler to express and perform better as there aren't arbitrary layers, end to end model can discover correct layering
Inception-v3 from this month - new drop in error rate
Inception had many fewer parameters because it removed the fully connected layer
Trend: in distributed learning you are shipping parameters around, so you want to have fewer model parameters
Batch normalization gave a nice drop in error rates
Models with small number of parameters fit easily in a mobile app (8-bit fixed point)
Just released new state of the art model for use by others
Runs on mobile
What do you want in a research system? Ease of expression, scalability, portability, reproducibility, production readiness
Updates for the distributed TensorFlow implementation coming soon
TensorFlow architecture: core in C++
Graph execution engine, supports wide variety of devices and that core is extensible
Runs on Linux and Android; an internal version runs on iOS but isn't released yet
Different front ends for specifying graphs and computations
Python and C++ today
Going to ship a distributed system that uses Kubernetes and gRPC for the distributed bits
Whenever an edge crosses a device boundary, graph gets rewritten to have send and receive nodes to receive values when ready
Send and receive node abstractions work under different scenarios: multiple gpus, cross machine rpc, gpus on different machines, etc.
All complexity of transferring data captured by send and receive nodes
Core system has standard operations (add, jpg decoding, etc) and kernels (device specific implementations of operations for speed ups)
Easy to define new operations and kernels
Session interface: the client describes a graph using a series of extend calls that add nodes to the graph, then calls run to execute a subgraph, optionally feeding in tensor inputs and getting tensor outputs; useful for debugging
Typically, you set up a graph with one or a few extend calls and then run it thousands or millions of times. Generally the system can work under the covers at extend time to be more efficient at run time.
Single process configuration
Being able to execute arbitrary input and output tensors to run makes debugging much easier
Can be used to have multiple graphs together for different uses
Example: power method for eigenvectors
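To make the graph-and-session flow concrete, here's a minimal sketch of roughly what a power-method graph could look like in the early TensorFlow Python API. This is my own toy version, not the presenters' code, and the initializer call changed names across early releases.

```python
import tensorflow as tf

# A small symmetric matrix whose dominant eigenvector we want (toy data).
A = tf.constant([[2.0, 1.0],
                 [1.0, 3.0]])

# Current estimate of the eigenvector, held in a mutable Variable.
v = tf.Variable([[1.0], [1.0]])

# One power-method step: multiply, then renormalize.
Av = tf.matmul(A, v)
norm = tf.sqrt(tf.reduce_sum(tf.square(Av)))
step = tf.assign(v, Av / norm)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())  # global_variables_initializer() in later versions
    for _ in range(50):
        sess.run(step)    # run the subgraph that updates v
    print(sess.run(v))    # converges to the dominant eigenvector
```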
TensorBoard - tool for showing graphs
Example: symbolic differentiation
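The symbolic differentiation piece is exposed through tf.gradients, which extends the graph with nodes that compute derivatives. A tiny sketch, again my own example rather than the one from the talk:

```python
import tensorflow as tf

x = tf.Variable(3.0)
y = x * x + 2.0 * x            # y = x^2 + 2x

# Ask TensorFlow to extend the graph with nodes that compute dy/dx symbolically.
grad = tf.gradients(y, [x])[0]

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(grad))      # 2*x + 2 = 8.0 at x = 3.0
```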
SummaryWriter - used to write out graph for debugging
Placeholder - a variable that changes on each run of the graph (i.e. it's the data you want to feed in)
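A minimal sketch of the placeholder/feed pattern (the names and shapes here are just made up for illustration):

```python
import tensorflow as tf

# Placeholder: a graph input whose value is supplied on every run() call.
x = tf.placeholder(tf.float32, shape=[None, 2])
doubled = x * 2.0

with tf.Session() as sess:
    # The feed_dict supplies the actual data for this particular run.
    print(sess.run(doubled, feed_dict={x: [[1.0, 2.0], [3.0, 4.0]]}))
```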
Initial results on single device not great
New code they are working on:
A future step is updating to the latest cuDNN libraries; it won't make it into the 0.2 release
Single device important but most focused on distributed performance
Distributed bits give a greater speed up than single device
If you can turn around your research in minutes or hours then that's very powerful
How do you do this at scale?
Best way to decrease training time: decrease the step time
Many models have inherent parallelism; the problem is distributing the work so communication doesn't kill you - local connectivity (CNNs), towers with little or no connectivity, specialized parts of the model active only for some examples
Fully connected layers very expensive in terms of this
On a single core: SIMD to get instruction parallelism
Across cores: thread parallelism
Across devices: for gpus, often limited by PCIe bandwidth
Across machines: limited by network bandwidth and latency
Model parallelism can reduce step time, most effective way to speed up training time
Data parallelism - can work in conjunction with model parallelism
Use multiple model replicas to process different examples at the same time, 10 to 40x speed up
Ten to a thousand replicas work fine in asynchronous regime
Want models to not have too many parameters because of network time
Mini batches of size B reuse parameters B times
Convolutional models tend to reuse hundreds or thousands of times per example
Recurrent models tend to reuse tens to hundreds of times (once per time step, T times) due to unrolling during training
Want model computation time to be large relative to time to send and receive parameters over network
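A back-of-envelope way to think about this; all the numbers below are hypothetical, purely to show the ratio that matters:

```python
# Hypothetical numbers, just to illustrate the compute-vs-communication ratio.
params        = 50e6      # model parameters
param_bytes   = 4         # float32
bandwidth     = 1.25e9    # bytes/sec (~10 Gbit/s link)
device_flops  = 5e12      # sustained FLOP/s on one accelerator
batch_size    = 128       # a mini-batch reuses every parameter ~B times
flops_per_use = 2         # one multiply-add each time a parameter is used

transfer_time = 2 * params * param_bytes / bandwidth             # send + receive
compute_time  = params * flops_per_use * batch_size / device_flops

print("transfer: %.3fs, compute: %.3fs" % (transfer_time, compute_time))
# Data parallelism only pays off when compute_time is large relative to transfer_time,
# which is why models that reuse each parameter hundreds or thousands of times per
# example fare better than parameter-heavy fully connected layers.
```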
Data parallelism is really important for many of Google's problems (very large datasets, large models): RankBrain uses 500 replicas, ImageNet Inception training uses 50 GPUs for a 40x speed up
Smart Reply uses 16 replicas, each with multiple GPUs
Trend: lots of LSTMs and creativity in neural network architectural approaches
Keeping things in memory for a long time was crucial
Would be very hard to deal with these without being able to translate equations into graph and symbolic differentiation
Trend: lots of people doing sequence to sequence paradigm problems
Deep LSTMs are very important
To go distributed:
TensorFlow queues: put data in a queue; when it's full, launch the forward/backward pass on however much the queue holds
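Roughly what the queue mechanism looks like in the Python API; a toy sketch of my own, where presumably a real input pipeline would run the enqueue side in separate threads:

```python
import tensorflow as tf

# A queue that prefetches input examples so the training loop never waits on I/O.
queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[2]])

example = tf.placeholder(tf.float32, shape=[2])
enqueue_op = queue.enqueue([example])
batch = queue.dequeue_many(4)        # pull a mini-batch of 4 examples

with tf.Session() as sess:
    for i in range(8):
        sess.run(enqueue_op, feed_dict={example: [float(i), float(i)]})
    print(sess.run(batch))           # the first 4 queued examples
```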
To make it asynchronous data parallelism:
Network optimizations: neural nets are very tolerant of reduced precision, so drop precision to 16 bits across the network; send and receive nodes can do this transparently - Google simply chops off 16 bits of the mantissa
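My understanding of the mantissa-chopping trick, sketched in NumPy (this is my own reconstruction, not Google's actual code):

```python
import numpy as np

def truncate_to_16_bits(x):
    """Keep the sign, exponent, and top mantissa bits of each float32,
    zeroing the low 16 bits before shipping values over the network."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

weights = np.random.randn(5).astype(np.float32)
print(weights)
print(truncate_to_16_bits(weights))   # close to the originals, with reduced precision
```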
Quantization for inference: you need even less precision for inference; 8-bit fixed point works well - figure out the range of the weights then scale to 0-255. With quantization it can run at 6 fps on a smartphone
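And a rough sketch of the 8-bit fixed point idea, scaling the observed weight range onto 0-255 (again a toy version of my own):

```python
import numpy as np

def quantize_8bit(w):
    """Map float weights into 0..255 by scaling over their observed range."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 255.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_8bit(q, lo, scale):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(1000).astype(np.float32)
q, lo, scale = quantize_8bit(w)
print(np.abs(w - dequantize_8bit(q, lo, scale)).max())  # small reconstruction error
```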
Multi-GPU on a single machine is in the open source release
Distributed implementation coming soon (issue 23 on the GitHub repo)
What is probabilistic programming?
What parameters could have given rise to the observation, using inference
The model is some programming language construct
Probabilistic program: usual functional or imperative programs with two added constructs: the ability to draw values at random from distributions and the ability to condition values of variables in a program via observations
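The tutorial's examples were in Anglican, but the two constructs are easy to mimic in plain Python with brute-force rejection sampling. A toy sketch of inferring a coin's bias after observing 9 heads in 10 flips, entirely my own illustration:

```python
import random

def flip(p):
    return random.random() < p        # 'sample': draw a value at random

def model():
    # Latent quantity we want to infer: the bias of a coin.
    bias = random.uniform(0.0, 1.0)
    # 'observe': condition on seeing 9 heads out of 10 flips.
    heads = sum(flip(bias) for _ in range(10))
    if heads != 9:
        return None                   # rejection sampling: discard inconsistent runs
    return bias

# Crude inference: run the program many times, keep runs consistent with the data.
samples = [b for b in (model() for _ in range(100000)) if b is not None]
print(sum(samples) / len(samples))    # posterior mean of the bias, roughly 0.83
```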
Goals of the field: increase programmer productivity - decrease effort to make new models
Commodify inference - build models and simulators above inference engines
New kinds of models
What we want from a language: unrestricted, unfettered access to existing libraries, easily extensible, open universe, mixed variable types. This comes at a cost: inference is harder, and there are more ways to shoot yourself in the foot
Inverse graphics - given some image, make a scene description
Once you have source code and a simulation environment you can take it in new directions - add new scene lighting or poses to an inferred model of a face, for example
Given some behavior, what cognitive process gave rise to it?
Given some program output, what source code would give rise to it?
Given some constraint, what simulation would give rise to it?
Universal probabilistic programming modeling languages - Anglican, Church, Venture, WebPPL
Will use Anglican in examples
Cool demo showing a probabilistic program sampling to find "bumpers" that can bounce balls into a bin
The talk moved on to building the inference engine - I left early as I was just interested in the programming languages themselves
Agent oriented learning
Learn by interacting with environment
Learn by trial and error, only weak delayed evaluative reward
Environment: Unknown, nonlinear, stochastic, and complex
Challenges of RL: reward, delayed consequences, need for trial and error - explore and exploit, non-stationary, fleeting nature of time and online data
Example: TD-Gammon from Tesauro, 1992-1995
Tension between exploring and exploiting
Best way to make a learning algorithm is to pretend you are in the chair and see what you would do
Want to get a lot of reward in the long run
Finite Markov decision process (MDP)
Keep things finite for now to prevent the math from getting too complex with infinite versions
The number of deterministic policies is exponential in the number of states - very challenging to search
The trick - action-value functions
Arg max - the identity of the argument that maximizes a function, rather than the maximum value itself (the person with the maximum height vs the max height of people in the room)
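In Python terms (made-up heights):

```python
heights = {"Avery": 1.65, "Blake": 1.80, "Casey": 1.72}

print(max(heights.values()))           # max: the tallest height, 1.8
print(max(heights, key=heights.get))   # arg max: who is tallest, 'Blake'
```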
Exact solution methods
The array Q will end up holding the solution; start with random values
Off-policy method - learning the right thing to do but not actually behaving that way during training; the target policy is the optimal policy. When you have two policies like this it's called off-policy learning
One policy is better than another if for all state action pairs it is better
The dance is very robust - to initial conditions, asynchronous updates, incomplete evaluation, randomization and noise
In particular, works if only a single state is updated at a time by a random amount
The explore/exploit dilemma - to find the best action you need to explore them all, but if you explore all the time you never exploit the policy! Need to find the right balance; still a bit of an open problem in the field
How did q learning escape the dilemma? Choose actions in any way you want, such that all actions are taken in all states
Q learning is a temporal difference learning algorithm
Q learning is off policy learning
Learns the target policy from data gathered while behaving in a more exploratory manner
Target policy - policy being learned
Behavior policy - the policy used to generate behavior (and gather data) during learning
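A tabular Q-learning sketch in Python, showing the off-policy split: the behavior policy acts at random while the update learns about the greedy target policy. The environment interface here (reset/step) is hypothetical, not from the tutorial:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular Q-learning. `env` is a hypothetical environment with
    reset() -> state and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)              # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: anything that keeps trying all actions (here: random).
            action = random.choice(actions)
            next_state, reward, done = env.step(action)
            # Target policy is greedy: bootstrap from the best next action's value.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```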
Approximate solution methods
Real world too complex and large for tables, too many state action pairs
Will RL work with approximations and function approximations?
Bug in slide - should be q hat on left
Does q learning work with function approximation? Yes it often works well, but there are counter examples, simple examples where the params diverge to infinity
Something is not right, there is something missing from our current understanding
Near greedy - epsilon greedy
Why is it called Sarsa? State, Action, Reward, next State and Action
Epsilon greedy policy - stochastic policy that is usually greedy, but with small probability epsilon selects an action at random
Hacky way to get exploration
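A quick sketch of epsilon-greedy selection and the Sarsa update, my own illustration; Q is assumed to be a dict keyed by (state, action) as in the sketch above:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Usually greedy, but with probability epsilon pick an action at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """Sarsa backs up the value of the action actually taken next (S, A, R, S', A'),
    so it evaluates the epsilon-greedy policy it is following -- on-policy."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
```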
On policy methods have much better convergence properties
On policy methods perform better than off policy methods but find poorer policies
Keepaway soccer demo: three players trying to keep the ball away from two others
Problem of instability - nothing to do with learning or sampling, nothing to do with exploitation, greedification, or control, nothing to do with complex nonlinear approximators
Deadly triad - risk of divergence increases if we combine three things: function approximation, bootstrapping, off policy learning
The nice thing about off-policy learning: you can learn about many things at once, in parallel
Can we do without bootstrapping? Critical to computational and data efficiency, introduces bias though
Can add in term that controls degree of bootstrapping
We want to keep bootstrapping however
Other ways to survive the deadly triad: use high lambda and good features (features that are more Markov); recent results suggest that replay and more stable targets may help, but it's too soon to know (Double Q-learning, van Hasselt 2010); or use least squares methods like off-policy LSTD(lambda), though they're expensive in computation
Do real gradient descent (gradient-td and proximal-gradient-td) or emphatic-td methods
The second day of NIPS didn't have as many sessions that I was interested in, as it was mostly focused on optimization theory. The only standout was Song Han's pair of papers on compressing neural networks, showing impressive results.