Coding in Paradise

About Brad Neuberg

NIPS Day 1

Tutorials on Scaling Deep Learning, Probabilistic Programming, and Reinforcement Learning

I'm at my first annual NIPS conference this year in Montreal, the annual pow-wow for machine learning and deep learning in particular.

Monday, the first day, had several multi-hour in-depth tutorials from literally the folks that wrote the textbooks in these areas. I attended sessions on scaling deep learning via TensorFlow, presented by Google folks like Jeff Dean; a deep dive into probabilistic programming (being able to describe a statistical system and allow an inference engine to do the hard work of building a model from it); and an introduction to reinforcement learning (using a scalar reward signal to automatically discover the optimal policy for a behavior).

These are my notes from the tutorials; I've used my stretchtext library to make them easier to work with, simply click one to open it up and see my in-depth notes.

Large Scale Distributed Systems for Training Neural Networks Tutorial

Jeff Dean and Oriol Vinyals

Started in perceptual domain in 2011 and moving towards more tasks, in particular language translation

Taking research systems into production systems

Can be used for more than just neural nets

Systems that are easy to express but also easy to scale

Universal machine learning: same basic learning algorithms work across many domains

Distbelief good for production use but not for research use - not flexible for things like reinforcement learning or some funky LSTM architecture, for example

Motivation: train large models with lots of data fast

General trend is towards more complex models: embedding of various kinds, attention, memory, generative models

Lots of frames from YouTube with autoencoder to recreate frames - 1000 machines on 10 million images, system automatically learned very high level features

Speech: feed forward acoustic models, taking acoustic data and matching it to speech

Speech models are getting more complicated, using deep LSTM layers and convolutional layers

Trend: LSTMs end to end!

Makes system simpler to express and perform better as there aren't arbitrary layers, end to end model can discover correct layering

Inception-v3 from this month - new drop in error rate

Inception had many fewer parameters because it removed the fully connected layer

Trend: in distributed learning you are shipping parameters around, so you want to have less model parameters

Batch normalization gave a nice drop in error rates

Models with small number of parameters fit easily in a mobile app (8-bit fixed point)

Just released new state of the art model for use by others

Runs on mobile

What do you want in a research system? Ease of expression, scalability, portability, reproducibility, production readiness

Updates for distributed tensorflow implementation coming soon

Tensorflow architecture: core on c++

Graph execution engine, supports wide variety of devices and that core is extensible

Runs on Linux and android, internal version running on iOS not yet released

Different front ends for specifying graphs and computations

Python and c++ today

Going to ship distributed system that used kubernetes and grpc for distributed bits

Whenever an edge crosses a device boundary, graph gets rewritten to have send and receive nodes to receive values when ready

Send and receive node abstractions work under different scenarios: multiple gpus, cross machine rpc, gpus on different machines, etc.

All complexity of transferring data captured by send and receive nodes

Core system has standard operations (add, jpg decoding, etc) and kernels (device specific implementations of operations for speed ups)

Easy to define new operations and kernels

Session interface: client describes graph using series of extend calls that add nodes to graph, then run to execute sub graph, optionally feeding in tensor inputs and getting tensor outputs, useful for debugging

Typically, setup a graph with one or a few extend calls and then run it thousands or millions of times. Generally system can work under the covers at extend time to be more efficient at run time.

Single process configuration

Being able to execute arbitrary input and output tensors to run makes debugging much easier

Can be used to have multiple graphs together for different uses

Example: power method for eigenvectors

Tensor board - tool for showing graphs

Example: symbolic differentiation

SummaryWriter - used to write out graph for debugging

Placeholder - a variable that will change on each input of the graph (i.e. It's the data you want to input)

Initial results on single device not great

New code they are working on:

Future step is update to latest CuDNN libraries, won't make it into 0.2 release

Single device important but most focused on distributed performance

Distributed bits give a greater speed up than single device

If you can turn around your research in minutes or hours than very powerful

How do you do this at scale?

Best way to decrease training time: decrease the step time

Many models of inherent parallelism, problem is distributing work so communication doesn't kill you - local connectivity (cnns), towers with little or no connectivity, specialized parts of model active only for some examples

Fully connected layers very expensive in terms of this

On a single core: SIMD to get instruction parallelism

Across cores: thread parallelism

Across devices: for gpus, often limited by PCIe bandwidth

Across machines: limited by network bandwidth and latency

Model parallelism can reduce step time, most effective way to speed up training time

Data parallelism - can work in conjunction

Use multiple model replicas to process different examples at the same time, 10 through 40x speed up

Ten to a thousand replicas work fine in asynchronous regime

Want models to not have too many parameters because of network time

Mini batches of size B reuse parameters B times

Convolutional models tend to reuse hundreds or thousands of times per example

Recurrent models tend to reuse tens to hundreds of times T time steps due to unrolling during training

Want model computation time to be large relative to time to send and receive parameters over network

Data parallelism is really imports for many of googled problems (very large datasets, large models): rank brain uses 500 replicas, Imagenet inception training uses 50 gpus, 40x speed up

Smart reply uses 16 replicas, each with multiple gpus

Trend: lots of LSTMs and creativity in neural network architectural approaches

Keeping things in memory for a long time was crucial

Would be very hard to deal with these without being able to translate equations into graph and symbolic differentiation

Trend: lots of people doing sequence to sequence paradigm problems

Deep LSTMs are very important

To go distributed:

Tensorflow queues: put data in a queue, when it's full launch forward backward for however depth the queue has

To make it asynchronous data parallelism:

Network optimizations: neural nets are very tolerant of reduced precision, drop precision to 16 bits across network, send and receive nodes can do this transparently, Google simply chops off 16 bits of mantissa

Quantization for inference: need even less precision for inference, 8 bit fixed point works well, figure out range of weights then scale to 0 to 255, with quantization can run at 6 fps on smart phone

Multi GPU in single machine in open source release

Distributed implementation coming soon issue 23 on github repo

Probabilistic Programming Tutorial

Frank Wood

What is probabilistic programming?

What parameters could have given rise to the observation, using inference

The model is some programming language construct

Probabilistic program: usual functional or imperative programs with two added constructs: the ability to draw values at random from distributions and the ability to condition values of variables in a program via observations

Goals of the field: increase programmer productivity - decrease effort to make new models

Commodify inference - build models and simulators above inference engines

New kinds of models

what we want from a language - unrestricted, unfettered access to existing libraries, easily extensible, open universe, mixed variable types. Comes at a cost: inference is harder, more ways to shoot yourself in foot

Inverse graphics - given some image, make a scene description

Once you have source code and a simulation environment you can take it in new directions - add new scene lighting or poses to an inferred model of a face, for example

Given some behavior, what cognitive process gave rise to it?

Given some program output, what source code would give rise to it?

Given some constraint, what simulation would give rise to it?

Universal probabilistic programming modeling language - Anglican, church, venture, webppl

Will use Anglican in examples

Cool demo showing programming sample to find "bumpers" that can bounce balls into a bin

Talk moved on to building the inference engine - left earlier as just interested in programming languages themselves

Introduction to Reinforcement Learning with Function Approximation Tutorial

Rich Sutton

Agent oriented learning

Learn by interacting with environment

Learn by trial and error, only weak delayed evaluative reward

Environment: Unknown, nonlinear, stochastic, and complex

Challenges of RL: reward, delayed consequences, need for trial and error - explore and exploit, non-stationary, fleeting nature of time and online data

Example: TD-Gammon from Tesauro, 1992-1995

Tension between exploring and exploiting

Best way to make a learning algorithm is to pretend you are in the chair and see what you would do

Want to get a lot of reward in the long run

Finite markoff decision process - mdp

Keep things finite for now to prevent math from getting too complex with infinite versions for now

The number of deterministic policies is exponential in the number of states - very challenging to search

The trick - action-value functions

Arg max - the identity of the function that maximizes something, rather then it's arguments (max height of people in room vs person with maximum height)

Exact solution methods

Array Q will end up having solution, start with random values

Off policy method - learning the right thing to do but not using it during training time, target policy is optimal policy, when have two policies called off policy learning

One policy is better than another if for all state action pairs it is better

The dance is very robust - to initial conditions, asynchronous updates, incomplete evaluation, randomization and noise

In particular, works if only a single state is updated at a time by a random amount

The explore exploit dilemma - to find best action need to explore them all, but if you do you will never exploit policy! Need to find right balance, still a bit of an open problem in the field

How did q learning escape the dilemma? Choose actions in any way you want, such that all actions are taken in all states

Q learning is a temporal difference learning algorithm

Q learning is off policy learning

Explores from data while behaving in a more exploratory manner

Target policy - policy being learned

Behavior policy - policy to learn things

Approximate solution methods

Real world too complex and large for tables, too many state action pairs

Will RL work with approximations and function approximations?

Bug in slide - should be q hat on left

Does q learning work with function approximation? Yes it often works well, but there are counter examples, simple examples where the params diverge to infinity

Something is not right, there is something missing from our current understanding

Near greedy - epsilon greedy

Why called sarsa? State action reward next state and action

Epsilon greedy policy - stochastic policy that is usually greedy, but with small probability epsilon selects an action at random

Hacky way to get exploration

On policy methods have much better convergence properties

On policy methods perform better than off policy methods but find poorer policies

Demo with soccer ball of three players trying to keep it from 2 players

Problem of instability - nothing to do with learning or sampling, nothing to do with exploitation, greedification, or control, nothing to do with complex nonlinear approximators

Deadly triad - risk of divergence increases if we combine three things: function approximation, bootstrapping, off policy learning

Nice thing about off policy learning can learn about many things at once in parallel

Can we do without bootstrapping? Critical to computational and data efficiency, introduces bias though

Can add in term that controls degree of bootstrapping

We want to keep bootstrapping however

Other ways to survive deadly triad - use high lambda and good features, features that are more markov, recent results suggest that using replay and more stable targets may help but too soon to know (double q learning by van hasselt 2010), use least squares methods like off policy LSTD lambda, expensive in computation though

Do real gradient descent (gradient-td and proximal-gradient-td) or emphatic-td methods

The second day of NIPS didn't have as many sessions I was interested in, as it was mostly focused on optimization theory. The only stand out was Song Han's pair of papers on compressing neural networks, showing impressive results: 

Subscribe to my RSS feed and follow me on Twitter to stay up to date on new posts.

Please note that this is my personal blog — the views expressed on these pages are mine alone and not those of my employer.

Back to