NIPS Day 3 Morning Sessions

These are my notes from the morning sessions of day 3 of NIPS.

Note that I took these notes on an iPad, which means that there might be spelling mistakes. Also, the screenshots are not of every slide, only the ones I found interesting or which had mathematical notation I wanted to capture.

The notes for the different slides are collapsed below using my stretchtext.js library; simply click a section to open and close it to read more about that specific area.

Deep Visual Analogy-Making

Link to paper

What they do differently is train a CNN end to end to learn analogical features

Seems like a hard problem

Create a function for each input image, and subtract them

Additive encoder vs. multiplicative encoder vs. deep encoder

The deep model does best compared to the additive or multiplicative encoders
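The additive variant of this analogy arithmetic is just embed, subtract, add, decode. A minimal numpy sketch of that idea, with a toy invertible linear map standing in for the paper's learned CNN encoder and decoder (the real model trains both end to end):

```python
import numpy as np

# Hypothetical encoder f and decoder g standing in for the learned CNNs.
# Here they are a fixed linear map and its inverse, purely to show the
# additive analogy arithmetic: a : b :: c : d  =>  d = g(f(b) - f(a) + f(c)).
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 8))
W_dec = np.linalg.inv(W_enc)

def f(x):           # encoder: image -> embedding
    return W_enc @ x

def g(z):           # decoder: embedding -> image
    return W_dec @ z

a, b, c = rng.normal(size=(3, 8))

# The transformation from a to b, applied by analogy to c.
d_pred = g(f(b) - f(a) + f(c))
```

With a linear encoder this collapses to b - a + c; the point of the deep encoder in the talk is that the learned, nonlinear version captures transformations simple vector arithmetic cannot.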

One network that can do multiple transformations

Pretty cool results showing cartoon character being animated from transformation vector, even at different angles

Form training tuples of animations where the training samples move forwards and backwards in time

Extrapolating animations by analogy

Then, apply these to query animation to unseen character

Disentangling car pose and appearance

End-to-End Memory Networks

Link to paper

Motivation - RNNs handle temporal data and CNNs spatial data, but both struggle with out-of-order access, long-term dependencies, and unordered sets

Example: question answering, where the info needed might be out of order

Long term dependencies

Propose NN model with external memory

Reads from memory with soft attention, performs multiple lookups on memory, can be trained with backprop

New model uses soft attention and only needs supervision on final output

Dot product between memory vector and controller vector

Then take a softmax over these scores, which can be considered attention
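The dot-product-then-softmax read described above is easy to sketch in numpy. This is a toy version with generic shapes (not the paper's exact dimensions or embedding matrices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 5, 4
memory = rng.normal(size=(N, d))   # N memory vectors of dimension d
controller = rng.normal(size=d)    # controller (query) state

# Dot product between each memory vector and the controller vector,
# then a softmax over the scores -- the soft attention weights.
scores = memory @ controller
attn = softmax(scores)

# The read result is the attention-weighted sum of the memory vectors.
read = attn @ memory
```

Because every step is differentiable, the whole lookup can be trained with backprop, which is exactly why only supervision on the final output is needed.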

Main difference with prior work is soft vs hard attention and where supervision is needed

Better to replace bag of words with position encoding, inject random noise into the time embedding, and use joint training
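The position-encoding trick, as I recall it from the paper (this formula is my recollection, not copied from the slides): each word embedding in a sentence of J words and d dimensions is scaled elementwise by l[j, k] = (1 - j/J) - (k/d)(1 - 2j/J) before summing, so word order affects the sentence representation, unlike a plain bag of words.

```python
import numpy as np

def position_encoding(J, d):
    # l[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J), with j, k 1-indexed.
    j = np.arange(1, J + 1)[:, None]   # word positions, shape (J, 1)
    k = np.arange(1, d + 1)[None, :]   # embedding dims, shape (1, d)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

def encode_sentence(word_embeddings):
    # word_embeddings: shape (J, d), one row per word.
    J, d = word_embeddings.shape
    return (position_encoding(J, d) * word_embeddings).sum(axis=0)
```

A plain bag-of-words sum gives the same vector for a sentence and its reversal; the position-encoded sum does not.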

Exploring future work on writing and playing games

Spatial Transformer Networks

Link to paper

Allows networks to have spatial invariance

Defines a new differentiable spatial transformer module

Distorted MNIST digits; the right side is the output

Three main components
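As I recall, the three components are a localization network that predicts the transform parameters theta, a grid generator, and a sampler. A toy numpy sketch of the last two (the localization net is omitted and theta is given by hand; nearest-neighbor sampling stands in for the paper's differentiable bilinear sampler):

```python
import numpy as np

def affine_grid(theta, H, W):
    # Normalized target coordinates in [-1, 1], mapped through the
    # 2x3 affine transform theta to get source sampling coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    return coords @ theta.T                                 # (H, W, 2)

def sample(image, grid):
    # Map normalized coordinates back to pixel indices and clamp.
    H, W = image.shape
    xs = np.clip(((grid[..., 0] + 1) / 2 * (W - 1)).round().astype(int), 0, W - 1)
    ys = np.clip(((grid[..., 1] + 1) / 2 * (H - 1)).round().astype(int), 0, H - 1)
    return image[ys, xs]

img = np.arange(16.0).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])   # identity affine transform
out = sample(img, affine_grid(identity, 4, 4))
```

With the identity transform the output equals the input; a predicted theta would instead crop, rotate, or scale the input so downstream layers see a canonical view.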

Deep Convolutional Inverse Graphics Network

Link to paper

Learn a disentangled representation with 3D pose and lighting

Use the network to parse images and generate new ones

Training Very Deep Networks

Link to paper

10s or 100s of layers are very hard to deal with; the gradient explodes or vanishes

Highway networks - new architecture

Every unit learns a transform or carry mechanism, inspired by the LSTM gating mechanism

Can train hundreds of layers via SGD
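The transform-or-carry idea can be sketched in a few lines of numpy; this is a minimal single-layer sketch with toy weights, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    # The transform gate T interpolates between a nonlinear transform
    # H(x) and carrying the input x through unchanged.
    H = np.tanh(W_h @ x + b_h)       # candidate transform
    T = sigmoid(W_t @ x + b_t)       # transform gate in (0, 1)
    return H * T + x * (1.0 - T)     # carry gate is 1 - T

# With a strongly negative gate bias the layer starts out carrying its
# input through almost unchanged, which is what lets gradients flow
# through very deep stacks trained with plain SGD.
d = 4
x = np.ones(d)
W = np.zeros((d, d))
carried = highway_layer(x, W, np.zeros(d), W, np.full(d, -20.0))
```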

Where Are They Looking?

Link to paper

Gaze following

Build system able to follow gaze

Built the largest data set for this task, with 130k people

Gaze network

Gaze pathway uses just a crop of the head of a person to infer gaze direction

Saliency pathway uses full image

Final output is computed

Trained end to end

Gaze mask indicates gaze direction

Combine with saliency
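How the two pathways might combine, as I understood the talk: the gaze pathway yields a mask of plausible gaze directions, the saliency pathway a map of interesting regions, and combining them keeps only salient regions along the gaze direction. The elementwise product below is my assumption about that combination, with made-up toy values:

```python
import numpy as np

# Toy 3x3 maps: the gaze mask says "looking toward the upper right",
# the saliency map says the center-right region is interesting.
gaze_mask = np.array([[0.0, 0.9, 0.9],
                      [0.0, 0.1, 0.9],
                      [0.0, 0.0, 0.1]])
saliency  = np.array([[0.2, 0.1, 0.3],
                      [0.8, 0.2, 0.9],
                      [0.1, 0.7, 0.2]])

# Keep only salient regions that lie along the gaze direction.
combined = gaze_mask * saliency
target = np.unravel_index(combined.argmax(), combined.shape)
```

The salient region at the lower left is suppressed because it is off the gaze direction, even though it scores highly on saliency alone.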

Learning to Segment Object Candidates

Link to paper

Given an input image, create masks that fully contain all instances in that image

Attention-Based Models for Speech Recognition

Link to paper

Repeated context and repeated sounds; want to scale to longer inputs at test time

Classical model at each step is to look at all frames

Add a location mechanism that can tell where frames are relative to the previous attention location, which helps with the repetitive-content problem

How to scale to long inputs

The solution is able to decode utterances 20x longer than those in the training set
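A minimal sketch of that location mechanism, under my reading of the talk: scores for the current step depend not only on content but also on features extracted (here by a simple 1-D convolution) from the previous step's attention weights, so the model knows where it last attended and can tell repeated frames apart. All weights are toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def location_attention(content_scores, prev_attn, conv_kernel, w_loc):
    # Features of the previous attention distribution, one per frame.
    loc_feats = np.convolve(prev_attn, conv_kernel, mode="same")
    # Location features bias the content scores before the softmax.
    return softmax(content_scores + w_loc * loc_feats)

T = 6
content = np.zeros(T)                 # identical content at every frame
prev = np.zeros(T); prev[2] = 1.0     # last step attended frame 2
kernel = np.array([0.25, 0.5, 0.25])  # smooths the previous weights
attn = location_attention(content, prev, kernel, w_loc=4.0)
```

Even with identical content everywhere (the repeated-sounds case), the attention stays anchored near the previously attended frame instead of spreading uniformly.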
