These are my notes from the morning sessions of day 3 of NIPS.
Note that I took these notes on an iPad, which means that there might be spelling mistakes. Also, the screenshots are not of every slide, only the ones I found interesting or which had mathematical notation I wanted to capture.
The notes for the different talks are collapsed below using my stretchtext.js library; simply click one to open or close it and read more about that specific talk.
Deep Visual Analogy-Making
What they do differently is train a CNN end to end to learn analogical features
Seems like a hard problem
Encode each input image with a learned function and subtract the resulting feature vectors (see the sketch below)
Additive encoder vs. multiplicative encoder vs. deep encoder
The deep model does the best, compared to the additive and multiplicative ones
One network that can do multiple transformations
Pretty cool results showing a cartoon character being animated from the transformation vector, even at different angles
Form training tuples of animations where the training samples move forwards and backwards in time
Extrapolating animations by analogy
Then, apply these to a query animation of an unseen character
Disentangling car pose and appearance
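To make the subtraction idea concrete, here is a minimal sketch of the additive analogy formulation, assuming a feature encoder f and a decoder g (placeholders below; in the paper these are a CNN encoder and a decoder trained end to end on analogy tuples):

```python
import numpy as np

# Placeholder encoder/decoder; in the paper these are learned networks.
def f(image):
    """Encode an image into a feature vector (placeholder: flatten)."""
    return image.reshape(-1).astype(np.float32)

def g(embedding, shape):
    """Decode a feature vector back into image space (placeholder: reshape)."""
    return embedding.reshape(shape)

def additive_analogy(a, b, c):
    """Solve a : b :: c : ? by vector arithmetic in feature space."""
    t = f(b) - f(a)              # transformation vector from the example pair
    return g(f(c) + t, c.shape)  # apply it to the query image and decode

# Toy usage: apply the a -> b transformation to a new query image c.
a, b, c = np.zeros((8, 8)), np.ones((8, 8)), np.full((8, 8), 0.5)
d = additive_analogy(a, b, c)
```

The multiplicative and deep variants replace the simple addition with a richer interaction between the transformation vector and the query features, which is why the deep model comes out on top in their comparison.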
End-to-End Memory Networks
Motivation: RNNs handle temporal structure and CNNs handle spatial structure, but both struggle with out-of-order access, long-term dependencies, and unordered sets
Example: question answering, where the info needed might appear out of order and involve long-term dependencies
Propose NN model with external memory
Reads from memory with soft attention, performs multiple lookups on memory, can be trained with backprop
New model uses soft attention and only needs supervision on final output
Dot product between memory vector and controller vector
Then take softmax over these, which can be considered attention (see the sketch below)
Main difference with prior work is soft vs hard attention and where supervision is needed
Better to replace bag of words with position encoding, inject random noise into the time embedding, and use joint training
Exploring future work on writing and playing games
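A minimal sketch of the read step described above (dot product scores, softmax attention, weighted sum), using toy NumPy stand-ins for the learned embeddings; the variable names are mine, not the paper's:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(u, M, C):
    """One soft memory lookup ("hop").
    u: controller/query vector, shape (d,)
    M: input memory embeddings, one row per memory slot, shape (n, d)
    C: output memory embeddings, shape (n, d)"""
    p = softmax(M @ u)   # dot product with the controller, softmax = attention
    o = C.T @ p          # attention-weighted read from the output embeddings
    return u + o         # updated controller state, fed into the next hop

# Multiple lookups: repeat the read, feeding the controller back in each time.
rng = np.random.default_rng(0)
u = rng.normal(size=4)
M, C = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
for _ in range(3):
    u = memory_read(u, M, C)
```

Because every step is differentiable (no hard memory addressing), the whole thing trains with backprop using supervision only on the final output, which is the main difference from prior work noted above.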
Spatial Transformer Networks
Deep Convolutional Inverse Graphics Network
Training Very Deep Networks
Where Are They Looking?
Gaze following
Build system able to follow gaze
Built the largest such data set, with 130k people
Gaze network
Gaze pathway uses just a crop of the head of a person to infer gaze direction
Saliency pathway uses full image
Final output is computed by combining the two pathways
Trained end to end
Gaze mask indicates gaze direction
Combine with saliency to produce the final prediction (see the sketch below)
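A rough sketch of how the two pathways fit together; the uniform maps and the elementwise combination are placeholders of mine, since the notes only say that the gaze mask is combined with saliency and that the whole model is trained end to end:

```python
import numpy as np

def gaze_pathway(head_crop, head_position):
    """Stand-in for the gaze pathway: from a crop of the person's head and its
    position, produce a mask over the image indicating likely gaze direction.
    Placeholder: a uniform mask on a coarse grid."""
    return np.full((13, 13), 1.0 / (13 * 13))

def saliency_pathway(full_image):
    """Stand-in for the saliency pathway: from the full image, produce a map of
    locations that tend to attract gaze. Placeholder: a uniform map."""
    return np.full((13, 13), 1.0 / (13 * 13))

def predict_gaze_target(full_image, head_crop, head_position):
    """Combine the gaze mask with the saliency map (elementwise here, as an
    assumption for this sketch) and return the argmax cell as the prediction."""
    combined = gaze_pathway(head_crop, head_position) * saliency_pathway(full_image)
    return np.unravel_index(np.argmax(combined), combined.shape)
```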
Learning to Segment Object Candidates
Attention-Based Models for Speech Recognition
Challenges: repeated context and repeated sounds, and wanting to scale to longer inputs at test time
The classical model looks at all frames at each step
Add a location mechanism that can tell frames apart relative to the previous attention location, which helps with the repetitive content problem (see the sketch below)
How to scale to long inputs
Their solution is able to decode utterances 20x longer than those in the training set
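Here is a small sketch of what such a location mechanism can look like: the previous attention weights are convolved with a filter so that each frame's score also depends on where the model was attending a step ago. The additive MLP scorer below is my assumption for illustration, not necessarily the paper's exact parameterisation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def location_aware_attention(s, H, alpha_prev, params):
    """One attention step over encoded frames.
    s: decoder state, shape (ds,)
    H: encoded frames, shape (n, dh)
    alpha_prev: previous attention weights over the frames, shape (n,)"""
    W, V, U, w, F = params
    # Location features: convolve the previous alignment with a filter, so the
    # scorer knows each frame's position relative to the last focus of attention.
    loc = np.convolve(alpha_prev, F, mode="same")
    scores = np.array([w @ np.tanh(W @ s + V @ h + U * l)
                       for h, l in zip(H, loc)])
    return softmax(scores)

# Toy dimensions: 10 frames, decoder state of size 4, frame encodings of size 3.
rng = np.random.default_rng(0)
n, ds, dh, da = 10, 4, 3, 5
params = (rng.normal(size=(da, ds)), rng.normal(size=(da, dh)),
          rng.normal(size=da), rng.normal(size=da), rng.normal(size=3))
alpha = np.full(n, 1.0 / n)
alpha = location_aware_attention(rng.normal(size=ds), rng.normal(size=(n, dh)),
                                 alpha, params)
```

Because the scores now depend on position relative to the last attended frame rather than on content alone, similar-sounding frames elsewhere in a long utterance no longer capture the attention, which is what lets decoding scale to much longer inputs than those seen in training.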