We introduce the task of directly modeling a visually intelligent agent.
Computer vision typically focuses on solving individual subtasks related to
visual intelligence. We depart from this standard approach and instead model
the agent itself. Our model takes visual
information as input and directly predicts the actions of the agent. To this
end, we introduce DECADE, a large-scale dataset of egocentric videos recorded
from a dog's perspective, together with her corresponding movements. Using this data, we
model how the dog acts and how she plans her movements. We show, under a
variety of metrics, that given only visual input we can successfully model this
intelligent agent in many situations. Moreover, the representation learned by
our model encodes information distinct from that of representations trained on
image classification, and our learned representation can generalize to other
domains. In particular, we show strong results on the task of walkable surface
estimation when this dog-modeling task is used for representation learning.