
What are they doing?: Collective Activity Classification Using Spatio-Temporal Relationship Among People
Wongun Choi
University of Michigan
Ann Arbor, USA
wgchoi@umich.edu

Khuram Shahid
University of Michigan
Ann Arbor, USA
kshahid@umich.edu

Silvio Savarese
University of Michigan
Ann Arbor, USA
silvio@eecs.umich.edu

Abstract

In this paper we present a new framework for pedestrian action categorization. Our method enables the classification of actions whose semantics can only be analyzed by looking at the collective behavior of pedestrians in the scene. Examples of such actions are waiting by a street intersection versus standing in a queue. To that end, we exploit the spatial distribution of pedestrians in the scene, as well as their pose and motion, to achieve robust action classification. Our proposed solution employs extended Kalman filtering for tracking detected pedestrians in 2½D scene coordinates, as well as camera parameter and horizon estimation for tracker filtering and stabilization. We present a local spatio-temporal descriptor that is effective in capturing the spatial distribution of pedestrians over time as well as their pose. This descriptor captures pedestrian activity while requiring no high-level scene understanding. Our work is tested against highly challenging real-world pedestrian video sequences captured by low-resolution hand-held cameras. Experimental results on a 5-class action dataset indicate that our solution: i) is effective in classifying collective pedestrian activities; ii) is tolerant to challenging real-world conditions such as variation in illumination, scale, and viewpoint, as well as partial occlusion and background motion; iii) outperforms state-of-the-art action classification techniques.

Figure 1. Example of queueing (left) and talking (right) actions. By just looking at one individual, it is very hard to tell whether this person is in a queue or talking. However, by looking at what the surrounding people are doing, the actions can be disambiguated. We aim to solve this problem for videos captured by unstabilized cameras under generic viewing conditions.

1. Introduction

Consider a video sequence capturing a number of individuals located in an indoor environment such as a coffee shop. Imagine an algorithm that is able to process the video and answer questions such as: Are these people talking? Are they in a queue waiting to order food or drink? By looking at each person in isolation, it may be challenging to design an algorithm that can address these questions (Fig. 1). In this paper we introduce a new paradigm for recognizing human actions: rather than classifying individuals in isolation, we analyze their collective behavior so as to reinforce the recognition of each individual's actions. This paradigm is inspired by recent contributions in computer vision where semantic or geometrical contextual information is used to help recognize objects in complex scenes [14]. In this work, action classification is enhanced by taking advantage of contextual information that comes from the position, pose, and actions of multiple individuals in the surrounding area. Unlike many previous methods of human action recognition, we aim to work under unrestrictive conditions such as dynamic cluttered backgrounds, variations in illumination and viewpoint, intra-class variability in human appearance, and non-static cameras.

Our algorithm is built upon the robust detection of humans using a deformable part-based detector [11] and the HOG descriptor [7] for classifying human poses. We introduce a new algorithm based on the extended Kalman filter that enables robust tracking of each detected human for a number of frames. Our algorithm incorporates into the feedback loop the estimation of rough camera parameters, the scene horizon line, and the 2½D location of each tracked individual. This makes the recovery of each person's trajectory in parameter space robust with respect to camera