Video genre classification using compressed domain visual features

Gillespie, Warwick James

doi:10.25959/23232398.v1



thesis

posted on 2023-05-26, 20:35 authored by Gillespie, Warwick James

With the rapid growth in the prevalence of digital video in the world comes the need for efficient and effective management of such information. The field of content-based video indexing and retrieval aims to achieve this through the automatic recognition of the structure and content of video data and the indexing of both low-level formative features and high-level semantic features. Two of the main problems facing this field are which low-level features can be used, and how to 'bridge the semantic gap' between low-level features and high-level understanding of a video sequence. In this thesis we propose a new method that automatically classifies video shots into broadly defined video genres. The classification serves as the first step in the indexing of a video sequence and its subsequent retrieval from a large database, partitioning the database into more manageable sub-units according to genre, e.g. sport, drama, scenery, news reading. As the transmission and storage of digital video is commonly in a compressed format (e.g. the MPEG-1, -2, and -4 standards), it is efficient for any processing to occur in the compressed domain. In this thesis, video files compressed in the MPEG-1 format are considered, although the majority of methods presented can be easily adapted for use with MPEG-2 and MPEG-4 formats. For indexing purposes an MPEG-1 file contains spatial information in DCT coefficients and temporal information in motion vectors. The reliability of the MPEG motion vectors is evaluated using a spatial block activity factor estimated from DCT coefficients, so that vectors which do not represent the true motion within a video sequence can be discarded. The thesis also presents a robust camera motion estimation technique, based on Least Median-of-Squares regression, to minimise the influence of outliers due to object motion and wrongly predicted motion vectors.
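As an illustration of the Least Median-of-Squares idea mentioned above, the following is a minimal sketch and not the estimator developed in the thesis. It assumes a hypothetical three-parameter camera model (pan, tilt, zoom) in which the motion vector (u, v) of a block centred at (x, y) satisfies u = pan + zoom*x and v = tilt + zoom*y; the function names, the model, and the number of random trials are all assumptions made for the example. Candidate models are fitted to random pairs of vectors, and the candidate with the smallest median squared residual over the whole field is kept, so that a minority of vectors caused by object motion does not pull the estimate.

```python
import random
from statistics import median


def fit_pan_tilt_zoom(sample):
    """Least-squares fit of u = pan + zoom*x, v = tilt + zoom*y to a small sample.

    Each sample item is (x, y, u, v). The model is an assumption for this sketch.
    """
    n = len(sample)
    mx = sum(s[0] for s in sample) / n
    my = sum(s[1] for s in sample) / n
    mu = sum(s[2] for s in sample) / n
    mv = sum(s[3] for s in sample) / n
    denom = sum((x - mx) ** 2 + (y - my) ** 2 for x, y, _, _ in sample)
    if denom == 0:
        return None  # degenerate sample: all blocks at the same position
    zoom = sum((x - mx) * (u - mu) + (y - my) * (v - mv)
               for x, y, u, v in sample) / denom
    return (mu - zoom * mx, mv - zoom * my, zoom)


def squared_residuals(model, vectors):
    """Squared distance between each observed vector and the model prediction."""
    pan, tilt, zoom = model
    return [(u - pan - zoom * x) ** 2 + (v - tilt - zoom * y) ** 2
            for x, y, u, v in vectors]


def lmeds_camera_motion(vectors, trials=200, seed=0):
    """Keep the candidate model whose median squared residual is smallest."""
    rng = random.Random(seed)
    best_model, best_median = None, float("inf")
    for _ in range(trials):
        model = fit_pan_tilt_zoom(rng.sample(vectors, 2))
        if model is None:
            continue
        med = median(squared_residuals(model, vectors))
        if med < best_median:
            best_model, best_median = model, med
    return best_model, best_median
```

Because the score is a median rather than a sum, the estimate is unaffected as long as fewer than half of the vectors are outliers; an ordinary least-squares fit over the same field would be biased by every outlying vector.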
The results produced by the proposed technique show a significant improvement in the sensitivity to object motion when compared to those produced by an M-estimator technique. Robust motion intensity metrics are also presented for camera and object motion, calculated from the estimated camera model and the MPEG motion vector field after the filtering of unreliable vectors. A novel metric called activity power flow, based on the activity factor used in the motion vector field filtering, is introduced to effectively capture the spatio-temporal evolution of scenes through a video shot. These shot-based, low-level, global features represent both the spatial content of a shot and the motion in a shot, whether due to movement of the camera or of objects. The thesis also compares several machine learning techniques for transforming low-level visual features into high-level semantics, in particular Radial Basis Function (RBF) networks, with a focus on a tree-based RBF network. In this network, the result of a binary classification tree is used to configure and to initialise the structure of the RBF network. Video shots in a database are classified into four video genres: Sport, News, Scenery, and Drama. This is believed to be the first shot-based video classification algorithm and the first method which uses only compressed domain features. Experimental results show that this method is both efficient, as processing is undertaken in the compressed domain, and effective, providing a classification accuracy which is comparable, where comparison is possible, to that of previous techniques. For the genre set {sport, news, cartoon, commercial, music}, the best classification accuracies seen in previous works are 83.1% [131] using just visual features and 87% [133] using combined audio and visual features, compared with a classification accuracy of 83.6% presented in this thesis.
For the genre set {sport, news, cartoon, drama, music}, previous work [134] reported classification accuracies of 72.0% for visual features and 88.8% using combined audio and visual features, compared with a classification accuracy of 86.2% in this thesis.
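The idea of using a binary classification tree to configure and initialise an RBF network can be sketched as follows. This is a simplified illustration under stated assumptions, not the network described in the thesis: the tree is grown with Gini impurity, each leaf becomes one Gaussian unit whose centre and width are the mean and spread of the training samples in that leaf, and the output weights are simply the class proportions in the leaf rather than trained values. All function names, the `min_width` floor, and the depth limit are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (feature, threshold) that lowers weighted Gini impurity, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def tree_leaves(X, y, depth=0, max_depth=3):
    """Grow a binary classification tree; return its leaves as (rows, labels)."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return [(X, y)]
    f, t = split
    leaves = []
    for keep_left in (True, False):
        part = [(r, l) for r, l in zip(X, y) if (r[f] <= t) == keep_left]
        rows, labels = [p[0] for p in part], [p[1] for p in part]
        leaves += tree_leaves(rows, labels, depth + 1, max_depth)
    return leaves


def rbf_units_from_leaves(leaves, min_width=0.1):
    """One Gaussian unit per leaf: (centre, width, class weights)."""
    units = []
    for rows, labels in leaves:
        dims = range(len(rows[0]))
        centre = [sum(r[d] for r in rows) / len(rows) for d in dims]
        var = sum(sum((r[d] - centre[d]) ** 2 for d in dims) for r in rows) / len(rows)
        width = max(math.sqrt(var), min_width)  # floor avoids zero-width units
        weights = {c: n / len(labels) for c, n in Counter(labels).items()}
        units.append((centre, width, weights))
    return units


def classify(units, x):
    """Sum weighted Gaussian activations per class and return the top class."""
    scores = Counter()
    for centre, width, weights in units:
        d2 = sum((a - b) ** 2 for a, b in zip(x, centre))
        activation = math.exp(-d2 / (2 * width ** 2))
        for c, w in weights.items():
            scores[c] += w * activation
    return scores.most_common(1)[0][0]
```

With shot-level feature vectors in `X` and genre labels in `y`, the units would be built as `rbf_units_from_leaves(tree_leaves(X, y))` and a new shot classified with `classify(units, features)`. In a fuller implementation the output weights would normally be fitted to the training data instead of being read off the leaf class proportions.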

History

Publication status

Unpublished

Rights statement

Copyright 2006 the Author - The University is continuing to endeavour to trace the copyright owner(s) and in the meantime this item has been reproduced here in good faith. We would be pleased to hear from the copyright owner(s).

Thesis (PhD)--University of Tasmania, 2006. Includes bibliographical references.

Ch. 1. Introduction -- Ch. 2. Content-based video indexing and retrieval -- Ch. 3. Semantic video processing -- Ch. 4. Compressed domain video analysis -- Ch. 5. Low-level visual features -- Ch. 6. High-level semantic classification -- Ch. 7. Video genre classification results -- Ch. 8. Conclusions and further research