Mobile content-based multimedia analysis has attracted much attention with the growing popularity of high-end mobile devices. Most previous systems focus on mobile visual search, i.e., searching for images with visually duplicate or near-duplicate objects (e.g., products and landmarks). There remains a strong need for effective mobile video classification solutions, where videos that are not visually duplicate or near-duplicate but belong to similar high-level semantic categories can be identified. In this work, we develop a mobile video classification system based on multi-modal analysis. On the mobile side, both visual and audio features are extracted from the input video, and these features are further compressed into compact hash bits for efficient transmission. On the server side, the received hash bits are used to compute the audio and visual Bag-of-Words representations for multi-modal concept classification. We propose a novel method in which hash functions are learned from the multi-modal information carried by the visual and audio codewords. Compared with the traditional approach of computing visual-based and audio-based hash functions separately from raw visual and audio local features, our method exploits the co-occurrences of audio and visual codewords as augmenting information and significantly improves classification performance. The cost budget of our system for mobile data storage, computation, and transmission is similar to that of state-of-the-art mobile visual search systems. Extensive experiments over 10,000 YouTube videos show that our system achieves classification accuracy comparable to that of conventional server-based video classification systems using uncompressed raw descriptors.
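The compression step described above, turning real-valued local descriptors into compact hash bits before transmission, can be illustrated with plain random-projection (locality-sensitive) hashing. This is a minimal sketch for intuition only: the paper's actual method learns hash functions from audio-visual codeword co-occurrences, whereas the function below uses unsupervised random hyperplanes, and all names and the 64-bit code length are illustrative assumptions.

```python
import numpy as np

def lsh_hash(features, n_bits=64, seed=0):
    """Compress real-valued descriptors (rows of `features`) into binary
    hash codes by taking the sign of random hyperplane projections.
    Nearby descriptors tend to receive codes with small Hamming distance."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    # One random hyperplane per output bit.
    hyperplanes = rng.standard_normal((dim, n_bits))
    return (features @ hyperplanes > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

# Example: hash ten 128-dimensional descriptors (SIFT-like size) to 64 bits.
descriptors = np.random.default_rng(1).standard_normal((10, 128))
codes = lsh_hash(descriptors)
```

Transmitting 64 bits per descriptor instead of 128 floats (4,096 bits) is the kind of bandwidth saving that makes the mobile-to-server pipeline practical; the server then maps the received codes onto codewords to rebuild the Bag-of-Words representations.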
International Journal of Multimedia Information Retrieval – Springer Journals
Published: Dec 11, 2012