SIAT Architecture

SIAT: Content Based Video Retrieval
Introduction

Manual retrieval and browsing of video data are laborious and time-intensive tasks for users. The primary objective of the retrieval process is to present the videos of interest to the user. This broad domain of research is referred to as content-based video retrieval (CBVR). It has a wide range of promising applications in domains such as law enforcement, crime investigation & prevention, web searching, journalism and advertisement, medical diagnosis, and education & training.

Most research in the literature has focused on specific aspects of video retrieval without exploiting the power and effectiveness of distributed deep learning. Deep convolutional neural networks (CNNs) have led to breakthroughs in computer vision, but they come at a substantial computational cost. This cost can be addressed effectively with in-memory distributed computing. A primary challenge that video data poses, even for big-data technologies, is its sheer volume. Video operations are far more expensive and time-consuming than operations on other data types because video data tend to be "noisy, unsegmented, high entropy and often multi-dimensional". High redundancy caused by spatiotemporal correlation is another issue to be addressed. Distributed in-memory computation combined with deep learning can resolve these challenges. In this work, we propose FALKON, a novel CBVR system that harnesses Big-Data processing tools, deep learning, and a distributed in-memory computation framework. The proposed system is composed of four modules, discussed in the following sections: video data preprocessing, deep feature learning, indexing, and retrieval.

The main contributions of this work are:

  • We propose a modular, pluggable, layered architecture for large-scale video data retrieval that supports both live and offline video processing.
  • We provide a video data abstraction mechanism for Spark in-memory computation.
  • We use distributed in-memory (Spark-based) deep learning frameworks for accuracy, efficiency, fault tolerance, and scalability.
  • We provide support for multi-type deep feature learning for video retrieval.
  • We enhance the Video Retrieval Query Map (VRQM) concept to improve retrieval accuracy and usability, and add a user feedback mechanism.
  • We implement the proposed system using distributed data management and distributed in-memory computing technologies. For evaluation, we establish an in-house cloud setup consisting of ten machines and evaluate our system on three benchmark datasets. The experimental results show that FALKON outperforms baseline solutions with satisfactory retrieval accuracy.
Proposed FALKON Architecture

FALKON is intended to be a service under the SIAT platform (Uddin et al., 2019), where real-time video streams and batch data are acquired and contextual intelligent video analytics is performed in both near real-time and offline manners. FALKON follows a lambda-style architecture consisting of four layers: Big-Data Curation Layer (BDCL), Video Data Processing Layer (VDPL), Video Data Mining Layer (VDML), and Web Service Layer (WSL), as shown in Fig. 1.

BDCL is the foundation layer of FALKON; it acquires and manages large-scale real-time video streams and batch video data using a distributed messaging system and a distributed file system, respectively, to ensure scalability. VDPL performs video processing operations and consists of three components: Video Grabber, Structure Analyzer, and Feature Extractor. The Video Grabber acquires video data from the distributed stream broker in the case of streaming, and video datasets from the Big-Data Store in the case of batch processing. The video data are then fed to the Structure Analyzer component for initial operations such as frame and metadata extraction and preprocessing. The extracted frames of interest are forwarded to the Feature Extractor component for deep spatial and temporal feature extraction. VDML is responsible for Video Retrieval Query Map (VRQM) generation and similarity operations. Finally, the Web Service Layer provides a web-based user interface for content-based video retrieval.



Fig. 1. Layered Architecture of the proposed FALKON System

Big Data Curation Layer

The Big Data Curation Layer (BDCL) forms the foundation layer of FALKON. It is responsible for acquiring and managing large-scale video streams from connected sources such as surveillance cameras, as well as offline video data. It also provides distributed storage for indexing and retrieval of the features extracted from the video data.
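As an illustration of how a distributed messaging broker can keep each camera's stream ordered while scaling out, the sketch below hashes camera IDs to partitions, so all frames from one camera land on the same partition. The partition count, message keying, and hashing scheme are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch: key frame messages to broker partitions by camera ID
# so each camera's frames stay ordered on a single partition.
import hashlib

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration

def partition_for(camera_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a camera ID to a broker partition."""
    digest = hashlib.md5(camera_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Frames from the same camera always map to the same partition,
# which preserves per-camera frame order under parallel consumption.
```

Real brokers such as Kafka apply the same idea via the message key, so a consumer reading one partition sees a single camera's frames in acquisition order.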

Video Data Processing Layer

The Video Data Processing Layer (VDPL) is responsible for pre-processing and deep feature learning on video streams (micro-batches) and batch videos residing in the distributed broker & DVDS, respectively. VDPL is composed of three components: Video Grabber (VGR), Structure Analyzer (SAN), and Deep Feature Extractor (FEX).

After preprocessing, the frames in the VidRDD (the video data abstraction over Spark RDDs) are fed to the FEX component to extract deep features. Spatial features are computed using a VGG16-based deep Spatial Feature Extractor (SFE), and temporal features using a Volume Local Binary Pattern (VLBP)-based Temporal Feature Extractor (TFE). The keyframe features are then aggregated to yield video-level features, which are stored in the FeatureRDD (FeRDD). The FeRDD holds the spatial features and the objects detected by the SFE and the Deep Object Extractor, respectively. Once the FeRDD is computed, it is immediately sent to the Feature Indexer module to be indexed in the Feature Data Store (FDS). This process is shown in Fig. 2.
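The aggregation step can be illustrated with a minimal, framework-free sketch: per-keyframe feature vectors (placeholders standing in for the VGG16/VLBP outputs) are mean-pooled into one video-level feature and stored in a dictionary standing in for the FDS. The pooling choice and vector sizes are illustrative assumptions, not details from the paper.

```python
# Sketch of the FEX aggregation step: mean-pool per-keyframe feature
# vectors into a single video-level feature, then "index" it.

def aggregate_keyframe_features(keyframe_feats):
    """Mean-pool a list of equal-length keyframe feature vectors."""
    n = len(keyframe_feats)
    dim = len(keyframe_feats[0])
    return [sum(f[i] for f in keyframe_feats) / n for i in range(dim)]

feature_store = {}  # stand-in for the Feature Data Store (FDS)

def index_video(video_id, keyframe_feats):
    """Aggregate a video's keyframe features and store the result."""
    feature_store[video_id] = aggregate_keyframe_features(keyframe_feats)

index_video("v001", [[1.0, 2.0], [3.0, 4.0]])
# feature_store["v001"] is now [2.0, 3.0]
```

In the actual pipeline this pooling would run as a Spark transformation over the VidRDD, with the dictionary replaced by the distributed FDS.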



Fig. 2. Distributed in-memory video operations

Video Data Mining Layer

The Video Data Mining Layer (VDML) is responsible for carrying out data mining operations on videos such as object detection, classification, and feature similarity measurement. It consists of three modules: Deep Object Extractor (DOE), Query Map Generator (QMG), and Feature Similarity Estimator (FSE).
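The text does not spell out the similarity measure used by the FSE; a common choice for comparing deep feature vectors is cosine similarity, sketched below with plain Python lists under that assumption.

```python
# Minimal cosine-similarity sketch for the FSE (assumed measure;
# the paper's exact similarity function may differ).
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate vectors have no meaningful direction
    return dot / (norm_a * norm_b)
```

A score near 1.0 indicates near-identical feature directions; scores near 0.0 indicate unrelated content.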

Web Service Layer

To interact with the system, we designed a Web Service Layer (WSL), which allows users to search & retrieve video content. WSL supports two types of queries: query by image and query by a short video clip. Once the user uploads a query image or video clip, it is processed by the system, and the relevant videos are ranked by a ranking function and returned to the user along with the query map. The query map provides insights that users can apply to fine-tune their search.

FALKON Evaluation and Discussion
Experimental Setup

For testing and evaluation, we built an in-house distributed cloud environment, the SIAT cluster, consisting of five machines. Hortonworks Data Platform (HDP) version 3.1.0 is deployed on the cluster. The structure of the cluster and the specification of each node are given in Fig. 3.



Fig. 3. SIAT Cluster for evaluation.

Dataset

We evaluated our system on the UCF101 dataset, which consists of realistic action videos collected from YouTube with large variations in camera motion, object appearance & pose, object scale, viewpoint, cluttered background, and illumination conditions. The dataset comprises 101 action categories, with the videos of each category grouped into 25 groups of 4-7 videos each. These characteristics make it one of the most challenging datasets. The frame and keyframe distributions of the dataset are shown in Fig. 4.




Fig. 4. UCF-101 frame and keyframe distribution.

We divide the dataset into indexing and testing datasets with a 60-40 partitioning: 60% of the videos are used for indexing and 40% for testing. To speed up the partitioning, we developed a utility that performs the division automatically based on user-defined criteria. The utility takes the dataset as a whole and splits it into indexing and testing datasets, carefully dividing the videos of each group in each category and ensuring that no video appears in both datasets.
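A minimal sketch of such a splitting utility is shown below, assuming the dataset is presented as a mapping from (category, group) pairs to video paths; the fixed seed and per-group rounding are illustrative choices, not details from the paper.

```python
# Group-aware 60/40 split: each (category, group) is divided separately,
# so indexing and testing sets never share a video.
import random

def split_dataset(groups, index_ratio=0.6, seed=42):
    """groups: dict mapping (category, group_id) -> list of video paths.
    Returns (indexing_videos, testing_videos), disjoint by construction."""
    rng = random.Random(seed)  # fixed seed for reproducible splits
    index_set, test_set = [], []
    for key in sorted(groups):
        videos = sorted(groups[key])
        rng.shuffle(videos)
        cut = max(1, round(len(videos) * index_ratio))
        index_set.extend(videos[:cut])
        test_set.extend(videos[cut:])
    return index_set, test_set
```

Because each group is split independently, every category and group is represented in both partitions, mirroring the "carefully dividing videos of each group in each category" requirement.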

Query Processing

To make FALKON user-friendly and flexible, we designed it to support two kinds of queries that are processed in different ways. For a query video clip, we analyze the query clip first by extracting its frames and then extracting the features for matching. For a query image, we use spatial features only. In both cases, the same feature extraction algorithm is used to compute feature vectors, which are then looked up in the index.
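Under the assumption of a cosine-based similarity measure and the 0.75 threshold reported in the evaluation, the shared query path can be sketched as follows; a query image is simply treated as a one-frame clip, so both query types flow through the same feature routine. The index layout and helper names are illustrative.

```python
# Shared query path sketch: image queries are one-frame clips; both query
# types are mean-pooled by the same routine, then matched against the index.
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def query_features(frames):
    """Mean-pool per-frame feature vectors (one frame for an image query)."""
    n, dim = len(frames), len(frames[0])
    return [sum(f[i] for f in frames) / n for i in range(dim)]

def retrieve(frames, index, threshold=0.75):
    """Return (video_id, score) pairs above the threshold, best first."""
    q = query_features(frames)
    hits = [(vid, _cos(q, feat)) for vid, feat in index.items()]
    return sorted([h for h in hits if h[1] >= threshold], key=lambda p: -p[1])

index = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}  # toy feature index
print(retrieve([[1.0, 0.1]], index)[0][0])  # top hit is "v1"
```

In the deployed system the feature routine would be the VDPL extractor, and the dictionary lookup would be a query against the Feature Data Store.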

Performance evaluation of VSAS

We register IP cameras and offline video stream sources with VSAS at the default frame rate of 30 FPS. In the case of an offline video stream, a video file residing on an HDD is configured as a VSAS source. VSAS converts each acquired frame into a formal message in about 6 ms; after compression (using Snappy), the message size is reduced to approximately 140.5 KB. The message is then forwarded to the broker server in about 12 ms. On average, we can acquire 34 and 54 frames per second from IP-camera stream sources and offline stream sources, respectively; the achieved rates are 36% and 116%, respectively. Fig. 5 shows the VSAS performance testing results.
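As a rough sanity check on these timings, a strictly serial pipeline spending ~6 ms converting and ~12 ms forwarding each frame tops out near the observed offline rate. This assumes no overlap between the two stages, which is a simplification; the real pipeline may pipeline or batch these steps.

```python
# Back-of-the-envelope throughput estimate from the reported per-frame
# timings, assuming the two stages run strictly serially (a simplification).
convert_ms = 6    # frame -> message conversion time
forward_ms = 12   # message -> broker forwarding time

max_fps = 1000 / (convert_ms + forward_ms)  # frames per second ceiling
print(round(max_fps, 1))  # ~55.6, close to the observed 54 fps offline rate
```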



Fig. 5. VSAS performance evaluation


Evaluation discussion
We evaluated the indexing time for the dataset videos and obtained an average processing time of 9187.43 s on a single node. This covers the entire end-to-end flow: video loading, structure analysis, and feature extraction. We then repeated the experiments while gradually increasing the number of nodes. Fig. 6 shows the processing time of individual operations as a function of the number of frames.


Fig. 6. Processing time of individual operations

Fig. 7 shows the scalability testing results for the end-to-end processing time. For similarity search and video retrieval, FALKON also delivers strong retrieval performance. The similarity search is based on a predefined threshold value; we evaluated the accuracy for multiple threshold values and found that a threshold of 0.75 yields the best accuracy.



Fig. 7. Scalability results

For a query clip, we obtained a mean average precision score of 97.3%. The sample video query and results are shown in Fig. 8.
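The mean average precision (mAP) metric behind this score can be computed as sketched below; this is the standard definition, not code from the paper.

```python
# Standard mean average precision (mAP) over a set of retrieval queries.

def average_precision(ranked, relevant):
    """AP for one query: `ranked` is the ordered list of returned video IDs,
    `relevant` the set of ground-truth matches for the query."""
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked, start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """results: list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in results) / len(results)

print(average_precision(["a", "b"], {"a", "b"}))  # perfect ranking -> 1.0
```

A mAP of 97.3% therefore means that, averaged over the query clips, relevant videos were ranked almost perfectly at the top of the result lists.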



Fig. 8. Sample query example

Cite As

Khan, M. N., Alam, A., & Lee, Y. K. (2020, February). FALKON: Large-Scale Content-Based Video Retrieval Utilizing Deep-Features and Distributed In-memory Computing. In 2020 IEEE International Conference on Big Data and Smart Computing (BigComp) (pp. 36-43). IEEE.

Paper Link

https://ieeexplore.ieee.org/abstract/document/9070609