SIAT Architecture

SIAT: Video Knowledge Curation Layer
Introduction

There is increasing reliance on intelligent CCTV systems for effective analysis and interpretation of streaming data in order to recognize activities and ensure public safety. Monitoring videos captured by surveillance cameras is a challenging and time-consuming task, so automated analysis using computer vision methods is needed to extract spatial and temporal features and assist the authorities. Once videos are processed with computer vision technologies, the next issue is how to index the extracted low-level features for search, analysis, and browsing. How can the semantic gap between low-level features in Euclidean space and temporal relations across videos be bridged in a multi-stream environment? Similarly, how can petascale video data be handled in the cloud while extracting low-level and high-level features? To address these issues, in this paper we propose IntelliBVR, a layered architecture for large-scale distributed intelligent video retrieval that exploits deep-learning and semantic approaches. The base layer is responsible for large-scale video data curation. The second and third layers process and annotate videos, respectively, using deep learning on top of a distributed in-memory computing engine. In the knowledge curation layer, the extracted low-level and high-level features are mapped to the proposed ontology so that they can be searched and retrieved using semantically rich queries. Finally, we implement the system and report results that demonstrate the effectiveness of IntelliBVR.

The main contributions of this work are:

  • We propose IntelliBVR, a layered framework for video big data intelligent search, retrieval, and complex event analysis.
  • For large-scale video object and event extraction, we exploit the power of deep learning while using a distributed in-memory computation engine, i.e., Apache Spark, to ensure accuracy and scalability, respectively.
  • Under the Knowledge Curation Layer (KCL), we first standardize the basic nomenclature and then propose a semantic data model called IBVROnto, aiming to bridge the semantic gap between low-level representative features and high-level semantic contents.
  • The proposed system has been implemented and tested using distributed computing technologies, i.e., HDFS, HBase, Kafka, and Apache Spark. We configure a cluster of seven machines for the evaluation of IntelliBVR.
Background

In this work, we utilize distributed data management and computing technologies such as the Hadoop Distributed File System (HDFS), HBase, Apache Spark, and Kafka. HDFS is a distributed file system designed to store large-scale data reliably; it is fault-tolerant and scales to thousands of machines, even on commodity hardware. HBase is a non-relational, distributed database inspired by Google's Bigtable that runs on top of HDFS. Apache Kafka is a publish-subscribe messaging system that acts as a broker among applications. Apache Spark is an in-memory, high-speed, interactive, and distributed data computation engine.

For semantic-based video retrieval, we use Semantic Web technologies. In the Semantic Web, data available on the web is enriched with meaning in a machine-understandable form. The Semantic Web provides standards for domain modeling using ontologies, the Resource Description Framework (RDF) for connecting video annotations, and semantic rules for reasoning over data. Thus, semantic technologies can be utilized to annotate and enrich the features extracted from videos.

Proposed IntelliBVR

We propose IntelliBVR for large-scale distributed intelligent surveillance video analysis in the cloud, exploiting state-of-the-art cloud computing technologies. IntelliBVR is intended to be an intelligent video retrieval and analysis service within the SIAT cloud platform. SIAT is a service-oriented cloud platform in which real-time video streams and batch data are acquired, and contextual intelligent video analytics is performed in near real-time and offline manners. IntelliBVR follows a lambda-style architecture consisting of four layers, i.e., the Big Data Curation Layer (BDCL), the Video Data Processing Layer (VDPL), the Deep Video Annotation Layer (DVAL), and the Knowledge Curation Layer (KCL), as shown in Fig. 1. We describe the details of each layer in the following subsections.

Big Data Curation Layer

The BDCL is the base layer of IntelliBVR, which acquires and manages large-scale real-time video streams, batch video data, and the extracted video features. The BDCL is composed of two main components, i.e., Video Stream Acquisition and Synchronization (VSAS), and Distributed Big Data Persistence (DBDP).



Fig. 1. Layered Architecture of the proposed IntelliBVR

The real-time video stream needs to be collected from the source device and forwarded to the executors for on-the-fly processing and video annotation. When handling a tremendous number of video streams, both processing and storage are subject to loss. To handle large-scale video stream acquisition in real time and to ensure scalability and fault tolerance, we develop the VSAS component on top of a distributed messaging system, i.e., Kafka. This component decodes the video stream, detects the frames, and performs the necessary operations on each frame, such as metadata extraction and frame resizing; each frame is then converted to a formal JSON message. These messages are serialized into mini-batches, compressed, and sent to the Distributed Broker, i.e., Kafka topic 't'. The acquired video streams then reside in the Kafka Broker's queue in the form of mini-batches, and the Video Stream Consumer Service (VSCS) reads these mini-batches from the respective topic on the Kafka Broker.
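A minimal sketch of such a producer with the Kafka Java client is given below; the topic name 't', camera identifier, and JSON field layout are illustrative assumptions (the paper does not publish this code), while the batch-size and Snappy settings mirror Table 1.

```java
// VSAS-style producer sketch: wrap one resized frame in a JSON message and
// send it to the Kafka topic 't' with Snappy compression and mini-batching.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Base64;
import java.util.Properties;

public class VideoStreamProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Settings mirroring Table 1: 20 MB mini-batches and Snappy compression.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 20 * 1024 * 1024);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical frame: in the real system frames come from an IPCam
            // or an offline video file and are resized to 480 x 320 first.
            byte[] framePixels = new byte[480 * 320 * 4];   // placeholder pixel buffer
            String json = String.format(
                "{\"cameraId\":\"cam-01\",\"timestamp\":%d,"
                + "\"width\":480,\"height\":320,\"frame\":\"%s\"}",
                System.currentTimeMillis(),
                Base64.getEncoder().encodeToString(framePixels));
            producer.send(new ProducerRecord<>("t", "cam-01", json));
        }
    }
}
```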

The second component of the BDCL is the DBDP, which provides distributed big-data persistence for both the structured data (extracted video features and metadata) and the unstructured data (raw video streams and batch data) of IntelliBVR. The DBDP provides two levels of abstraction over the acquired data, i.e., the Persistent Big Data Store (PBDS) and the Feature Structured Data Store (FSDS). The PBDS is in charge of providing permanent, distributed, large-scale video data storage and is built on top of HDFS. These videos are synchronized through the metadata stored in the FSDS. The FSDS is constructed on top of Apache HBase. When features (high-level, low-level, and dynamic event information) are extracted from the videos (both batch and micro-batches), they are indexed in the FSDS. Active and passive data readers and writers are developed to read and write real-time data and batch data to the PBDS and FSDS, respectively.
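The sketch below shows, under assumed table, column-family, and row-key names, how a single extracted object annotation could be indexed in the FSDS with the standard HBase Java client; the actual IntelliBVR schema is not specified in the paper.

```java
// Index one extracted object annotation in the FSDS (HBase) - illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureIndexer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table fsds = conn.getTable(TableName.valueOf("fsds"))) {
            // Hypothetical row key: videoId#segment#keyframe keeps all features
            // of a video co-located and easy to scan.
            Put put = new Put(Bytes.toBytes("video42#seg003#kf01"));
            put.addColumn(Bytes.toBytes("object"), Bytes.toBytes("label"),
                          Bytes.toBytes("car"));
            put.addColumn(Bytes.toBytes("object"), Bytes.toBytes("confidence"),
                          Bytes.toBytes("0.91"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("hdfsPath"),
                          Bytes.toBytes("/siat/videos/video42.mp4"));
            fsds.put(put);
        }
    }
}
```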

Video Data Processing Layer

The VDPL is developed to perform primitive video operations on video streams (micro-batches) and batch videos residing in the distributed broker and the PBDS, respectively. The VDPL includes five main components, i.e., the Metadata Extractor, Key Frame Extractor, Frame Extractor, Video Encoder, and Video Segment Extractor.
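As an illustration of one of these primitive operations, the following hypothetical keyframe-extractor sketch uses OpenCV's Java bindings with a simple color-histogram difference; the paper does not state which keyframe-selection criterion IntelliBVR actually uses, so the rule and threshold here are assumptions.

```java
// Keyframe extraction sketch: a frame becomes a keyframe when its histogram
// differs sufficiently from the previous keyframe's histogram.
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfFloat;
import org.opencv.core.MatOfInt;
import org.opencv.imgproc.Imgproc;
import org.opencv.videoio.VideoCapture;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class KeyFrameExtractor {
    static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); }

    static Mat histogram(Mat frame) {
        Mat hist = new Mat();
        Imgproc.calcHist(Collections.singletonList(frame), new MatOfInt(0),
                         new Mat(), hist, new MatOfInt(64), new MatOfFloat(0f, 256f));
        Core.normalize(hist, hist, 0, 1, Core.NORM_MINMAX);
        return hist;
    }

    public static List<Integer> keyFrameIndices(String videoPath, double threshold) {
        VideoCapture capture = new VideoCapture(videoPath);
        List<Integer> keyFrames = new ArrayList<>();
        Mat frame = new Mat();
        Mat previousHist = null;
        int index = 0;
        while (capture.read(frame)) {
            Mat hist = histogram(frame);
            if (previousHist == null
                    || Imgproc.compareHist(previousHist, hist,
                                           Imgproc.HISTCMP_BHATTACHARYYA) > threshold) {
                keyFrames.add(index);
                previousHist = hist;
            }
            index++;
        }
        capture.release();
        return keyFrames;
    }
}
```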

Deep Video Annotation Layer

This layer consists of two components, i.e., real-time deep video annotation and batch deep video annotation. These components perform in-memory, distributed deep spatial and temporal video annotation on the real-time video streams and the batch videos, respectively.
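A possible shape for the batch deep video annotation component, expressed with the Spark Java API, is sketched below; the Detector wrapper, manifest file, and HDFS paths are placeholders rather than the authors' implementation.

```java
// Batch deep annotation sketch: distribute video-segment paths to executors;
// each partition loads a (hypothetical) detector once and annotates its segments.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.ArrayList;
import java.util.List;

public class BatchDeepAnnotation {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IntelliBVR-DVAL");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Paths of video segments previously written to the PBDS (HDFS).
            JavaRDD<String> segmentPaths =
                sc.textFile("hdfs:///siat/segments/manifest.txt");

            JavaRDD<String> annotations = segmentPaths.mapPartitions(paths -> {
                Detector detector = Detector.load();     // hypothetical model wrapper
                List<String> out = new ArrayList<>();
                while (paths.hasNext()) {
                    out.add(detector.annotate(paths.next()));  // JSON annotation
                }
                return out.iterator();
            });

            // Annotations are then indexed in the FSDS / mapped by the KCL.
            annotations.saveAsTextFile("hdfs:///siat/annotations/");
        }
    }

    /** Placeholder for a deep model wrapper; not part of the paper. */
    interface Detector {
        static Detector load() { throw new UnsupportedOperationException("stub"); }
        String annotate(String segmentPath);
    }
}
```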

Knowledge Curation Layer

The KCL is responsible for mapping the extracted annotations to the semantic data model, i.e., the IBVROnto store, for intelligent searching and browsing. This layer is composed of three components, i.e., the VideoOntoMapper, the IBVROnto Store, and the SPARQL Query Browser. The VideoOntoMapper maps the extracted annotations to the proposed ontology, called IBVROnto; Apache Jena is used for this mapping. We present and explain the proposed IBVROnto and the basic terminology in the next sub-sections.
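The following minimal VideoOntoMapper-style sketch uses Apache Jena to map one detected object onto the IBVROnto 'KeyFrame' and 'Object' hierarchy; the namespace URI and instance identifiers are assumptions, while the class and property names follow Fig. 2.

```java
// Map one detected object annotation to IBVROnto concepts with Apache Jena.
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class VideoOntoMapper {
    // Assumed namespace; the paper does not publish the ontology URI.
    private static final String NS = "http://siat.khu.ac.kr/ibvronto#";

    public static Model mapObjectAnnotation() {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("ibvr", NS);

        Resource keyFrame = model.createResource(NS + "keyframe_42_003")
                .addProperty(RDF.type, model.createResource(NS + "KeyFrame"));

        Resource object = model.createResource(NS + "object_42_003_01")
                .addProperty(RDF.type, model.createResource(NS + "Vehicle"))
                .addLiteral(model.createProperty(NS, "objectCenterX"), 212)
                .addLiteral(model.createProperty(NS, "objectCenterY"), 144)
                .addLiteral(model.createProperty(NS, "objectWidth"), 96)
                .addLiteral(model.createProperty(NS, "objectHeight"), 54)
                .addLiteral(model.createProperty(NS, "objectConfidence"), 0.91);

        keyFrame.addProperty(model.createProperty(NS, "producesObjects"), object);
        return model;
    }
}
```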

Proposed IBVROnto

In this section, we formally describe the details of the proposed ontology. We analyze and annotate large-scale real-time and offline videos using deep-learning approaches to extract spatial and temporal features. Accordingly, the proposed surveillance video ontology is composed of four top-level concepts, i.e., VideoSource, Video, Annotation, and VideoAnnotationService. All the concepts, objects, properties, and relations among objects are shown in Fig. 2.

In the cloud environment, videos are acquired from real-time sources or uploaded in the form of batches. Thus, a top-level concept called 'VideoSource' has been created, which consists of two sub-classes, i.e., 'IPCam' and 'Batch'. From the user and query perspectives, these concepts are vital. Both of these sources produce videos, which are physically persisted in HDFS.

Another top-level concept, called 'Video', has been designed. In the context of video analytics, a video can be temporally decomposed into segments, and the temporal information is extracted from these video segments. Similarly, each segment has a candidate frame, called a keyframe, from which the spatial information (low-level features) is extracted. Thus, under the 'Video' concept, two sub-classes called 'Segment' and 'KeyFrame' are created, which are linked through the 'spatioTemporalDecomposition' and 'hasKeyFrame' relationships, respectively. When video annotation services are applied, annotations are generated.



Fig. 2. IBVROnto - Proposed surveillance video retrieval and analysis ontology.

The third top-level concept is 'Annotation', as each video produces annotations. The 'Annotation' concept has two sub-classes, called 'Action' and 'Object', which are linked with the 'Segment' and 'KeyFrame' sub-classes via the object properties 'producesAction' and 'producesObjects', respectively. The 'Action' concept represents the behavior of the different objects detected in a video segment. It holds three data properties, i.e., actionType, actionGravity, and eventConfidence, which describe the action type, the severity of the action, and the action confidence in terms of probability, respectively. Similarly, the 'Object' sub-class represents the different low-level entities identified in a keyframe, and the keyframe objects indicate how they interact with each other in two-dimensional space. Examples of detected objects include a car, a bag, or a human. The 'Object' sub-class holds five data properties, i.e., 'objectCenterX', 'objectCenterY', 'objectWidth', 'objectHeight', and 'objectConfidence'. The first four are vital in the context of object interaction in 2D space, while the last one gives the confidence of an object being identified in a keyframe in terms of probability. Objects can be of different types, and we further classify them into two third-level sub-classes, called 'Living' and 'NonLiving'. Examples of the 'Living' concept are humans and animals. The 'NonLiving' concept is further classified into 'Portable' and 'Mobile' objects; the former is further classified into 'Bag', 'Luggage', and 'Generic', while the latter is further classified into 'Bike', 'Vehicle', and 'Cycle'. All the sub-classes of the 'Object' concept are linked with the keyframe using various object properties, as shown in Fig. 2.

Finally, the fourth top-level concept is called 'VideoAnnotationService', which represents the video analytics service used for spatial and/or temporal annotation extraction. 'VideoAnnotationService' is linked with 'Annotation', 'Action', and 'Object' via the object properties 'Produces', 'producesAction', and 'producesObject', respectively. Furthermore, a sub-class of 'VideoAnnotationService' named 'Algorithm' has been created, as a video annotation service is composed of different types of pipelined algorithms. For ontology development and visualization, we utilize Protégé and the VOWL plugin, respectively.

IntelliBVR Evaluation and Discussion
Experimental Setup

For testing and evaluation, we set up an indoor distributed environment called the SIAT cluster, deploying the Hortonworks Data Platform (HDP) version 3.1.0. The cluster consists of ten nodes, as shown in Fig. 3. ProSafe GSM7328S fully managed switches were used for networking. The values of the various parameters configured in the SIAT cluster are shown in Table 1.



Fig. 3. SIAT Cluster for evaluation.

Component   Variable                           Value
Kafka       BATCH_SIZE_CONFIG                  20 MB
            COMPRESSION_TYPE_CONFIG            Snappy
            Replication Factor                 3
HDFS        Block replication                  3
            Block Size                         64 MB
            Java heap size                     1 GB
HBase       hbase.rpc.timeout                  1200000
            hbase.regionserver.lease.period    1200000

Table 1. Parameter settings.
Dataset

We evaluated IntelliBVR on the UCF101 [35] dataset. The dataset contains 101 action categories; the videos in each category are divided into 25 groups, with each group containing four to seven videos. The frame distribution of the dataset is shown in Fig. 4.



Fig. 4. Frame and keyframe distributions (UCF101).

The dataset is divided into training and testing partitions of 9,537 and 3,783 videos, respectively, where the former is used for training and the latter for testing.

Performance evaluation of VSAS

We register an IPCam video stream source and an offline video stream source with the VSAS, with frame rates of 30 and 60 frames per second, respectively. In the case of the offline video stream, a video file residing on an HDD (WDC WD10EZEX) is configured with the VSAS. The VSAS sets the resolution of each acquired frame to 480 x 320 pixels, so the size of each acquired frame is 614.495 KB. The VSAS converts an acquired frame to a formal message in 6 ms; after compression (using Snappy), the size of the message becomes 140.538 KB on average, and the message is forwarded to the Broker Server in 12 ms. On average, we can acquire 34 and 54 frames per second from the IPCam and offline video stream sources, respectively. These rates are 36% and 116% higher than the preferred rate of 25 FPS for real-time video stream analytics.

Distributed Persistent Big Data Store

To evaluate the DBDP performance over HDFS, we performed experiments on the Active Data Reader and Writer. The HDFS instances are configured on the Worker Agents (Data Nodes) and the HDFS Server (Name Node), as shown in Fig. 5. Likewise, the Active and Passive Data Readers and Writers are configured on the Worker Agents. The Active Data Writer consumes the video stream from the topic on the Broker server and persists it to HDFS. The performance of the Active Data Writer, i.e., the blocks written to each data node, is shown in Fig. 5. From the results, it is clear that the Active Data Writer ensures data locality and proper data distribution.
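A rough sketch of such an Active Data Writer, consuming mini-batches from the Kafka topic and flushing them to a per-camera file on HDFS, is shown below; the topic name, consumer group, and HDFS paths are illustrative assumptions rather than the authors' code.

```java
// Active Data Writer sketch: Kafka consumer -> HDFS append of frame messages.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ActiveDataWriter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "active-data-writer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020"),
                                         new Configuration());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             FSDataOutputStream out = hdfs.create(new Path("/siat/streams/cam-01.jsonl"))) {
            consumer.subscribe(Collections.singletonList("t"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    out.write((record.value() + "\n").getBytes(StandardCharsets.UTF_8));
                }
                out.hflush();   // make the mini-batch durable on the data nodes
            }
        }
    }
}
```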



Fig. 5. Performance of Distributed Big Data Persistent (Active Data Writer).

Scalability Testing of VDPL and DVAL

We perform scalability testing of the proposed deep feature and activity annotation algorithms using Apache Spark. For this, we exploit four worker nodes. The results of both algorithms are shown in Fig. 6. From the figure, it is clear that as more machines are added, the running time decreases almost linearly.



Fig. 6. Scalability testing of deep object and activity extraction and annotation using UCF101 dataset.

Semantic video retrieval using SPARQL

We configure the IBVROnto store on SIAT's IntelliBVR server and use the SPARQL query language for search and retrieval. Space constraints prevent us from including many results; however, we query for activities such as typing, punching, walking, running, and biking. The results are visualized in Fig. 7.
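For illustration, the following sketch runs one such activity query with Jena's SPARQL engine over a locally loaded IBVROnto model; the namespace, file name, and exact property usage are assumptions consistent with the ontology described above.

```java
// Semantic retrieval sketch: find segments whose annotated action is "punch".
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class ActivityQuery {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("ibvronto-annotations.ttl");   // previously mapped annotations

        String sparql =
            "PREFIX ibvr: <http://siat.khu.ac.kr/ibvronto#>\n" +
            "SELECT ?segment ?confidence WHERE {\n" +
            "  ?segment ibvr:producesAction ?action .\n" +
            "  ?action  ibvr:actionType \"punch\" ;\n" +
            "           ibvr:eventConfidence ?confidence .\n" +
            "}";

        try (QueryExecution exec = QueryExecutionFactory.create(sparql, model)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("segment") + " -> " + row.get("confidence"));
            }
        }
    }
}
```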



Fig. 7. SPARQL queries against different types of activities.

Cite As

Alam, A., Khan, M. N., Khan, J., & Lee, Y. K. (2020, February). IntelliBVR-Intelligent large-scale video retrieval for objects and events utilizing distributed deep-learning and semantic approaches. In 2020 IEEE International Conference on Big Data and Smart Computing (BigComp) (pp. 28-35). IEEE.

Paper Link

https://ieeexplore.ieee.org/abstract/document/9070339