Monday, July 28, 2008

Trends in intelligent video analytics

Video sensory analysis is the key element in security applications, since the human observer finds it difficult to work with the increasing number of video channels without aids. New digital product and system concepts with intelligent video codecs, intelligent IP cameras and network video recorders allow the best possible coordination of system functionality with the operator and the surrounding environment. The efficiency of the security system is optimized by alarming, automatically flagging and indicating potentially risky situations with high-quality and reliable image analysis. The security personnel are thus relieved of a certain workload and can apply themselves fully to the situation displayed and make the necessary decisions.
For the trade press
Proactive security in video applications means anticipating incidents so that specific interventions can be initiated beforehand. This requires intelligent video analysis procedures that continuously examine the available camera signals for relevant objects, without symptoms of fatigue, and relieve the observer of the flood of information. An experiment in the USA showed that a human observer watching two monitors with automatic image switching overlooked up to 45 percent of all activity in the scenes after 12 minutes. 22 minutes later, the figure had already risen to as much as 95 percent. The demand for intelligent image analysis in video-based security systems is therefore growing continuously. A recent study by the English market research company IMS Research predicts market growth from approx. 100 million US dollars to more than 800 million US dollars in 2010. This demand pull, together with the trend toward completely digital video systems, requires new product and solution concepts from manufacturers.

Surrounding environment and process chain of video

In the still-prevalent analog technology, the video sensor is an independent device connected upstream of the video matrix. This sensor unit analyzes the available video signal and routinely superimposes the analysis results on the image, for example as graphical frames. When the alarm conditions are fulfilled, a corresponding alarm signal (contact, serial telegram) is transmitted to the video matrix. The video matrix thus forms the central control unit, which handles alarm processing in addition to administering the system topology. In the case of an alarm, the corresponding image is called up and, if required, video recording is started.

With the ongoing paradigm shift in video security from analog to digital technology, this process chain is now changing.
In the digital world of video over IP, device concepts that used to be implemented separately are increasingly merging into so-called intelligent video codecs (Siemens Sistore CX line). The distinguishing feature of this device type is that it digitizes the incoming video signals in real time and ideally compresses them into the MPEG-4 format. Based on the digitized data, sensory analysis (image analysis), storage (locally on an internal hard disk or on the network recorder) and further distribution (video streaming over the network) are integrated in a single device. Because these three disciplines are integrated in one unit, the available processing performance of the device can be matched to the particular application by "dosing" the individual disciplines accordingly (configuring the image rate and image resolution for each discipline), which in turn yields an ideal price/performance ratio.

As this change progresses, digital technology will be integrated at ever higher levels and shifted further into the field level. IP cameras are therefore becoming increasingly powerful and able to perform ever more complex image processing routines. The camera thus becomes a highly integrated video sensor. Furthermore, network video recorders (NVRs) are becoming increasingly capable: they can not only carry out intelligent searches in stored image data but also evaluate the data streams of IP cameras in real time (NOOSE – Network of optical Sensors).

Structure of video sensors

In line with the digital change, the inner construction of the sensors is also changing. Where sensory analysis has so far relied on special hardware configurations (PLCs such as FPGAs can only carry out simple arithmetic operations and are complicated to program), there are now powerful digital signal processors (DSPs) that can execute considerably more demanding software algorithms.
These DSPs today form the processing core of intelligent video codecs. This core shares the work with an embedded host processor that handles general administration, storage and networking tasks. Today, such a hybrid architecture can be configured and manufactured as a very compact solution.

Besides this, there are also PC-based video sensor systems that perform safety-relevant tasks and carry out the image processing via so-called frame grabber cards or the IP streaming signal. With PC clock speeds of 3 GHz and more, it is now possible to achieve very good image processing performance with PCs. Particularly for highly specialized algorithms or in the university sector, this platform allows quick implementations. Unfortunately, the life cycles of PC-based systems are rather short, so the availability and serviceability required in the video product and system business prove to be problematic.

This situation reinforces the trend toward highly integrated video codecs which, with specialized signal processors and a structure optimized for media applications (low clock speeds of ~300 MHz up to 1 GHz), achieve at least the same processing power as PCs. An essential advantage is the typically low power loss of a DSP-based codec system, <10 W/channel, compared with a PC-based system (50-100 W/channel), since at the same image resolution, image rate and processing complexity a PC can rarely process more than 4-8 channels. As for the further evolution of signal processors, it is predicted that, apart from the hardware acceleration of MPEG video compression already realized in some processors, chips with prefabricated image analysis algorithms will also become available in future.
Applications for video sensory analysis

For the video sensors available now and in the near future, there is no universal image processing algorithm that covers the entire range of video analysis applications (for example, character recognition, face recognition, object tracking, smoke detection). Nonetheless, good results are achieved if the areas of application are defined and delimited before the video sensors are developed. The scene to be observed is typically described by a set of basic assumptions. For security applications, it is always assumed that the cameras are permanently installed, either in a preset position or with pan-tilt-zoom. Application scenarios can be divided roughly into inanimate and animated scenes.

Inanimate scenes involve conventional enclosure, open space or facade monitoring for perimeter protection. The general assumption here is that events are statistically rare, that the scene is well known and that the object to be detected behaves as "uncooperatively" as possible and camouflages itself. It is especially important in this application to avoid unwanted alarms, since frequent false alarms reduce confidence in the system and hence jeopardize security as a whole. Such applications are encountered in penitentiaries, power plants, refineries or industrial premises and must fulfill the following requirements:
Detection of moving objects in front of a familiar background
High sensitivity for the detection of camouflaged objects
Quick detection (<1s) of an alarm situation
Distinction between objects by means of object size and speed
Classification of objects by means of the movement pattern (insects in front of the camera, birds, loitering)
Detection of attempts to sabotage the camera (defocusing, rotating, spraying, covering)
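The requirement to distinguish objects by size and speed can be illustrated with a minimal Python sketch. It is not the method of any particular product: the `Track` structure, the function names and the threshold values are hypothetical, and the sketch assumes that an upstream tracker already delivers a bounding-box area and one centroid per frame at 25 images/s.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple


@dataclass
class Track:
    """Hypothetical tracker output: object area and one centroid per frame."""
    area_px: int
    centroids: List[Tuple[float, float]]


def mean_speed(track: Track, fps: float = 25.0) -> float:
    """Average speed along the trajectory in pixels per second."""
    steps = list(zip(track.centroids, track.centroids[1:]))
    if not steps:
        return 0.0
    distance = sum(hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in steps)
    return distance / len(steps) * fps


def is_alarm_candidate(track: Track,
                       min_area_px: int = 200,
                       max_speed_px_s: float = 150.0,
                       fps: float = 25.0) -> bool:
    """Keep objects that are large enough and not implausibly fast.

    The thresholds are illustrative assumptions, not values from the article.
    """
    return (track.area_px >= min_area_px
            and mean_speed(track, fps) <= max_speed_px_s)
```

With these assumed thresholds, a large object moving two pixels per frame passes the filter, while a tiny object (for example an insect close to the lens) or a very fast one (for example a bird crossing the image) is rejected before it can raise an alarm.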
In perimeter protection applications, industrial estates and public buildings increasingly tend to be designed more openly. We can observe a trend away from static facilities such as fences and toward flexible methods of monitoring building exteriors, facades, windows, entries and exits.

In animated scenes, such as on tracks or in visitor halls, the scene may be well known but is permanently masked by moving objects. Here it is necessary either to reliably detect a change in the background (objects left behind, objects removed without authorization) or to extract key statistical data or behavior patterns from the mass of objects moving in the foreground (density of people, people counts, people's behavior).

Pan-Tilt-Zoom cameras

Normally, video sensory analysis for Pan-Tilt-Zoom (PTZ) cameras is used to support the operator with automatic camera control rather than to trigger alarms. In this application, objects that are either manually selected by the operator (click and track) or detected by a second sensor camera are tracked automatically. In the latter case, the coordinates of the objects are passed to the upstream "tracking sensory analysis" of the PTZ camera for further object tracking. Because the operator no longer has to call up the particular PTZ or dome camera and track the object manually, monitoring becomes more efficient. The operator still needs to choose an object or to acknowledge the tracking. If additional PTZ cameras are positioned appropriately, the object is automatically handed over to the next camera for further tracking as soon as it leaves a camera's field of view.

However, using PTZ cameras in combination with digital technology opens up completely new detection applications. Using its pivoting range, a PTZ camera can scan the entire surroundings and merge the result into a single picture, a kind of panoramic image.
Again, by periodically repeating this procedure it is possible to detect image variations and track objects.

Real-time analysis or archive search

For some video-based applications, it is not possible to define alarm criteria beforehand. In warehouse monitoring, for example, processes are digitally recorded over a long period of time. Using an off-line video sensor function, the stored video data is then analyzed to search for stolen goods. Here too, efficiency is the user's primary objective. The user still needs to define the type and the area of the analysis. Sensory analysis then detects all relevant events automatically, so that reviewing long video sequences is avoided.

The choice between using sensory analysis for real-time events or for subsequent off-line searching in video archives mainly affects how fast the algorithms have to run. In real-time operation, processing must be completed within 40 ms per image in order to keep up with the incoming stream of video data. When searching video archives, this limit does not apply. Instead, the procedures are expected to operate distinctly faster than analysis by hand. With Sistore CX, for example, smart search functions are executed approx. 50 times faster than manual searches.

In future, even more effective options for searching image archives will be available. During image recording, the images are preprocessed in real time and the analysis data is archived together with the image data in the form of metadata (for example, in MPEG-7 format). The relevant image contents are thereby indexed. Intelligent IP cameras could add meta information (movement vectors, shape, color) for all detected objects to the video data online.
A subsequent search for events or objects is then reduced to metadata filtering (MPEG-7), without having to perform an extensive analysis.

Pattern recognition and verification

In addition to the classic use of video sensors for security applications, there are a number of applications from the automation area. Examples include pattern recognition (object characteristics such as number plates, dangerous goods signs, container inscriptions) or image analysis procedures in the area of biometrics (face localization; face, iris or fingerprint recognition). In contrast to conventional security applications, the object to be detected is assumed to be cooperative. In such applications, video verification usually triggers an action in favor of the object, for example when an automobile with authorized plates approaches.

Functionality of video sensors

Today, advanced sensors such as Sistore CX are able to process images in real time with pixel accuracy. Typically, work is performed at so-called CIF resolution (Common Intermediate Format) of 352x288 pixels. At this resolution, more than 100,000 pixels are processed in real time (within 40 ms per image). Depending on the available processing power and the desired image rate for the evaluation, higher-resolution images can also be processed. In the context of the digital change, megapixel resolution is increasingly used.

Until now, video sensors have typically operated according to the principle of differential image processing. With this procedure, the gray tones of two sequential video images are subtracted from each other, which removes all static image portions (background) from the image. All moving objects generate a measurable difference, predominantly at the outer edges of the objects. The resulting "signal strength" is highly dependent on the image rate of the evaluation (~25 images/s) and the speed of the object. The faster the object moves, the higher the signal strength of the change.
To detect slow objects as well, the sensor must be set to a greater sensitivity, which makes it easier for disturbances or small, fast objects to trigger a false alarm.

Procedures such as those used in Sistore CX EDS no longer rely solely on the difference between two images. The greater capability of video processors makes it possible to use statistical procedures, which analyze the complete image sequence frame by frame over a longer period of time. Through this analysis, the sensor gains an impression of the "normal status" of the scene and adapts to the background. What is determined is no longer the image-to-image difference but the difference between the current image and the background. Foreground objects are thereby extracted as a compact whole and no longer only as object borders. This approach offers a series of advantages in comparison to the differential image procedures:
A basic sensitivity for the respective scene is still configured. However, for the object detection the algorithm determines an individual, optimal threshold value for each individual pixel. With this, a reliable detection is possible, independent of the contrast conditions and the brightness distribution over the whole image, also in the range of brightness transitions.
The statistical analysis proceeds continuously during operation, i.e. the algorithm permanently optimizes its working point and adapts to the respective scene conditions (lighting changes, automobile headlights, lightning, weather changes) and camera or signal noise is automatically compensated.
The sensor sensitivity is independent of the object speed. Objects are always detected compactly as a whole, not just at the object border, so the basic sensitivity can be set lower and the sensor adapts to the contrast conditions prevailing in the scene.
This results in a reliable detection of objects with very few false alarms.
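The contrast between differential image processing and the statistical background approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not disclose the algorithm used in Sistore CX, so the running mean/variance model below is a common textbook stand-in, and the names `frame_difference`, `BackgroundModel`, `alpha`, `k` and `min_std` are hypothetical. Images are represented as flat lists of 8-bit gray values.

```python
from typing import List


def frame_difference(prev: List[int], curr: List[int],
                     threshold: int) -> List[bool]:
    """Differential image: flag pixels that changed between two frames."""
    return [abs(c - p) > threshold for p, c in zip(prev, curr)]


class BackgroundModel:
    """Per-pixel running mean and variance (an assumed, simplified model)."""

    def __init__(self, first_frame: List[int], alpha: float = 0.05,
                 k: float = 3.0, min_std: float = 2.0) -> None:
        self.mean = [float(v) for v in first_frame]
        self.var = [min_std ** 2] * len(first_frame)
        self.alpha = alpha      # adaptation speed for background changes
        self.k = k              # basic sensitivity, in standard deviations
        self.min_std = min_std  # noise floor, avoids a zero threshold

    def apply(self, frame: List[int]) -> List[bool]:
        """Return the foreground mask and adapt the background statistics."""
        mask = []
        for i, value in enumerate(frame):
            diff = value - self.mean[i]
            std = max(self.var[i] ** 0.5, self.min_std)
            is_foreground = abs(diff) > self.k * std  # per-pixel threshold
            mask.append(is_foreground)
            if not is_foreground:
                # Only background pixels update the "normal status".
                self.mean[i] += self.alpha * diff
                self.var[i] += self.alpha * (diff * diff - self.var[i])
        return mask


def foreground_indices(mask: List[bool]) -> List[int]:
    """Positions of the pixels flagged in a mask."""
    return [i for i, flagged in enumerate(mask) if flagged]
```

For a bright object three pixels wide that moves one pixel between two frames, `frame_difference` flags only the trailing and leading edge pixels, whereas `BackgroundModel.apply` flags all three object pixels once the model has seen a few empty frames. Updating only background pixels is a design choice of this sketch; a real sensor would also need a policy for absorbing objects that stay in the scene permanently.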
In comparison to differential image procedures, this type of algorithmic process offers more application options such as motion detection and object tracking, detection of objects left behind or removed, and detection of sabotage attempts. Furthermore, typical movement patterns of objects (loitering or grouping of persons, insects, birds or flying leaves in front of the camera, snow or strong rain, classification of objects, for example, into vehicles and persons) can be derived from the object tracking information (trajectory). Overall, these processes offer a very robust and reliable type of object detection with video sensory analysis in the professional security sector.

Configuration

Not only has the functioning improved; the configuration, that is, the adjustment of the video sensor, has also been greatly simplified. It has been possible to greatly reduce the steps for modeling the scene geometry and the number of necessary basic parameters, resulting in increased detection reliability. Thus the perspective conditions of an area in horizontal and vertical planes are modeled for open spaces and facades (figure). In addition, sensor elements such as static and dynamic virtual trip wires can be placed freely in the image (figure). As basic parameters, only a few easily understandable values are needed to fine-tune the algorithm to the respective scene (adaptation speed for background changes, basic sensor sensitivity, minimum object size to be detected, maximum allowed object speed).

Disturbance impacts and undesired signals

Despite all the progress in video technology, there is no sensor system without unwanted signals. It is important that the user is aware of this and prepares for it accordingly. With video analysis, as with visual evaluation of an image, the detection rate for real alarms and the false alarm rate always trade off against each other.
The higher the sensitivity or tolerance demanded for incomplete or fast-moving objects in less than ideal surroundings, the greater the chance of unwanted messages caused by similarity with other objects or visually similar effects. A given level of detection reliability always entails a minimum false alarm rate. Essentially, video sensory analysis can only function optimally if the camera is operated in the "linear range" of the image sensor. That requires scene lighting that introduces neither overloading (glare) nor underexposure in the image. Both cases increase the likelihood of an unwanted message. Even the latest processes are subject to the fundamental laws of optics. Some disturbances can, however, be suppressed considerably better today through the statistical analysis of image sequences. These include global changes of light, light beams and cast shadows, cloud drift, image noise as well as rain or snow.

Outlook and further development

The trends of integration and miniaturization will continue. Soon intelligent cameras will come into use that offer processing power for video sensory analysis similar to what only intelligent video codecs offer today. In future, we can expect CMOS image sensors that are able to preprocess the video data in parallel with image formation.

A network of sensors with three-dimensional detection capabilities places new requirements on management systems. The object coordinates can, for example, be superimposed on site plans as dynamic icons, so that the movement pattern of an object can easily be followed on a site plan. Similar objects are displayed on the site plan with the same icons. For example, objects unambiguously classified as "good" are shown in green, while alarm-relevant objects classified as "bad" are shown in red.
Critical objects are shown in yellow before they trigger the alarm.

In the current advanced state of algorithmics, objects are still defined through accumulation and clustering of individually detected "foreground pixels". The next challenge is to take the neighborhood relationships of the object's pixels into account in video sensory analysis. For this, the analysis of each pixel covers not only the isolated pixel itself but also the surrounding pixels. From this neighborhood, relative structural characteristics can be derived using image processing routines, and a number of these characteristics are combined into so-called "characteristic vectors". Thus, instead of one 8-bit gray tone per pixel, a complete set of values must be evaluated (9, 25 or even more, depending on the neighborhood considered), which greatly increases the processing power required of the sensor.

On the basis of these characteristic vectors, typical objects such as persons, dogs, cars or cyclists can already be learned as a whole in the development phase. During in-field analysis with the sensor, the algorithm assigns every pixel to the classes of the learned objects and can directly identify the objects learned earlier. In this way, static objects or objects in still images can be identified and overlapping objects can be separated. Even shadows cast by objects do not affect the analysis.

Today, these and similar new technologies are being researched in the academic field, for example for counting persons in highly animated scenes where objects strongly overlap. The aim is to move stepwise from pure pixel processing toward picture understanding.

At Building Technologies, 1,400 persons work on the research and development of innovative products. They are able to exchange information with 60,000 experts from 30 countries within the company.
The main focus of innovation is found in the classical areas of control, regulation, sensory analysis and actuator technology and is always supplemented by system engineering, communications technology and human-machine interface technology for servicing and observing total building automation solutions.

(Case 1)
Advantages of intelligent codecs:
Integration of 3 device types in a single unit:
Local storage of the video data - DVR function
High component density of 2-4 channels per 19" height unit
Low power consumption, <10 W / channel (+ HDD power consumption)
Very good scalability – 1/4/8 channel units
Central and decentralized concepts realizable over LAN
Easy maintenance by operation software on flash ROM
High IT security via embedded OS (operating system)
High operation reliability with temperature management, protection from corrosion
High fail-safe characteristic
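
The power and density figures quoted in the article can be turned into a rough sizing estimate. The following Python sketch only restates the article's numbers (<10 W/channel for a DSP-based codec excluding HDD power, 50-100 W/channel for a PC-based system, 2-4 channels per 19" height unit); the function names are hypothetical and the results are bounds, not measurements.

```python
from math import ceil
from typing import Tuple

DSP_MAX_W_PER_CHANNEL = 10         # "<10 W / channel", HDD power not included
PC_W_PER_CHANNEL = (50, 100)       # range quoted for PC-based systems
CHANNELS_PER_HEIGHT_UNIT = (2, 4)  # codec density per 19" height unit


def dsp_power_upper_bound_w(channels: int) -> int:
    """Upper bound on codec power loss for the given number of channels."""
    return channels * DSP_MAX_W_PER_CHANNEL


def pc_power_range_w(channels: int) -> Tuple[int, int]:
    """Lowest and highest expected power loss of a PC-based system."""
    low, high = PC_W_PER_CHANNEL
    return channels * low, channels * high


def rack_height_units(channels: int) -> Tuple[int, int]:
    """Fewest and most 19" height units needed for the codecs."""
    low_density, high_density = CHANNELS_PER_HEIGHT_UNIT
    return ceil(channels / high_density), ceil(channels / low_density)
```

For a 16-channel installation, for example, these figures give less than 160 W for the codecs (plus hard disks) against 800 to 1600 W for PCs, and between 4 and 8 height units of rack space.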

1 comment:

Luigi Piero said...

Excellent article. Thanks