Behaviour recognition technology using multimodal AI
Application example for COVID-19 measures
What is behaviour recognition technology that utilizes multimodal AI?
The word “modal” refers to the type of information fed to an AI (video, sound, text, etc.), and “multimodal” AI refers to AI that combines different types of input information. Behaviour recognition is a technology for detecting and understanding human behaviour from the input modals; it can be used in various business situations, such as detecting dangerous behaviour and issuing an alert. Images are the main input, and advanced AI technology has made it possible to extract information such as the positions of objects appearing in images and the positions of human skeletons. NTT DATA treats this extracted information as additional modals and feeds it to multimodal AI, combined with modals obtained in other formats such as sound, in order to develop behaviour recognition technology that grasps behaviour in more detail: “when”, “where”, and “what”.
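The idea of treating separately extracted information as modals and combining them can be sketched as a simple late-fusion step. The feature names and dimensions below are illustrative assumptions, not NTT DATA's actual implementation; the downstream behaviour classifier is omitted.

```python
from dataclasses import dataclass

# Hypothetical feature vectors produced by upstream single-modal models.
@dataclass
class ModalFeatures:
    image: list      # appearance features from the video frame
    skeleton: list   # human pose keypoint coordinates
    audio: list      # e.g. sound-level features

def fuse(modal: ModalFeatures) -> list:
    """Late fusion by concatenation: the combined vector would be fed
    to a downstream behaviour classifier (not shown)."""
    return modal.image + modal.skeleton + modal.audio

features = ModalFeatures(image=[0.2, 0.7], skeleton=[0.1, 0.9, 0.4], audio=[0.3])
fused = fuse(features)
print(len(fused))  # 6
```

Concatenation is only one fusion strategy; attention-based or per-modal-weighted fusion is common when the modals differ greatly in reliability.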
COVID-19 measures utilizing multimodal AI technology
One measure against COVID-19 in offices is the cleaning of shared items in conference rooms and operation rooms. Because cleaning must be performed immediately after each use, it is ideally done by the users themselves rather than by dedicated cleaning staff, which can result in insufficient cleaning. To solve this problem, a behaviour recognition method was developed that automatically determines whether items have been cleaned, by detecting used and cleaned items in video and issuing alerts to users and office managers. Detecting cleaning omissions requires not only information on what a person is doing but also information on which items are used. To achieve this, a two-step technology was developed: in the first step, AI that takes images as input extracts the positions of users and items; in the second step, multimodal AI that takes this position information together with the images grasps the relationships between users and items. The images below show the actual detection of a used item that was not cleaned. Each framed region is a detected user or item, and a line connecting a user and an item indicates the detected relationship, i.e. what action is performed on which item. A used item is framed in red, while a cleaned item is framed in green. In this example, the chair's frame remains red, meaning it was not cleaned, so a cleaning omission was detected.
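Once the per-frame interactions between users and items have been recognized, the omission check itself reduces to tracking each item's state. The sketch below assumes hypothetical action labels (`"use"`, `"wipe"`) and item ids standing in for the recognized user-item relationships; the upstream detection models are not shown.

```python
from enum import Enum

class ItemState(Enum):
    CLEAN = "green"  # cleaned item: framed in green
    USED = "red"     # used but not yet cleaned: framed in red

# Hypothetical recognized interactions, in time order: (action, item id).
events = [
    ("use", "chair"),
    ("use", "desk"),
    ("wipe", "desk"),  # the desk is cleaned; the chair is not
]

def detect_omissions(events):
    """Track each item's state; any item still USED when the room is
    vacated is a cleaning omission that should trigger an alert."""
    state = {}
    for action, item in events:
        if action == "use":
            state[item] = ItemState.USED
        elif action == "wipe":
            state[item] = ItemState.CLEAN
    return [item for item, s in state.items() if s is ItemState.USED]

print(detect_omissions(events))  # ['chair']
```

This mirrors the example in the text: the chair's frame stays red because no cleaning action was ever associated with it.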
As another example of behaviour recognition technology utilizing multimodal AI, beyond the scenario above, it is possible to detect public nuisance acts by identifying a person talking loudly in public from a combination of the video and sound modals.
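Combining the two modals for this scenario can be sketched as follows. The threshold value, person ids, and per-window structure are illustrative assumptions; in practice the video modal would supply detected speakers (e.g. people whose mouths are moving) and the sound modal a measured level per time window.

```python
# Hypothetical threshold above which speech counts as "loud" (assumed value).
LOUDNESS_THRESHOLD_DB = 75.0

def flag_loud_talkers(windows):
    """Cross-reference the modals per time window: flag people detected as
    speaking (video modal) while the sound level (audio modal) is high."""
    flagged = set()
    for speakers, level_db in windows:
        if level_db > LOUDNESS_THRESHOLD_DB:
            flagged.update(speakers)
    return sorted(flagged)

windows = [
    (["person_1"], 60.0),             # quiet conversation: ignored
    (["person_2"], 82.0),             # loud speech: flagged
    (["person_2", "person_3"], 78.0),
]
print(flag_loud_talkers(windows))  # ['person_2', 'person_3']
```

Neither modal alone suffices here: audio detects loudness but not who is speaking, while video identifies people but not volume, which is why the combination is needed.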
NTT DATA is establishing a method for selecting modals according to the behaviour to be recognized, whose requirements differ for each operation, and a method for constructing multimodal AI from those inputs; it is also developing a methodology for the surrounding system, including the devices and terminals needed for actual AI operation. Going forward, NTT DATA will continue to develop technology that meets the various needs of business operations and responds to them swiftly.