A new attack framework aims to infer keystrokes typed by a target user at the opposite end of a video conference call by simply leveraging the video feed to correlate observable body movements to the text being typed.
The research was undertaken by Mohd Sabra and Murtuza Jadliwala from the University of Texas at San Antonio and Anindya Maiti from the University of Oklahoma, who say the attack can be extended beyond live video feeds to those streamed on YouTube and Twitch as long as a webcam’s field-of-view captures the target user’s visible upper body movements.
“With the recent ubiquity of video capturing hardware embedded in many consumer electronics, such as smartphones, tablets, and laptops, the threat of information leakage through visual channel[s] has amplified,” the researchers said. “The adversary’s goal is to utilize the observable upper body movements across all the recorded frames to infer the private text typed by the target.”
To achieve this, the recorded video is fed into a video-based keystroke inference framework that goes through three stages:
- Pre-processing, where the background is removed and the video is converted to grayscale, after which the left and right arm regions are segmented relative to the individual’s face, detected via a model dubbed FaceBoxes
- Keystroke detection, which takes the segmented arm frames and computes the structural similarity index measure (SSIM) to quantify body movements between consecutive frames in each of the left and right side video segments and to identify potential frames where keystrokes happened
- Word prediction, where motion features before and after each detected keystroke are extracted from the keystroke frame segments and used to infer specific words with a dictionary-based prediction algorithm
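
To make the keystroke detection stage more concrete, here is a minimal Python sketch of the general idea: compare consecutive grayscale frames with SSIM and flag the frames where similarity drops. It is not the researchers' implementation. It assumes each frame is a flat list of pixel intensities for one arm region, computes SSIM over a single global window instead of the sliding windows typically used, and the function names and the 0.9 threshold are hypothetical choices made for illustration.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM between two equal-sized grayscale frames.

    Simplification: the whole frame is treated as one window; production
    SSIM implementations usually average over many local windows.
    """
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    n = len(a)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


def candidate_keystroke_frames(frames, threshold=0.9):
    """Return indices i where frame i differs noticeably from frame i - 1.

    `threshold` is a hypothetical cut-off: a lower SSIM means more movement
    between the two frames.
    """
    return [
        i
        for i in range(1, len(frames))
        if global_ssim(frames[i - 1], frames[i]) < threshold
    ]


# Toy 2x2 "arm region" frames: the arm is still, moves, stays, moves back.
still = [0, 0, 255, 255]
moved = [255, 255, 0, 0]
frames = [still, still, moved, moved, still]
candidates = candidate_keystroke_frames(frames)
```

In this toy sequence the flagged indices are the two frames where the region changes. A real pipeline would run the same comparison separately on the left and right arm segments and then pass the flagged frames on to the word prediction stage.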