Person Segmentation in Teams

This poster presents the architectural foundations of person segmentation in Microsoft Teams. High-fidelity person segmentation in a real-time communication setting requires balancing strict computational efficiency against temporal robustness. We detail three core innovations that enable high-quality results across diverse hardware: (a) Latency-Constrained DNAS: We use Differentiable Neural Architecture Search (DNAS) optimized under hardware-aware latency constraints. This allows the discovery of lightweight backbones that maximize intersection over union (IoU) while adhering to strict millisecond-per-frame budgets. (b) On-the-Fly Scene Adaptation: To handle the variability of user environments, we introduce a mapping model designed to update model parameters dynamically. This lets the network adapt on the fly to scene conditions such as lighting, significantly reducing temporal flickering and segmentation artifacts. (c) Deep Guided Filtering: To achieve full-resolution segmentation without the cost of a high-resolution encoder-decoder, we employ Deep Guided Filtering. This module takes a low-resolution initial mask and uses the high-resolution input as a guide to recover sharp edges and fine details (such as hair strands) in real time.
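
To make the idea in (c) concrete, here is a minimal sketch of guided upsampling. It is not the Deep Guided Filtering module described above, which is a learned component; it is the classic, non-learned guided filter in its fast form, where the local linear coefficients are estimated at low resolution and then applied to the high-resolution guide. The function names `box_mean` and `guided_upsample`, the grayscale guide, the nearest-neighbour coefficient upsampling, and the parameters `scale`, `r`, and `eps` are all assumptions made for illustration.

```python
import numpy as np


def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, using edge-replicated padding."""
    k = 2 * r + 1
    h, w = x.shape
    p = np.pad(x.astype(np.float64), r, mode="edge")
    # Integral image with a leading row and column of zeros.
    s = np.zeros((p.shape[0] + 1, p.shape[1] + 1), dtype=np.float64)
    s[1:, 1:] = p.cumsum(axis=0).cumsum(axis=1)
    total = s[k:k + h, k:k + w] - s[:h, k:k + w] - s[k:k + h, :w] + s[:h, :w]
    return total / (k * k)


def guided_upsample(guide_hr, mask_lr, scale, r=2, eps=1e-3):
    """Upsample a low-res mask using a high-res grayscale guide in [0, 1].

    Hypothetical helper: fits mask ~ a * guide + b in local windows at low
    resolution, then applies the upsampled (a, b) to the high-res guide.
    """
    h, w = mask_lr.shape
    assert guide_hr.shape == (h * scale, w * scale), "guide must be scale x mask"

    # Low-resolution view of the guide (simple subsampling, an assumption).
    guide_lr = guide_hr[::scale, ::scale].astype(np.float64)
    mask_lr = mask_lr.astype(np.float64)

    # Local statistics for the per-window linear model.
    mean_i = box_mean(guide_lr, r)
    mean_p = box_mean(mask_lr, r)
    cov_ip = box_mean(guide_lr * mask_lr, r) - mean_i * mean_p
    var_i = box_mean(guide_lr * guide_lr, r) - mean_i * mean_i

    a = cov_ip / (var_i + eps)   # eps regularises flat regions
    b = mean_p - a * mean_i

    # Average the coefficients of all windows covering each pixel.
    mean_a = box_mean(a, r)
    mean_b = box_mean(b, r)

    # Nearest-neighbour upsampling of the coefficients to full resolution.
    a_hr = np.repeat(np.repeat(mean_a, scale, axis=0), scale, axis=1)
    b_hr = np.repeat(np.repeat(mean_b, scale, axis=0), scale, axis=1)

    # The high-res guide supplies the edge detail.
    return np.clip(a_hr * guide_hr + b_hr, 0.0, 1.0)
```

A call such as `guided_upsample(gray_frame, coarse_mask, scale=2)` returns a mask at the guide's resolution whose boundaries follow intensity edges in the guide. The design point this illustrates is that the expensive network only has to run at low resolution, while the edge recovery costs a few box filters and one multiply-add per pixel.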