Unifying Automatic and Interactive Matting with Pretrained ViTs

Abstract

Automatic and interactive matting largely improve image matting by respectively alleviating the need for auxiliary input and enabling object selection. Due to different settings on whether prompts exist, they either suffer from weakness in instance completeness or region details. Also, when dealing with different scenarios, directly switching between the two matting models introduces inconvenience and higher workload. Therefore, we wonder whether we can alleviate the limitations of both settings while achieving unification to facilitate more convenient use. Our key idea is to offer saliency guidance for automatic mode to enable its attention to detailed regions, and also refine the instance completeness in interactive mode by replacing the binary mask guidance with a more probabilistic form. With different guidance for each mode, we can achieve unification through adaptable guidance, defined as saliency information in automatic mode and user cue for interactive one. It is instantiated as candidate feature in our method, an automatic switch for class token in pretrained ViTs and average feature of user prompts, controlled by the existence of user prompts. Then we use the candidate feature to generate a probabilistic similarity map as the guidance to alleviate the over-reliance on binary mask. Extensive experiments show that our method can adapt well to both automatic and interactive scenarios with more light-weight framework.

Publication
In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)