Effective editing of personal content plays a pivotal role in enabling individuals to express their creativity, weave captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap arbitrary objects in an image with personalized concepts given by a reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control over arbitrary objects and parts rather than only the main subject, (2) more faithful preservation of context pixels, and (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping, which applies region control over latent feature maps and swaps masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation, which seamlessly adapts the semantic concept to the source image in terms of target location, shape, style, and content during the image generation process. Extensive results on both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows precise and faithful swapping abilities across single-object, multi-object, partial-object, and cross-domain swapping tasks.
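To make the targeted variable swapping idea concrete, below is a minimal PyTorch-style sketch (not the authors' released implementation; the function and argument names are placeholders). A binary object mask blends a U-Net variable recorded while inverting the source image with the corresponding variable from the current generation step, so that context pixels stay anchored to the source while the masked region follows the personalized concept.

```python
import torch

def targeted_variable_swap(src_var: torch.Tensor,
                           gen_var: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Blend a U-Net variable (latent feature, attention map, or attention
    output) from the inverted source trajectory with the one being generated.

    src_var: variable recorded during source-image inversion, shape (B, C, H, W).
    gen_var: the corresponding variable at the current denoising step.
    mask:    binary object mask resized to (H, W); 1 marks the swap region.
    """
    m = mask.to(gen_var.dtype).reshape(1, 1, *mask.shape[-2:])
    # inside the mask: keep the generated (concept) variable;
    # outside the mask: restore the source variable to preserve context pixels.
    return m * gen_var + (1.0 - m) * src_var
```

In practice such a blend would be applied only at selected denoising steps and U-Net layers; which variables to swap, and when, is a design choice discussed in the paper.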
Figure 2. Overview of SwapAnything for swapping an object in a source image ($I_{src}$) with a personalized concept ($<{*}>$) to obtain the target image ($I_{target}$). The personalized concept is first converted into the textual embedding space to serve as the concept appearance. Meanwhile, the source image is inverted into initial noise to obtain U-Net variables (including latent features, attention maps, and attention outputs). Targeted variable swapping preserves the context pixels of the source image, and the appearance adaptation process then utilizes these informative variables to integrate the concept into the target image.
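As a rough illustration of how the pieces in Figure 2 fit together, here is a hypothetical generation loop. The names `inverted_vars`, `denoise_step`, and `concept_emb` are assumptions standing in for the stored inversion trajectory, one diffusion update (with appearance adaptation applied inside it), and the textual embedding of $<{*}>$; none of them are actual APIs from the released code.

```python
import torch

@torch.no_grad()
def generate_target(inverted_vars, concept_emb, mask, denoise_step, num_steps=50):
    """Sketch of the generation loop under stated assumptions:
    - inverted_vars[i] is the source latent at noise level i
      (inverted_vars[num_steps] is pure noise, inverted_vars[0] the clean latent),
    - denoise_step(z, i, cond) performs one denoising update conditioned on the
      concept embedding and applies appearance adaptation internally,
    - targeted_variable_swap is the masked blend sketched earlier.
    """
    z = inverted_vars[num_steps]                  # start from the inverted noise
    for i in reversed(range(num_steps)):
        z = denoise_step(z, i, concept_emb)       # generation guided by <*>
        z = targeted_variable_swap(inverted_vars[i], z, mask)  # keep context pixels
    return z
```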
Figure 3. Comparison on single-object swapping with baseline methods. SS means Subject Swapping, BP means Background Preservation, SG means Subject Gesture, and OQ means Overall Quality. Please zoom in for a clearer view.
Unlike SwapAnything, DALL-E in ChatGPT can only perform text-based (not personalized) editing, and it cannot edit real images. In other words, users can only edit images created by DALL-E itself earlier in the conversation. We therefore conduct the comparison by generating a source image with DALL-E itself.
We show the human preference between results generated by our method and the baseline methods. SS means Subject Swapping, BP means Background Preservation, SG means Subject Gesture, and OQ means Overall Quality. For the baseline methods, PS means Photoswap; MC means MasaCtrl; BP means BlipDiffusion; DE means DreamEdit; CP means CopyPaste.
@inproceedings{gu2024swapanything,
  title={SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing},
  author={Jing Gu and Nanxuan Zhao and Wei Xiong and Qing Liu and Zhifei Zhang and He Zhang and Jianming Zhang and HyunJoon Jung and Yilin Wang and Xin Eric Wang},
  booktitle={ECCV},
  year={2024}
}