📖 Progress+gpt4o-img-gen

Survey

Safety at Scale: A Comprehensive Survey of Large Model Safety

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, …

ArXiv:2502.05206

Submitted on 2025/03

Intellectual Property Protection

slide 4

Paper

MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning

Yixin Liu, Chenrui Fan, Yutong Dai, Xun Chen, Pan Zhou, Lichao Sun

2024 CVPR

Problem:

  • Text-to-image models can create harmful content, risking individual safety.

  • Existing methods aim to make images “unlearnable” but have limitations.

Proposed Solution: MetaCloak

Results:

  • MetaCloak enhances image resistance to transformations (flipping, cropping, compression) by using surrogate diffusion models to craft transferable perturbations and a denoising-error maximization loss for better robustness.

Introduction

Existing data protections are fragile: they show limited robustness against minor data transformations such as filtering.

MetaCloak is designed as a more effective and robust data protection scheme that prevents unauthorized subject-driven text-to-image diffusion-based synthesis even when the data is transformed.

slide 6

Problem Statement

The user’s (image protector’s) objective is to protect their image set Xc.

The user injects small perturbations into the images x ∈ Xc to craft a poisoned image set Xp.

The model trainers will collect and use Xp to fine-tune a text-to-image generator x̂θ, in order to obtain the optimal parameters θ*.

The overall goal:

(4) Maximize the perturbation's effect: find the perturbation that most effectively confuses or hides the true content of the images, so that the generative model cannot correctly identify them.

(5) Minimize image recognition capability: adjust the parameters of the generative model so that its ability to recognize the perturbed images is minimized, causing the model to fail to learn or identify the original image content.
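
One way to read the setup above is as a bi-level optimization problem. The following is only a sketch under assumptions: the perturbation δ, the budget r, the ℓ∞ constraint, the training loss 𝓛, and the protector's objective 𝒜 are notation introduced here for illustration, and the exact forms of equations (4) and (5) should be taken from the paper.

$$
\delta^{*} \in \arg\max_{\|\delta\|_{\infty} \le r} \; \mathcal{A}\big(\hat{x}_{\theta^{*}},\, X_c\big)
\quad \text{s.t.} \quad
\theta^{*} \in \arg\min_{\theta} \; \mathcal{L}\big(\hat{x}_{\theta};\, X_p\big),
\qquad X_p = \{\, x + \delta_x : x \in X_c \,\}
$$

Here the inner problem is the trainer fine-tuning the generator on the poisoned set Xp, and the outer problem is the protector choosing small perturbations so that the resulting θ* reproduces the protected subject as poorly as possible.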

slide 7

Method

Optimizing the proxy model parameters θ̃.

Minimizing the model’s recognition capability on the protected image Xp.

Using SGD → The updated protected images Xp should stay as visually close as possible to the original images X0p, while reducing the generative model’s recognition capability as much as possible.
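
The alternating procedure described above can be illustrated with a deliberately tiny toy. This is a sketch under loud assumptions, not MetaCloak itself: the "images" are single numbers, the "surrogate model" is one scalar `theta` fitted by gradient descent, the loss is mean squared error, and the perturbation update is a signed gradient step clipped to a budget `radius` (the slide only says SGD). All function and parameter names are hypothetical.

```python
def mse(theta, xs):
    """Mean squared error of a one-parameter 'model' on scalar 'images'."""
    return sum((theta - x) ** 2 for x in xs) / len(xs)


def sign(v):
    return (v > 0) - (v < 0)


def craft_perturbations(clean, radius=0.05, step=0.01, rounds=20,
                        inner_steps=5, lr=0.1):
    """Alternate between fitting a surrogate and updating the perturbations.

    Assumption: this toy only mirrors the min/max structure of the method;
    it does not use a diffusion model or the paper's loss.
    """
    n = len(clean)
    deltas = [0.0] * n   # one perturbation per image, starts at zero
    theta = 0.0          # surrogate parameter
    for _ in range(rounds):
        poisoned = [x + d for x, d in zip(clean, deltas)]
        # Surrogate update: gradient descent on theta (minimize the loss).
        for _ in range(inner_steps):
            grad_theta = sum(2 * (theta - x) for x in poisoned) / n
            theta -= lr * grad_theta
        # Perturbation update: signed gradient ascent on the loss, then
        # clip so each perturbed point stays close to its original.
        for i, x in enumerate(poisoned):
            grad_x = -2 * (theta - x) / n
            candidate = deltas[i] + step * sign(grad_x)
            deltas[i] = max(-radius, min(radius, candidate))
    return deltas, theta


clean = [0.2, 0.5, 0.8]
deltas, theta = craft_perturbations(clean)
poisoned = [x + d for x, d in zip(clean, deltas)]
print("perturbations:", deltas)
print("loss on poisoned data:", mse(theta, poisoned))
```

The clipping step plays the role of the visual-closeness requirement: however many rounds are run, no point moves further than `radius` from its original value.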

slide 8

Results

slide 9

Results

slide 10

GPT-4o Image Generation

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

arXiv:2406.06525

2024/06

slide 12

GPT-4o Image Generation

Technical Principles

GPT-4o utilizes an autoregressive model, similar to how humans write, generating images step by step from the top-left corner to the bottom-right.
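
The top-left to bottom-right generation order can be sketched as a raster-order loop over image tokens. This is a minimal illustration with assumed names, not OpenAI's implementation: `sample_next_token` is a hypothetical stand-in that returns random token ids, where a trained model would predict each token from everything generated so far.

```python
import random


def sample_next_token(prefix, vocab_size, rng):
    """Hypothetical stand-in for a trained model.

    A real autoregressive model would condition on `prefix`; this toy
    ignores it and returns a random token id.
    """
    return rng.randrange(vocab_size)


def generate_image_tokens(height, width, vocab_size=16, seed=0):
    """Generate a grid of token ids one at a time, in raster order."""
    rng = random.Random(seed)
    tokens = []  # flat sequence, row by row
    for _row in range(height):
        for _col in range(width):
            tokens.append(sample_next_token(tokens, vocab_size, rng))
    # Reshape the flat sequence into rows for display.
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


grid = generate_image_tokens(4, 4)
for row in grid:
    print(row)
```

Because each token is appended to the same sequence that later tokens are conditioned on, earlier parts of the image are fixed once generated.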

Compared to traditional diffusion models, this method significantly improves detail accuracy and text rendering quality, reducing random inconsistencies in images.

Model Training & Optimization

Training took over a year, as revealed by OpenAI research lead Gabriel Goh, with hundreds of human trainers refining details to enhance precision and AI comprehension.

Learning mechanism: The AI continuously improves by learning from human corrections, leading to better image quality and understanding.

Performance & Limitations

GPT-4o image generation is slightly slower than DALL·E 3, but the improvement in image quality and knowledge integration makes the additional processing time worthwhile.

slide 13

Comparison: Diffusion vs. Autoregressive

| Aspect | Diffusion Model | Autoregressive Model |
| --- | --- | --- |
| Working Principle | Starts with random noise, then applies multi-step denoising to generate a clear image | Step-by-step generation; determines part of the final image with each step |
| Generation Process | Global approach, generating the entire image at once and refining details | Linear progression, constructing the image from top to bottom or left to right |
| Consistency & Stability | Since the process starts from random noise, maintaining consistency and stability is difficult | More controlled and stable, improving semantic understanding |
| Examples | Stable Diffusion, DALL·E 2/3 | GPT-4o (image generation) |
| Advantages | Can create high-quality images; stable and diverse | AR architecture can seamlessly integrate with LLMs; better for multimodal understanding; accurately links text and images |
| Disadvantages | Limited by random noise, leading to inconsistent results | High computational cost; image detail may be limited by token processing constraints |
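
To make the "global, multi-step denoising" column concrete, here is a toy loop that starts from random noise and refines every pixel at every step. It is an assumption-heavy illustration, not a real diffusion sampler: there is no learned noise predictor and no noise schedule, and `target` simply stands in for what a trained denoiser would steer toward.

```python
import random


def toy_iterative_denoise(target, steps=10, seed=0):
    """Toy illustration only: refine a whole 'image' from noise in several steps."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    history = [list(image)]
    for step in range(steps):
        # Fraction of the remaining gap closed at this step; 1.0 on the last step.
        blend = 1.0 / (steps - step)
        # Every pixel is updated at every step: the refinement is global.
        image = [px + blend * (t - px) for px, t in zip(image, target)]
        history.append(list(image))
    return image, history


def max_error(image, target):
    return max(abs(px - t) for px, t in zip(image, target))


target = [0.0, 0.25, 0.5, 0.75, 1.0]
final, history = toy_iterative_denoise(target)
print("error at start:", max_error(history[0], target))
print("error at end:  ", max_error(final, target))
```

The contrast with the autoregressive column is structural: here the whole image exists from the first step and is revised repeatedly, whereas an autoregressive model commits to one part of the image at a time.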

GPT-4o’s Key Innovations

Balanced Speed & Quality → Optimized model structure ensures fast generation while maintaining high visual quality.

Consistency in Large Images → Advanced algorithms prevent detail inconsistencies, making images appear more natural.

Better Text-Image Alignment → Both text and images are vectorized into tokens, improving AI’s understanding and accuracy.
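
One common way to put text and images into a single token stream is to give image codes their own id range after the text vocabulary. The sketch below assumes this layout for illustration only; the vocabulary sizes and function names are hypothetical, and how GPT-4o actually tokenizes images has not been published.

```python
TEXT_VOCAB_SIZE = 1000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 256   # hypothetical number of discrete image codes


def to_unified_sequence(text_ids, image_codes):
    """Concatenate text ids and image codes into one id space.

    Image codes are shifted by TEXT_VOCAB_SIZE so the two ranges never collide.
    """
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in image_codes):
        raise ValueError("image code out of range")
    return list(text_ids) + [TEXT_VOCAB_SIZE + c for c in image_codes]


def split_unified_sequence(sequence):
    """Recover the text ids and the original image codes from a unified sequence."""
    text_ids = [t for t in sequence if t < TEXT_VOCAB_SIZE]
    image_codes = [t - TEXT_VOCAB_SIZE for t in sequence if t >= TEXT_VOCAB_SIZE]
    return text_ids, image_codes


sequence = to_unified_sequence([5, 42], [0, 255])
print(sequence)
print(split_unified_sequence(sequence))
```

With a shared id space, a single sequence model can attend across both modalities, which is the intuition behind the alignment claim above.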

Possible Technical Explanations:

Token-Based Sketching → The model may first generate a rough sketch using tokens, followed by a refinement phase using a diffusion-like denoising process.

Chain-of-Thought (CoT) Style Refinement → The model could iteratively enhance the image step by step, similar to how CoT reasoning improves text generation.

Layered Input Processing → The model might generate a low-resolution draft first and then apply multiple processing steps to refine details and increase clarity.

slide 16

References

gpt-4o-img-gen:

https://x.com/dotey/status/1904684852982813022

https://www.facebook.com/photo/?fbid=10162841428450802&set=a.10150347633745802

https://www.threads.net/@prompt_case/post/DH0dmZcxtMt?xmt=AQGze7s1OXyuZ0F_Mh6jNpi8cGfsrCs3OcYjz-S6E8lYug

llama-gen:

https://www.threads.net/@shaochuanwang/post/DFBVLeTzeco

This post is licensed under CC BY 4.0 by the author.