FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs
Sepehr Dehdashtian*, Lan Wang*, Vishnu N. Boddeti
{sepehr, wanglan3, vishnu}@msu.edu
Michigan State University
Bias in CLIP's Zero-Shot Prediction
Several studies have highlighted the bias problem in CLIP's zero-shot predictions.
As an example of this bias, consider the CelebA dataset, where the task is to classify hair color.
As we can see here, when predicting a person's hair color, the CLIP model's performance is imbalanced across genders.
Although the average accuracy looks reasonable, the gap between the worst-group accuracy and the average accuracy is 16%, which is considerable.
Types of Attribute Dependency
We categorize attribute dependencies into two types. In the first, there is a spurious correlation between the target attribute Y and the sensitive attribute S.
For example, hair color in the CelebA dataset is spuriously correlated with gender.
In the second, there is an intrinsic dependency between the two attributes. For example, cheekbone height is intrinsically dependent on gender.
Most prior studies focus on spurious correlations and disregard intrinsic dependencies.
FairerCLIP
Problem Setting
Using FairerCLIP, we want to solve the following problem: given a set of
images and their corresponding text prompts, learn two encoders, one for images and one for text,
that debias the representations generated by the CLIP model with respect to a sensitive attribute S.
Training
This is the training process of FairerCLIP and its objective function.
FairerCLIP takes the image and text features extracted from the CLIP model and maps them to the output representation space.
In each training iteration, a closed-form solution is computed for each encoder, and the
resulting representation is passed to the other encoder. This process repeats until convergence.
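As a rough illustration of this alternating scheme, here is a minimal sketch (solve_encoder is a hypothetical stand-in for the paper's closed-form RKHS solver; it takes one modality's CLIP features, the other modality's current representation, and the labels, and returns the updated encoder and representation):

```python
import numpy as np

def alternate_train(x_img, x_txt, y, s, solve_encoder, n_iter=20, tol=1e-5):
    """Alternating closed-form updates for the two FairerCLIP encoders.

    x_img, x_txt: image/text features extracted from CLIP.
    y, s: (pseudo-)labels for the target and sensitive attributes.
    solve_encoder: hypothetical closed-form solver for one encoder.
    """
    z_img, z_txt = x_img.copy(), x_txt.copy()          # start from raw CLIP features
    for _ in range(n_iter):
        z_prev = z_img
        enc_img, z_img = solve_encoder(x_img, z_txt, y, s)  # update the image encoder
        enc_txt, z_txt = solve_encoder(x_txt, z_img, y, s)  # update the text encoder
        if np.linalg.norm(z_img - z_prev) < tol:            # stop once converged
            break
    return enc_img, enc_txt
```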
The parameters of these two encoders are computed to maximize the dependency between the generated representations Z and the target attribute Y,
while minimizing the dependency between Z and the sensitive attribute S.
In addition, to preserve or improve the accuracy of cosine-similarity-based classification, FairerCLIP maximizes
the dependency between the generated image representations and their corresponding text prompts.
Here, tau_I and tau_T are hyperparameters that control the strength of the debiasing constraints,
and tau_Z controls the strength of the similarity constraint.
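Schematically, the objective described above can be written as follows (our notation, with Dep denoting the dependence measure defined next and Z_I, Z_T the image and text representations; this is a reading of the description above, not necessarily the paper's exact formulation):

```latex
\max_{\Theta_I,\;\Theta_T}\;
\underbrace{\mathrm{Dep}(Z_I, Y) + \mathrm{Dep}(Z_T, Y)}_{\text{predict } Y}
\;-\;
\underbrace{\bigl(\tau_I\,\mathrm{Dep}(Z_I, S) + \tau_T\,\mathrm{Dep}(Z_T, S)\bigr)}_{\text{debiasing}}
\;+\;
\underbrace{\tau_Z\,\mathrm{Dep}(Z_I, Z_T)}_{\text{image-text similarity}}
```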
We use a simplified definition of the Hilbert-Schmidt Independence Criterion (HSIC) as our dependence measure,
which has attractive properties, such as the practical ability to capture all linear and non-linear modes of dependence.
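For reference, here is a minimal NumPy sketch of the standard biased empirical HSIC estimator (the paper uses a simplified variant; the RBF kernel and bandwidth here are illustrative choices):

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    """Pairwise RBF kernel matrix for the rows of x (shape n x d)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2 (Gretton et al., 2005)."""
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# e.g., hsic(z, s.reshape(-1, 1).astype(float)) near zero suggests the
# representation z carries little information about the sensitive attribute s.
```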
The majority of prior works assume access to ground-truth labels for the target attribute Y and the sensitive attribute S.
However, in many real-world scenarios and applications, these labels are not available.
Pseudo-Label Prediction
Therefore, we predict pseudo-labels for Y and S using the CLIP model itself.
These pseudo-labels are passed to FairerCLIP during training, and the pseudo-labels for the target attribute are updated in each training iteration.
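A minimal sketch of this zero-shot pseudo-labeling step with the openai/CLIP package (the prompt templates are illustrative, not the paper's):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def pseudo_labels(images, class_prompts):
    """Assign each image the class whose prompt embedding is most similar."""
    tokens = clip.tokenize(class_prompts).to(device)
    with torch.no_grad():
        img = model.encode_image(images)               # images: preprocessed batch
        txt = model.encode_text(tokens)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).argmax(dim=-1)                # cosine-similarity argmax

# e.g. (illustrative prompts):
# y_hat = pseudo_labels(batch, ["a photo of a celebrity with blond hair",
#                               "a photo of a celebrity with dark hair"])
# s_hat = pseudo_labels(batch, ["a photo of a man", "a photo of a woman"])
```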
FairerCLIP
Inference Overview
This is how FairerCLIP is used at inference time: we apply the encoders learned by FairerCLIP to debias the image and text representations generated by CLIP.
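A minimal sketch of the inference step, assuming the learned encoders can be applied as linear maps to precomputed CLIP features (in the paper they are functions in an RKHS):

```python
import numpy as np

def debiased_zero_shot(img_feats, txt_feats, enc_img, enc_txt):
    """Debias CLIP features with the learned encoders, then classify by
    cosine similarity, exactly as in standard CLIP zero-shot prediction."""
    z_img = img_feats @ enc_img                        # debiased image features
    z_txt = txt_feats @ enc_txt                        # debiased class-prompt features
    z_img /= np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt /= np.linalg.norm(z_txt, axis=1, keepdims=True)
    return (z_img @ z_txt.T).argmax(axis=1)            # predicted class per image
```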
Mitigating Intrinsic Dependency
CelebA (Y: High Cheekbone, S: Gender)
When mitigating the intrinsic dependency on the CelebA dataset, we can see that FairerCLIP
outperforms the other methods, driving the Equal Opportunity Difference (EOD) to nearly zero
while maintaining a relatively high average accuracy for both CLIP ViT and CLIP ResNet.
Mitigating Spurious Correlation
W/O Labels
W/ Labels
In mitigating spurious correlations on the CelebA and Waterbirds datasets, we can see that
FairerCLIP outperforms the baselines, both with and without ground-truth labels, in terms of worst-group accuracy (WG) and Gap, the difference between
the average accuracy and the worst-group accuracy.
Thank you!
Thanks for watching. Please check out the paper's webpage, via the provided QR code, for the code and more details. Thank you.