Are LLMs exposing your private information? - Kritika Prakash, ex-Google

May 14, 2025

BACKGROUND:

Abbhinav (123 of AI) and co-host Debayan (Microsoft) are joined by AI researcher Kritika Prakash, now pursuing her Ph.D. at the University of Chicago.

The discussion revolves around Kritika’s academic journey before and at IIIT-Hyderabad, differential privacy in AI, and her thoughts on the need for regulations to ensure a machine learning model's trustworthiness. We discuss the privacy issues in machine learning, including the challenges of protecting textual data and the impact of large language models on a user’s privacy.

KRITIKA REVEALS:

Her visit to IISc Bangalore during eleventh grade left a lasting impression, particularly her exposure to game theory and simulations.
How her father's academic background influenced her decision-making.
How she discovered differential privacy through independent studies and developed a keen interest due to its theoretical foundations and real-world applications.
That differential privacy provides a mathematical guarantee for user data protection within the computational privacy field.
Working with text data poses unique challenges due to the indirect nature of revealing private information through contextual clues.
Language models trained on public data can inadvertently reveal private information, necessitating pre- and post-processing techniques to enhance privacy.
Adding noise before or after training is the primary approach for ensuring privacy in language models, but it presents evaluation difficulties.
There is a trade-off between privacy and accuracy in machine learning models.
The concept of differential privacy ensures that model training is resilient to attacks, such as membership inference and data reconstruction.
The random-number-based strategy proposed for differential privacy ensures plausible deniability: with 50% probability, the respondent gives a random answer instead of the true one.
The use of large language models (LLMs) like GPT raises concerns about hallucination and inadvertent exposure of private details from publicly trained datasets.
The correlation between generalization and inherent privacy in certain mechanisms like dropout in neural networks has theoretical support.
Real-world scenarios such as healthcare research or autonomous vehicles highlight the importance of safeguarding individuals' data while utilizing it for analysis or decision-making purposes.
Regulations need further development, given how rapidly machine learning is advancing, to address emerging challenges around model trustworthiness and users' right to be forgotten.
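
The coin-flip strategy mentioned in the list above is commonly known as randomized response. Below is a minimal Python sketch of it, assuming the classic variant in which a respondent tells the truth with probability 0.5 and otherwise answers yes or no uniformly at random. The episode summary states only the 50% figure, so the function names, the 30% sample rate, and the estimator are illustrative assumptions rather than anything quoted from the conversation.

```python
import math
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the truth with probability 0.5, otherwise a uniformly random answer."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports: list[bool]) -> float:
    """Undo the noise in aggregate: observed_rate = 0.5 * true_rate + 0.25."""
    observed_rate = sum(reports) / len(reports)
    return 2.0 * observed_rate - 0.5


# Under this assumed scheme:
#   P(report yes | truth yes) = 0.75, P(report yes | truth no) = 0.25,
# so the mechanism satisfies epsilon-differential privacy with epsilon = ln 3.
EPSILON = math.log(0.75 / 0.25)

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical population: 30% of respondents have a true "yes".
    true_answers = [i < 3000 for i in range(10_000)]
    reports = [randomized_response(t, rng) for t in true_answers]
    print(f"estimated yes-rate: {estimate_true_rate(reports):.3f}")
    print(f"epsilon: {EPSILON:.3f}")
```

No single report reveals a respondent's true answer, yet the population-level rate can still be estimated. That is the privacy and accuracy trade-off in miniature: the estimate is noisier than it would be with truthful answers.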

BEST MOMENTS:

“I did have exposure... where I can appreciate... lot of South Indian cultures.”

“I found electronics too hard... shifted to Computer science.”

“Repeating an extra year: It felt like I am not really losing time... just learning more.”

“Talking to him [father]... understanding his perspective on things... He urged me... go for this risky thing.”

“I got [to] do computer science, but my first 3 semesters I was just so bored.”

“Discovered that there is this element of strategy & games that you can work with.”

“Explored various research areas before delving into differential privacy.”

“You might not have anything to hide. But, you do have something to protect.”

“Adding noise during training reduce your accuracy; it actually helps improve generalization.”

“A smaller epsilon value indicates tighter or stronger [differential] privacy.”

“Real-world applications include healthcare research... where preserving individual [data] privacy is crucial.”

“Differential Privacy is not going [to] cover all kinds of cases; we are looking at [it] because this is the worst-case guarantee.”

“The text itself doesn't have clear distinctions between people's data and the internet’s data; it's a huge mess just because that's how the text domain is.”

“Privacy will always be important no matter where the machine learning field is headed.”