
AXRP - the AI X-risk Research Podcast

English, Technology, 1 season, 42 episodes, 3 days, 2 hours, 30 minutes
About
AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

37 - Jaime Sevilla on AI Forecasting

Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when the above trends might hit an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/10/04/episode-37-jaime-sevilla-forecasting-ai.html   Topics we discuss, and timestamps: 0:00:38 - The pace of AI progress 0:07:49 - How Epoch AI tracks AI compute 0:11:44 - Why does AI compute grow so smoothly? 0:21:46 - When will we run out of computers? 0:38:56 - Algorithmic improvement 0:44:21 - Algorithmic improvement and scaling laws 0:56:56 - Training data 1:04:56 - Can scaling produce AGI? 1:16:55 - When will AGI arrive? 1:21:20 - Epoch AI 1:27:06 - Open questions in AI forecasting 1:35:21 - Epoch AI and x-risk 1:41:34 - Following Epoch AI's research   Links for Jaime and Epoch AI: Epoch AI: https://epochai.org/ Machine Learning Trends dashboard: https://epochai.org/trends Epoch AI on X / Twitter: https://x.com/EpochAIResearch Jaime on X / Twitter: https://x.com/Jsevillamol   Research we discuss: Training Compute of Frontier AI Models Grows by 4-5x per Year: https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year Optimally Allocating Compute Between Inference and Training: https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training Algorithmic Progress in Language Models [blog post]: https://epochai.org/blog/algorithmic-progress-in-language-models Algorithmic progress in language models [paper]: https://arxiv.org/abs/2403.05812 Training Compute-Optimal Large Language Models [aka the Chinchilla scaling law paper]: https://arxiv.org/abs/2203.15556 Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data [blog post]: https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data Will we run out of data? Limits of LLM scaling based on human-generated data [paper]: https://arxiv.org/abs/2211.04325 The Direct Approach: https://epochai.org/blog/the-direct-approach   Episode art by Hamish Doodles: hamishdoodles.com
10/4/2024 (1 hour, 44 minutes, 25 seconds)

36 - Adam Shai and Paul Riechers on Computational Mechanics

Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/09/29/episode-36-adam-shai-paul-riechers-computational-mechanics.html   Topics we discuss, and timestamps: 0:00:42 - What computational mechanics is 0:29:49 - Computational mechanics vs other approaches 0:36:16 - What world models are 0:48:41 - Fractals 0:57:43 - How the fractals are formed 1:09:55 - Scaling computational mechanics for transformers 1:21:52 - How Adam and Paul found computational mechanics 1:36:16 - Computational mechanics for AI safety 1:46:05 - Following Adam and Paul's research   Simplex AI Safety: https://www.simplexaisafety.com/   Research we discuss: Transformers represent belief state geometry in their residual stream: https://arxiv.org/abs/2405.15943 Transformers represent belief state geometry in their residual stream [LessWrong post]: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why   Episode art by Hamish Doodles: hamishdoodles.com
9/29/2024 (1 hour, 48 minutes, 27 seconds)

New Patreon tiers + MATS applications

Patreon: https://www.patreon.com/axrpodcast MATS: https://www.matsprogram.org Note: I'm employed by MATS, but they're not paying me to make this video.
9/28/2024 (5 minutes, 32 seconds)

35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat to Peter Hase about his research into these questions. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html   Topics we discuss, and timestamps: 0:00:36 - NLP and interpretability 0:10:20 - Interpretability lessons 0:32:22 - Belief interpretability 1:00:12 - Localizing and editing models' beliefs 1:19:18 - Beliefs beyond language models 1:27:21 - Easy-to-hard generalization 1:47:16 - What do easy-to-hard results tell us? 1:57:33 - Easy-to-hard vs weak-to-strong 2:03:50 - Different notions of hardness 2:13:01 - Easy-to-hard vs weak-to-strong, round 2 2:15:39 - Following Peter's work   Peter on Twitter: https://x.com/peterbhase   Peter's papers: Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932 Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654 Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213 Are Language Models Rational? 
The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442 The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751   Other links: Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279 Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262 Of nonlinearity and commutativity in BERT: https://arxiv.org/abs/2101.04547 Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: https://arxiv.org/abs/2306.03341 Editing a classifier by rewriting its prediction rules: https://arxiv.org/abs/2112.01008 Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): https://arxiv.org/abs/2212.03827 Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: https://arxiv.org/abs/2312.09390 Concrete problems in AI safety: https://arxiv.org/abs/1606.06565 Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: https://arxiv.org/abs/2103.03872   Episode art by Hamish Doodles: hamishdoodles.com
8/24/2024 (2 hours, 17 minutes, 24 seconds)

34 - AI Evaluations with Beth Barnes

How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html   Topics we discuss, and timestamps: 0:00:37 - What is METR? 0:02:44 - What is an "eval"? 0:14:42 - How good are evals? 0:37:25 - Are models showing their full capabilities? 0:53:25 - Evaluating alignment 1:01:38 - Existential safety methodology 1:12:13 - Threat models and capability buffers 1:38:25 - METR's policy work 1:48:19 - METR's relationships with labs 2:04:12 - Related research 2:10:02 - Roles at METR, and following METR's work   Links for METR: METR: https://metr.org METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/ METR - Hiring: https://metr.org/hiring Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/   Other links: Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story) https://metr.org/blog/2023-03-18-update-on-recent-evals/ Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566 Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/ ChatGPT can talk, 
but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428   Episode art by Hamish Doodles: hamishdoodles.com
7/28/2024 (2 hours, 14 minutes, 2 seconds)

33 - RLHF Problems with Scott Emmons

Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html Topics we discuss, and timestamps: 0:00:33 - Deceptive inflation 0:17:56 - Overjustification 0:32:48 - Bounded human rationality 0:50:46 - Avoiding these problems 1:14:13 - Dimensional analysis 1:23:32 - RLHF problems, in theory and practice 1:31:29 - Scott's research program 1:39:42 - Following Scott's research   Scott's website: https://www.scottemmons.com Scott's X/twitter account: https://x.com/emmons_scott When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747   Other works we discuss: AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752 Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394 Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475 The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693   Episode art by Hamish Doodles: hamishdoodles.com
6/12/2024 (1 hour, 41 minutes, 24 seconds)

32 - Understanding Agency with Jan Kulveit

What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: axrp.net/episode/2024/05/30/episode-32-understanding-agency-jan-kulveit.html Topics we discuss, and timestamps: 0:00:47 - What is active inference? 0:15:14 - Preferences in active inference 0:31:33 - Action vs perception in active inference 0:46:07 - Feedback loops 1:01:32 - Active inference vs LLMs 1:12:04 - Hierarchical agency 1:58:28 - The Alignment of Complex Systems group   Website of the Alignment of Complex Systems group (ACS): acsresearch.org ACS on X/Twitter: x.com/acsresearchorg Jan on LessWrong: lesswrong.com/users/jan-kulveit Predictive Minds: Large Language Models as Atypical Active Inference Agents: arxiv.org/abs/2311.10215   Other works we discuss: Active Inference: The Free Energy Principle in Mind, Brain, and Behavior: https://www.goodreads.com/en/book/show/58275959 Book Review: Surfing Uncertainty: https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/ The self-unalignment problem: https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem Mitigating generative agent social dilemmas (aka language models writing contracts for Minecraft): https://social-dilemmas.github.io/   Episode art by Hamish Doodles: hamishdoodles.com
5/30/2024 (2 hours, 22 minutes, 29 seconds)

31 - Singular Learning Theory with Daniel Murfet

What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:26 - What is singular learning theory? 0:16:00 - Phase transitions 0:35:12 - Estimating the local learning coefficient 0:44:37 - Singular learning theory and generalization 1:00:39 - Singular learning theory vs other deep learning theory 1:17:06 - How singular learning theory hit AI alignment 1:33:12 - Payoffs of singular learning theory for AI alignment 1:59:36 - Does singular learning theory advance AI capabilities? 2:13:02 - Open problems in singular learning theory for AI alignment 2:20:53 - What is the singular fluctuation? 2:25:33 - How geometry relates to information 2:30:13 - Following Daniel Murfet's work   The transcript: https://axrp.net/episode/2024/05/07/episode-31-singular-learning-theory-dan-murfet.html Daniel Murfet's twitter/X account: https://twitter.com/danielmurfet Developmental interpretability website: https://devinterp.com Developmental interpretability YouTube channel: https://www.youtube.com/@Devinterp   Main research discussed in this episode: - Developmental Landscape of In-Context Learning: https://arxiv.org/abs/2402.02364 - Estimating the Local Learning Coefficient at Scale: https://arxiv.org/abs/2402.03698 - Simple versus Short: Higher-order degeneracy and error-correction: https://www.lesswrong.com/posts/nWRj6Ey8e5siAEXbK/simple-versus-short-higher-order-degeneracy-and-error-1   Other links: - Algebraic Geometry and Statistical Learning Theory (the grey book): https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A - 
Mathematical Theory of Bayesian Statistics (the green book): https://www.routledge.com/Mathematical-Theory-of-Bayesian-Statistics/Watanabe/p/book/9780367734817 In-context learning and induction heads: https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html - Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity: https://arxiv.org/abs/2106.15933 - A mathematical theory of semantic development in deep neural networks: https://www.pnas.org/doi/abs/10.1073/pnas.1820226116 - Consideration on the Learning Efficiency Of Multiple-Layered Neural Networks with Linear Units: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4404877 - Neural Tangent Kernel: Convergence and Generalization in Neural Networks: https://arxiv.org/abs/1806.07572 - The Interpolating Information Criterion for Overparameterized Models: https://arxiv.org/abs/2307.07785 - Feature Learning in Infinite-Width Neural Networks: https://arxiv.org/abs/2011.14522 - A central AI alignment problem: capabilities generalization, and the sharp left turn: https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization - Quantifying degeneracy in singular models via the learning coefficient: https://arxiv.org/abs/2308.12108   Episode art by Hamish Doodles: hamishdoodles.com
5/7/2024 (2 hours, 32 minutes, 7 seconds)

30 - AI Security with Jeffrey Ladish

Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don't get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:38 - Fine-tuning away safety training 0:13:50 - Dangers of open LLMs vs internet search 0:19:52 - What we learn by undoing safety filters 0:27:34 - What can you do with jailbroken AI? 0:35:28 - Security of AI model weights 0:49:21 - Securing against attackers vs AI exfiltration 1:08:43 - The state of computer security 1:23:08 - How AI labs could be more secure 1:33:13 - What does Palisade do? 1:44:40 - AI phishing 1:53:32 - More on Palisade's work 1:59:56 - Red lines in AI development 2:09:56 - Making AI legible 2:14:08 - Following Jeffrey's research   The transcript: axrp.net/episode/2024/04/30/episode-30-ai-security-jeffrey-ladish.html Palisade Research: palisaderesearch.org Jeffrey's Twitter/X account: twitter.com/JeffLadish   Main papers we discussed: - LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B: arxiv.org/abs/2310.20624 - BadLLaMa: Cheaply Removing Safety Fine-tuning From LLaMa 2-Chat 13B: arxiv.org/abs/2311.00117 - Securing Artificial Intelligence Model Weights: rand.org/pubs/working_papers/WRA2849-1.html   Other links: - Llama 2: Open Foundation and Fine-Tuned Chat Models: https://arxiv.org/abs/2307.09288 - Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693 - Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models: https://arxiv.org/abs/2310.02949 - On the Societal Impact of Open Foundation Models (Stanford paper on marginal harms from open-weight models): https://crfm.stanford.edu/open-fms/ - The Operational Risks of AI in 
Large-Scale Biological Attacks (RAND): https://www.rand.org/pubs/research_reports/RRA2977-2.html - Preventing model exfiltration with upload limits: https://www.alignmentforum.org/posts/rf66R4YsrCHgWx9RG/preventing-model-exfiltration-with-upload-limits - A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution: https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html - In-browser transformer inference: https://aiserv.cloud/ - Anatomy of a rental phishing scam: https://jeffreyladish.com/anatomy-of-a-rental-phishing-scam/ - Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing   Episode art by Hamish Doodles: hamishdoodles.com
4/30/2024 (2 hours, 15 minutes, 44 seconds)

29 - Science of Deep Learning with Vikrant Varma

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast   Topics we discuss, and timestamps: 0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS   0:00:36 - What is CCS?   0:09:54 - Consistent and contrastive features other than model beliefs   0:20:34 - Understanding the banana/shed mystery   0:41:59 - Future CCS-like approaches   0:53:29 - CCS as principal component analysis 0:56:21 - Explaining grokking through circuit efficiency   0:57:44 - Why research science of deep learning?   1:12:07 - Summary of the paper's hypothesis   1:14:05 - What are 'circuits'?   
1:20:48 - The role of complexity   1:24:07 - Many kinds of circuits   1:28:10 - How circuits are learned   1:38:24 - Semi-grokking and ungrokking   1:50:53 - Generalizing the results 1:58:51 - Vikrant's research approach 2:06:36 - The DeepMind alignment team 2:09:06 - Follow-up work   The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html Vikrant's Twitter/X account: twitter.com/vikrantvarma_   Main papers:  - Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029  - Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390   Other works discussed:  - Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827 - Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit - Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1 - Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY - Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4 - Scott Emmons, Contrast Pairs Drive the Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast - Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177 - Keeping Neural Networks Simple by Minimizing the Minimum Description Length of the Weights (Hinton 1993 L2): dl.acm.org/doi/pdf/10.1145/168304.168306 - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.0521   Episode art by Hamish Doodles: hamishdoodles.com
4/25/2024 (2 hours, 13 minutes, 46 seconds)

28 - Suing Labs for AI Risk with Gabriel Weil

How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is developed safely - but another approach is possible. In this episode, Gabriel Weil talks about his proposal to modify tort law to enable people to sue AI companies for disasters that are "nearly catastrophic". Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast   Topics we discuss, and timestamps: 0:00:35 - The basic idea 0:20:36 - Tort law vs regulation 0:29:10 - Weil's proposal vs Hanson's proposal 0:37:00 - Tort law vs Pigouvian taxation 0:41:16 - Does disagreement on AI risk make this proposal less effective? 0:49:53 - Warning shots - their prevalence and character 0:59:17 - Feasibility of big changes to liability law 1:29:17 - Interactions with other areas of law 1:38:59 - How Gabriel encountered the AI x-risk field 1:42:41 - AI x-risk and the legal field 1:47:44 - Technical research to help with this proposal 1:50:47 - Decisions this proposal could influence 1:55:34 - Following Gabriel's research   The transcript: axrp.net/episode/2024/04/17/episode-28-tort-law-for-ai-risk-gabriel-weil.html   Links for Gabriel:  - SSRN page: papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1648032  - Twitter/X account: twitter.com/gabriel_weil   Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence: papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006   Other links:  - Foom liability: overcomingbias.com/p/foom-liability  - Punitive Damages: An Economic Analysis: law.harvard.edu/faculty/shavell/pdf/111_Harvard_Law_Rev_869.pdf  - Efficiency, Fairness, and the Externalization of Reasonable Risks: The Problem With the Learned Hand Formula: papers.ssrn.com/sol3/papers.cfm?abstract_id=4466197  - Tort Law Can Play an Important Role in Mitigating AI Risk: forum.effectivealtruism.org/posts/epKBmiyLpZWWFEYDb/tort-law-can-play-an-important-role-in-mitigating-ai-risk  - How Technical AI Safety Researchers Can Help 
Implement Punitive Damages to Mitigate Catastrophic AI Risk: forum.effectivealtruism.org/posts/yWKaBdBygecE42hFZ/how-technical-ai-safety-researchers-can-help-implement  - Can the courts save us from dangerous AI? [Vox]: vox.com/future-perfect/2024/2/7/24062374/ai-openai-anthropic-deepmind-legal-liability-gabriel-weil   Episode art by Hamish Doodles: hamishdoodles.com
4/17/2024 (1 hour, 57 minutes, 30 seconds)

27 - AI Control with Buck Shlegeris and Ryan Greenblatt

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:31 - What is AI control? 0:16:16 - Protocols for AI control 0:22:43 - Which AIs are controllable? 0:29:56 - Preventing dangerous coded AI communication 0:40:42 - Unpredictably uncontrollable AI 0:58:01 - What control looks like 1:08:45 - Is AI control evil? 1:24:42 - Can red teams match misaligned AI? 1:36:51 - How expensive is AI monitoring? 1:52:32 - AI control experiments 2:03:50 - GPT-4's aptitude at inserting backdoors 2:14:50 - How AI control relates to the AI safety field 2:39:25 - How AI control relates to previous Redwood Research work 2:49:16 - How people can work on AI control 2:54:07 - Following Buck and Ryan's research The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html Links for Buck and Ryan:  - Buck's twitter/X account: twitter.com/bshlgrs  - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt  - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com Main research works we talk about:  - The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled  - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942 Other things we mention:  - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root  - Preventing language models from hiding 
their reasoning: arxiv.org/abs/2310.18512  - Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal  - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938  - Causal Scrubbing: a method for rigorously testing interpretability hypotheses lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing Episode art by Hamish Doodles: hamishdoodles.com
4/11/2024 (2 hours, 56 minutes, 5 seconds)

26 - AI Governance with Elizabeth Seger

The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:40 - What kinds of AI? 0:01:30 - Democratizing AI 0:04:44 - How people talk about democratizing AI 0:09:34 - Is democratizing AI important? 0:13:31 - Links between types of democratization 0:22:43 - Democratizing profits from AI 0:27:06 - Democratizing AI governance 0:29:45 - Normative underpinnings of democratization 0:44:19 - Open-sourcing AI 0:50:47 - Risks from open-sourcing 0:56:07 - Should we make AI too dangerous to open source? 1:00:33 - Offense-defense balance 1:03:13 - KataGo as a case study 1:09:03 - Openness for interpretability research 1:15:47 - Effectiveness of substitutes for open sourcing 1:20:49 - Offense-defense balance, part 2 1:29:49 - Making open-sourcing safer? 1:40:37 - AI governance research 1:41:05 - The state of the field 1:43:33 - Open questions 1:49:58 - Distinctive governance issues of x-risk 1:53:04 - Technical research to help governance 1:55:23 - Following Elizabeth's research The transcript: https://axrp.net/episode/2023/11/26/episode-26-ai-governance-elizabeth-seger.html Links for Elizabeth: Personal website: elizabethseger.com Centre for the Governance of AI (AKA GovAI): governance.ai Main papers: Democratizing AI: Multiple Meanings, Goals, and Methods: arxiv.org/abs/2303.12642 Open-sourcing highly capable foundation models: an evaluation of risks, benefits, and alternative methods for pursuing open source objectives: papers.ssrn.com/sol3/papers.cfm?abstract_id=4596436 Other research we discuss: What Do We Mean When We Talk About "AI democratisation"? 
(blog post): governance.ai/post/what-do-we-mean-when-we-talk-about-ai-democratisation Democratic Inputs to AI (OpenAI): openai.com/blog/democratic-inputs-to-ai Collective Constitutional AI: Aligning a Language Model with Public Input (Anthropic): anthropic.com/index/collective-constitutional-ai-aligning-a-language-model-with-public-input Against "Democratizing AI": johanneshimmelreich.net/papers/against-democratizing-AI.pdf Adversarial Policies Beat Superhuman Go AIs: goattack.far.ai Structured access: an emerging paradigm for safe AI deployment: arxiv.org/abs/2201.05159 Universal and Transferable Adversarial Attacks on Aligned Language Models (aka Adversarial Suffixes): arxiv.org/abs/2307.15043
11/26/2023 (1 hour, 57 minutes, 13 seconds)

25 - Cooperative AI with Caspar Oesterheld

Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to cooperatively work together and minimize pointless conflict. How do we ensure that AIs behave this way - and what do we need to learn about how rational agents interact to make that more clear? In this episode, I'll be speaking with Caspar Oesterheld about some of his research on this very topic. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:34 - Cooperative AI 0:06:21 - Cooperative AI vs standard game theory 0:19:45 - Do we need cooperative AI if we get alignment? 0:29:29 - Cooperative AI and agent foundations 0:34:59 - A Theory of Bounded Inductive Rationality 0:50:05 - Why it matters 0:53:55 - How the theory works 1:01:38 - Relationship to logical inductors 1:15:56 - How fast does it converge? 1:19:46 - Non-myopic bounded rational inductive agents? 1:24:25 - Relationship to game theory 1:30:39 - Safe Pareto Improvements 1:30:39 - What they try to solve 1:36:15 - Alternative solutions 1:40:46 - How safe Pareto improvements work 1:51:19 - Will players fight over which safe Pareto improvement to adopt? 2:06:02 - Relationship to program equilibrium 2:11:25 - Do safe Pareto improvements break themselves? 2:15:52 - Similarity-based Cooperation 2:23:07 - Are similarity-based cooperators overly cliqueish? 
2:27:12 - Sensitivity to noise 2:29:41 - Training neural nets to do similarity-based cooperation 2:50:25 - FOCAL, Caspar's research lab 2:52:52 - How the papers all relate 2:57:49 - Relationship to functional decision theory 2:59:45 - Following Caspar's research The transcript: axrp.net/episode/2023/10/03/episode-25-cooperative-ai-caspar-oesterheld.html Links for Caspar: FOCAL at CMU: www.cs.cmu.edu/~focal/ Caspar on X, formerly known as Twitter: twitter.com/C_Oesterheld Caspar's blog: casparoesterheld.com/ Caspar on Google Scholar: scholar.google.com/citations?user=xeEcRjkAAAAJ&hl=en&oi=ao Research we discuss: A Theory of Bounded Inductive Rationality: arxiv.org/abs/2307.05068 Safe Pareto improvements for delegated game playing: link.springer.com/article/10.1007/s10458-022-09574-6 Similarity-based Cooperation: arxiv.org/abs/2211.14468 Logical Induction: arxiv.org/abs/1609.03543 Program Equilibrium: citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e1a060cda74e0e3493d0d81901a5a796158c8410 Formalizing Objections against Surrogate Goals: www.alignmentforum.org/posts/K4FrKRTrmyxrw5Dip/formalizing-objections-against-surrogate-goals Learning with Opponent-Learning Awareness: arxiv.org/abs/1709.04326
10/3/2023, 3 hours, 2 minutes, 9 seconds

24 - Superalignment with Jan Leike

Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers, attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Episode art by Hamish Doodles: hamishdoodles.com/ Topics we discuss, and timestamps: 0:00:37 - The superalignment team 0:02:10 - What's a human-level automated alignment researcher? 0:06:59 - The gap between human-level automated alignment researchers and superintelligence 0:18:39 - What does it do? 0:24:13 - Recursive self-improvement 0:26:14 - How to make the AI AI alignment researcher 0:30:09 - Scalable oversight 0:44:38 - Searching for bad behaviors and internals 0:54:14 - Deliberately training misaligned models 1:02:34 - Four year deadline 1:07:06 - What if it takes longer? 1:11:38 - The superalignment team and... 1:11:38 - ... governance 1:14:37 - ... other OpenAI teams 1:18:17 - ... other labs 1:26:10 - Superalignment team logistics 1:29:17 - Generalization 1:43:44 - Complementary research 1:48:29 - Why is Jan optimistic? 1:58:32 - Long-term agency in LLMs? 2:02:44 - Do LLMs understand alignment?
2:06:01 - Following Jan's research The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html Links for Jan and OpenAI: OpenAI jobs: openai.com/careers Jan's substack: aligned.substack.com Jan's twitter: twitter.com/janleike Links to research and other writings we discuss: Introducing Superalignment: openai.com/blog/introducing-superalignment Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050 Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond Self-critiquing models for assisting human evaluators: arxiv.org/abs/2206.05802 An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143 Language models can explain neurons in language models: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html Our approach to alignment research: openai.com/blog/our-approach-to-alignment-research Training language models to follow instructions with human feedback (aka the Instruct-GPT paper): arxiv.org/abs/2203.02155
7/27/2023, 2 hours, 8 minutes, 29 seconds

23 - Mechanistic Anomaly Detection with Mark Xu

Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that that will alert us to potential treacherous turns. We talk about both the core problems of relating these mechanistic anomalies to bad behaviour and the paper "Formalizing the presumption of independence", which formulates the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define "mechanistic anomalies". Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Episode art by Hamish Doodles: hamishdoodles.com/ Topics we discuss, and timestamps: 0:00:38 - Mechanistic anomaly detection 0:09:28 - Are all bad things mechanistic anomalies, and vice versa? 0:18:12 - Are responses to novel situations mechanistic anomalies? 0:39:19 - Formalizing "for the normal reason, for any reason" 1:05:22 - How useful is mechanistic anomaly detection? 1:12:38 - Formalizing the Presumption of Independence 1:20:05 - Heuristic arguments in physics 1:27:48 - Difficult domains for heuristic arguments 1:33:37 - Why not maximum entropy?
1:44:39 - Adversarial robustness for heuristic arguments 1:54:05 - Other approaches to defining mechanisms 1:57:20 - The research plan: progress and next steps 2:04:13 - Following ARC's research The transcript: axrp.net/episode/2023/07/24/episode-23-mechanistic-anomaly-detection-mark-xu.html ARC links: Website: alignment.org Theory blog: alignment.org/blog Hiring page: alignment.org/hiring Research we discuss: Formalizing the presumption of independence: arxiv.org/abs/2211.06738 Eliciting Latent Knowledge (aka ELK): alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge Mechanistic Anomaly Detection and ELK: alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk Can we efficiently explain model behaviours? alignmentforum.org/posts/dQvxMZkfgqGitWdkb/can-we-efficiently-explain-model-behaviors Can we efficiently distinguish different mechanisms? alignmentforum.org/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms
7/27/2023, 2 hours, 5 minutes, 52 seconds

Survey, store closing, Patreon

Very brief survey: bit.ly/axrpsurvey2023 Store is closing in a week! Link: store.axrp.net/ Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast
6/28/2023, 4 minutes, 26 seconds

22 - Shard Theory with Quintin Pope

What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe. Patreon: patreon.com/axrpodcast Store: store.axrp.net Ko-fi: ko-fi.com/axrpodcast Episode art by Hamish Doodles Topics we discuss, and timestamps: 0:00:42 - Why understand human value formation? 0:19:59 - Why not design methods to align to arbitrary values? 0:27:22 - Postulates about human brains 0:36:20 - Sufficiency of the postulates 0:44:55 - Reinforcement learning as conditional sampling 0:48:05 - Compatibility with genetically-influenced behaviour 1:03:06 - Why deep learning is basically what the brain does 1:25:17 - Shard theory 1:38:49 - Shard theory vs expected utility optimizers 1:54:45 - What shard theory says about human values 2:05:47 - Does shard theory mean we're doomed? 2:18:54 - Will nice behaviour generalize? 2:33:48 - Does alignment generalize farther than capabilities? 2:42:03 - Are we at the end of machine learning history? 2:53:09 - Shard theory predictions 2:59:47 - The shard theory research community 3:13:45 - Why do shard theorists not work on replicating human childhoods? 
3:25:53 - Following shardy research The transcript Shard theorist links: Quintin's LessWrong profile Alex Turner's LessWrong profile Shard theory Discord EleutherAI Discord Research we discuss: The Shard Theory Sequence Pretraining Language Models with Human Preferences Inner alignment in salt-starved rats Intro to Brain-like AGI Safety Sequence Brains and transformers: The neural architecture of language: Integrative modeling converges on predictive processing Brains and algorithms partially converge in natural language processing Evidence of a predictive coding hierarchy in the human brain listening to speech Singular learning theory explainer: Neural networks generalize because of this one weird trick Singular learning theory links Implicit Regularization via Neural Feature Alignment, aka circles in the parameter-function map The shard theory of human values Predicting inductive biases of pre-trained networks Understanding and controlling a maze-solving policy network, aka the cheese vector Quintin's Research agenda: Supervising AIs improving AIs Steering GPT-2-XL by adding an activation vector Links for the addendum on mesa-optimization skepticism: Quintin's response to Yudkowsky arguing against AIs being steerable by gradient descent Quintin on why evolution is not like AI training Evolution provides no evidence for the sharp left turn Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets
6/15/2023, 3 hours, 28 minutes, 21 seconds

21 - Interpretability for Engineers with Stephen Casper

Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets. Patreon: patreon.com/axrpodcast Store: store.axrp.net Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 00:00:42 - Interpretability for engineers 00:00:42 - Why interpretability? 00:12:55 - Adversaries and interpretability 00:24:30 - Scaling interpretability 00:42:29 - Critiques of the AI safety interpretability community 00:56:10 - Deceptive alignment and interpretability 01:09:48 - Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery) 01:10:40 - Why Trojans? 01:14:53 - Which interpretability tools? 01:28:40 - Trojan generation 01:38:13 - Evaluation 01:46:07 - Interpretability for shaping policy 01:53:55 - Following Casper's work The transcript Links for Casper: Personal website Twitter Electronic mail: scasper [at] mit [dot] edu Research we discuss: The Engineer's Interpretability Sequence Benchmarking Interpretability Tools for Deep Neural Networks Adversarial Policies beat Superhuman Go AIs Adversarial Examples Are Not Bugs, They Are Features Planting Undetectable Backdoors in Machine Learning Models Softmax Linear Units Red-Teaming the Stable Diffusion Safety Filter Episode art by Hamish Doodles
5/2/2023, 1 hour, 56 minutes, 2 seconds

20 - 'Reform' AI Alignment with Scott Aaronson

How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI. Note: this episode was recorded before this story emerged of a man committing suicide after discussions with a language-model-based chatbot, that included discussion of the possibility of him killing himself. Patreon: https://www.patreon.com/axrpodcast Store: https://store.axrp.net/ Ko-fi: https://ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:36 - 'Reform' AI alignment 0:01:52 - Epistemology of AI risk 0:20:08 - Immediate problems and existential risk 0:24:35 - Aligning deceitful AI 0:30:59 - Stories of AI doom 0:34:27 - Language models 0:43:08 - Democratic governance of AI 0:59:35 - What would change Scott's mind 1:14:45 - Watermarking language model outputs 1:41:41 - Watermark key secrecy and backdoor insertion 1:58:05 - Scott's transition to AI research 2:03:48 - Theoretical computer science and AI alignment 2:14:03 - AI alignment and formalizing philosophy 2:22:04 - How Scott finds AI research 2:24:53 - Following Scott's research The transcript Links to Scott's things: Personal website Book, Quantum Computing Since Democritus Blog, Shtetl-Optimized Writings we discuss: Reform AI Alignment Planting Undetectable Backdoors in Machine Learning Models
4/12/2023, 2 hours, 27 minutes, 35 seconds

Store, Patreon, Video

Store: https://store.axrp.net/ Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Video: https://www.youtube.com/watch?v=kmPFjpEibu0
2/7/2023, 2 minutes, 39 seconds

19 - Mechanistic Interpretability with Neel Nanda

How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking. Topics we discuss, and timestamps: 00:01:05 - What is mechanistic interpretability? 00:24:16 - Types of AI cognition 00:54:27 - Automating mechanistic interpretability 01:11:57 - Summarizing the papers 01:24:43 - 'A Mathematical Framework for Transformer Circuits' 01:39:31 - How attention works 01:49:26 - Composing attention heads 01:59:42 - Induction heads 02:11:05 - 'In-context Learning and Induction Heads' 02:12:55 - The multiplicity of induction heads 02:30:10 - Lines of evidence 02:38:47 - Evolution in loss-space 02:46:19 - Mysteries of in-context learning 02:50:57 - 'Progress measures for grokking via mechanistic interpretability' 02:50:57 - How neural nets learn modular addition 03:11:37 - The suddenness of grokking 03:34:16 - Relation to other research 03:43:57 - Could mechanistic interpretability possibly work? 
03:49:28 - Following Neel's research The transcript Links to Neel's things: Neel on Twitter Neel on the Alignment Forum Neel's mechanistic interpretability blog TransformerLens Concrete Steps to Get Started in Transformer Mechanistic Interpretability Neel on YouTube 200 Concrete Open Problems in Mechanistic Interpretability Comprehensive mechanistic interpretability explainer Writings we discuss: A Mathematical Framework for Transformer Circuits In-context Learning and Induction Heads Progress measures for grokking via mechanistic interpretability Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper") interpreting GPT: the logit lens Locating and Editing Factual Associations in GPT (aka the ROME paper) Human-level play in the game of Diplomacy by combining language models with strategic reasoning Causal Scrubbing An Interpretability Illusion for BERT Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models Collaboration & Credit Principles Transformer Feed-Forward Layers Are Key-Value Memories Multi-Component Learning and S-Curves The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks Linear Mode Connectivity and the Lottery Ticket Hypothesis
2/4/2023, 3 hours, 52 minutes, 47 seconds

New podcast - The Filan Cabinet

I have a new podcast, where I interview whoever I want about whatever I want. It's called "The Filan Cabinet", and you can find it wherever you listen to podcasts. The first three episodes are about pandemic preparedness, God, and cryptocurrency. For more details, check out the podcast website, or search "The Filan Cabinet" in your podcast app.
10/13/2022, 1 minute, 18 seconds

18 - Concept Extrapolation with Stuart Armstrong

Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending them sensibly to situations where things are different - like learning that the world works via special relativity, or seeing a picture of a novel sausage-bread combination. For a while, Stuart Armstrong has been thinking about concept extrapolation and how it relates to AI alignment. In this episode, we discuss where his thoughts are at on this topic, what the relationship to AI alignment is, and what the open questions are. Topics we discuss, and timestamps: 00:00:44 - What is concept extrapolation 00:15:25 - When is concept extrapolation possible 00:30:44 - A toy formalism 00:37:25 - Uniqueness of extrapolations 00:48:34 - Unity of concept extrapolation methods 00:53:25 - Concept extrapolation and corrigibility 00:59:51 - Is concept extrapolation possible? 01:37:05 - Misunderstandings of Stuart's approach 01:44:13 - Following Stuart's work The transcript Stuart's startup, Aligned AI Research we discuss: The Concept Extrapolation sequence The HappyFaces benchmark Goal Misgeneralization in Deep Reinforcement Learning
9/3/2022, 1 hour, 46 minutes, 19 seconds

17 - Training for Very High Reliability with Daniel Ziegler

Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to try doing this with present-day language models, and what they learned. Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'Fanfiction' section of the transcript). Topics we discuss, and timestamps: 00:00:40 - Summary of the paper 00:02:23 - Alignment as scalable oversight and catastrophe minimization 00:08:06 - Novel contributions 00:14:20 - Evaluating adversarial robustness 00:20:26 - Adversary construction 00:35:14 - The task 00:38:23 - Fanfiction 00:42:15 - Estimators to reduce labelling burden 00:45:39 - Future work 00:50:12 - About Redwood Research The transcript Daniel Ziegler on Google Scholar Research we discuss: Daniel's paper, Adversarial Training for High-Stakes Reliability Low-stakes alignment Red Teaming Language Models with Language Models Uncertainty Estimation for Language Reward Models Eliciting Latent Knowledge
8/21/2022, 1 hour, 59 seconds

16 - Preparing for Debate AI with Geoffrey Irving

Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 if you need a primer. But how do we get language models to the stage where they can usefully implement debate? In this episode, I talk to Geoffrey Irving about the role of language models in AI safety, as well as three projects he's done that get us closer to making debate happen: using language models to find flaws in themselves, getting language models to back up claims they make with citations, and figuring out how uncertain language models should be about the quality of various answers. Topics we discuss, and timestamps: 00:00:48 - Status update on AI safety via debate 00:10:24 - Language models and AI safety 00:19:34 - Red teaming language models with language models 00:35:31 - GopherCite 00:49:10 - Uncertainty Estimation for Language Reward Models 01:00:26 - Following Geoffrey's work, and working with him The transcript Geoffrey's twitter Research we discuss: Red Teaming Language Models With Language Models Teaching Language Models to Support Answers with Verified Quotes, aka GopherCite Uncertainty Estimation for Language Reward Models AI Safety via Debate Writeup: progress on AI safety via debate Eliciting Latent Knowledge Training Compute-Optimal Large Language Models, aka Chinchilla
7/1/2022, 1 hour, 4 minutes, 49 seconds

15 - Natural Abstractions with John Wentworth

Why does anybody care about natural abstractions? Do they somehow relate to math, or value learning? How do E. coli bacteria find sources of sugar? All these questions and more will be answered in this interview with John Wentworth, where we talk about his research plan of understanding agency via natural abstractions. Topics we discuss, and timestamps: 00:00:31 - Agency in E. Coli 00:04:59 - Agency in financial markets 00:08:44 - Inferring agency in real-world systems 00:16:11 - Selection theorems 00:20:22 - Abstraction and natural abstractions 00:32:42 - Information at a distance 00:39:20 - Why the natural abstraction hypothesis matters 00:44:48 - Unnatural abstractions used by humans? 00:49:11 - Probability, determinism, and abstraction 00:52:58 - Whence probabilities in deterministic universes? 01:02:37 - Abstraction and maximum entropy distributions 01:07:39 - Natural abstractions and impact 01:08:50 - Learning human values 01:20:47 - The shape of the research landscape 01:34:59 - Following John's work The transcript John on LessWrong Research that we discuss: Alignment by default - contains the natural abstraction hypothesis The telephone theorem Generalizing Koopman-Pitman-Darmois The plan Understanding deep learning requires rethinking generalization - deep learning can fit random data A closer look at memorization in deep networks - deep learning learns before memorizing Zero-shot coordination A new formalism, method, and open issues for zero-shot coordination Conservative agency via attainable utility preservation Corrigibility Errata: E. coli has ~4,400 genes, not 30,000. A typical adult human body has thousands of moles of water in it, and therefore must consist of well more than 10 moles total.
5/23/2022, 1 hour, 36 minutes, 30 seconds

14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Late last year, Vanessa Kosoy and Alexander Appel published some research under the heading of "Infra-Bayesian physicalism". But wait - what was infra-Bayesianism again? Why should we care? And what does any of this have to do with physicalism? In this episode, I talk with Vanessa Kosoy about these questions, and get a technical overview of how infra-Bayesian physicalism works and what its implications are. Topics we discuss, and timestamps: 00:00:48 - The basics of infra-Bayes 00:08:32 - An invitation to infra-Bayes 00:11:23 - What is naturalized induction? 00:19:53 - How infra-Bayesian physicalism helps with naturalized induction 00:19:53 - Bridge rules 00:22:22 - Logical uncertainty 00:23:36 - Open source game theory 00:28:27 - Logical counterfactuals 00:30:55 - Self-improvement 00:32:40 - How infra-Bayesian physicalism works 00:32:47 - World models 00:39:20 - Priors 00:42:53 - Counterfactuals 00:50:34 - Anthropics 00:54:40 - Loss functions 00:56:44 - The monotonicity principle 01:01:57 - How to care about various things 01:08:47 - Decision theory 01:19:53 - Follow-up research 01:20:06 - Infra-Bayesian physicalist quantum mechanics 01:26:42 - Infra-Bayesian physicalist agreement theorems 01:29:00 - The production of infra-Bayesianism research 01:35:14 - Bridge rules and malign priors 01:45:27 - Following Vanessa's work The transcript Vanessa on the Alignment Forum Research that we discuss: Infra-Bayesian physicalism: a formal theory of naturalized induction Updating ambiguous beliefs (contains the infra-Bayesian update rule) Functional Decision Theory: A New Theory of Instrumental Rationality Space-time embedded intelligence Attacking the grain of truth problem using Bayes-Savage agents (generating a simplicity prior with Knightian uncertainty using oracle machines) Quantity of experience: brain-duplication and degrees of consciousness (the thick wires argument) Online learning in unknown Markov games Agreeing to disagree (contains the Aumann agreement theorem)
What does the universal prior actually look like? (aka "the Solomonoff prior is malign") The Solomonoff prior is malign Eliciting Latent Knowledge ELK Thought Dump, by Abram Demski
4/5/2022, 1 hour, 47 minutes, 31 seconds

13 - First Principles of AGI Safety with Richard Ngo

How should we think about artificial general intelligence (AGI), and the risks it might pose? What constraints exist on technical solutions to the problem of aligning superhuman AI systems with human intentions? In this episode, I talk to Richard Ngo about his report analyzing AGI safety from first principles, and recent conversations he had with Eliezer Yudkowsky about the difficulty of AI alignment. Topics we discuss, and timestamps: 00:00:40 - The nature of intelligence and AGI 00:01:18 - The nature of intelligence 00:06:09 - AGI: what and how 00:13:30 - Single vs collective AI minds 00:18:57 - AGI in practice 00:18:57 - Impact 00:20:49 - Timing 00:25:38 - Creation 00:28:45 - Risks and benefits 00:35:54 - Making AGI safe 00:35:54 - Robustness of the agency abstraction 00:43:15 - Pivotal acts 00:50:05 - AGI safety concepts 00:50:05 - Alignment 00:56:14 - Transparency 00:59:25 - Cooperation 01:01:40 - Optima and selection processes 01:13:33 - The AI alignment research community 01:13:33 - Updates from the Yudkowsky conversation 01:17:18 - Corrections to the community 01:23:57 - Why others don't join 01:26:38 - Richard Ngo as a researcher 01:28:26 - The world approaching AGI 01:30:41 - Following Richard's work The transcript Richard on the Alignment Forum Richard on Twitter The AGI Safety Fundamentals course Materials that we mention: AGI Safety from First Principles Conversations with Eliezer Yudkowsky The Bitter Lesson Metaphors We Live By The Enigma of Reason Draft report on AI timelines, by Ajeya Cotra More is Different for AI The Windfall Clause Cooperative Inverse Reinforcement Learning Imitative Generalisation Eliciting Latent Knowledge Draft report on existential risk from power-seeking AI, by Joseph Carlsmith The Most Important Century
3/31/2022, 1 hour, 33 minutes, 53 seconds

12 - AI Existential Risk with Paul Christiano

Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christiano about his views of how AI could be so dangerous, what bad AI scenarios could look like, and what he thinks about various techniques to reduce this risk. Topics we discuss, and timestamps (due to mp3 compression, the timestamps may be tens of seconds off): 00:00:38 - How AI may pose an existential threat 00:13:36 - AI timelines 00:24:49 - Why we might build risky AI 00:33:58 - Takeoff speeds 00:51:33 - Why AI could have bad motivations 00:56:33 - Lessons from our current world 01:08:23 - "Superintelligence" 01:15:21 - Technical causes of AI x-risk 01:19:32 - Intent alignment 01:33:52 - Outer and inner alignment 01:43:45 - Thoughts on agent foundations 01:49:35 - Possible technical solutions to AI x-risk 01:49:35 - Imitation learning, inverse reinforcement learning, and ease of evaluation 02:00:34 - Paul's favorite outer alignment solutions 02:01:20 - Solutions researched by others 02:06:13 - Decoupling planning from knowledge 02:17:18 - Factored cognition 02:25:34 - Possible solutions to inner alignment 02:31:56 - About Paul 02:31:56 - Paul's research style 02:36:36 - Disagreements and uncertainties 02:46:08 - Some favorite organizations 02:48:21 - Following Paul's work The transcript Paul's blog posts on AI alignment Material that we mention: Cold Takes - The Most Important Century Open Philanthropy reports on: Modeling the human trajectory The computational power of the human brain AI timelines (draft) Whether AI could drive explosive economic growth Takeoff speeds Superintelligence: Paths, Dangers, Strategies Wei Dai on metaphilosophical competence: Two neglected problems in human-AI safety The argument from philosophical difficulty Some thoughts on metaphilosophy AI safety via debate Iterated distillation and amplification Scalable agent alignment via reward modeling: a research direction Learning the prior 
Imitative generalisation (AKA 'learning the prior') When is unaligned AI morally valuable?
12/2/2021, 2 hours, 49 minutes, 36 seconds

11 - Attainable Utility and Power with Alex Turner

Many scary stories about AI involve an AI system deceiving and subjugating humans in order to gain the ability to achieve its goals without us stopping it. This episode's guest, Alex Turner, will tell us about his research analyzing the notions of "attainable utility" and "power" that underlie these stories, so that we can better evaluate how likely they are and how to prevent them. Topics we discuss: Side effects minimization Attainable Utility Preservation (AUP) AUP and alignment Power-seeking Power-seeking and alignment Future work and about Alex The transcript Alex on the AI Alignment Forum Alex's Google Scholar page Conservative Agency via Attainable Utility Preservation Optimal Policies Tend to Seek Power Other works discussed: Avoiding Side Effects by Considering Future Tasks The "Reframing Impact" Sequence The "Risks from Learned Optimization" Sequence Concrete Approval-Directed Agents Seeking Power is Convergently Instrumental in a Broad Class of Environments Formalizing Convergent Instrumental Goals The More Power at Stake, the Stronger Instrumental Convergence Gets for Optimal Policies Problem Relaxation as a Tactic How I do Research Math that Clicks: Look for Two-way Correspondences Testing the Natural Abstraction Hypothesis
9/25/2021, 1 hour, 27 minutes, 36 seconds

10 - AI's Future and Impacts with Katja Grace

When going about trying to ensure that AI does not cause an existential catastrophe, it's likely important to understand how AI will develop in the future, and why exactly it might or might not cause such a catastrophe. In this episode, I interview Katja Grace, researcher at AI Impacts, who's done work surveying AI researchers about when they expect superhuman AI to be reached, collecting data about how rapidly AI tends to progress, and thinking about the weak points in arguments that AI could be catastrophic for humanity. Topics we discuss: 00:00:34 - AI Impacts and its research 00:08:59 - How to forecast the future of AI 00:13:33 - Results of surveying AI researchers 00:30:41 - Work related to forecasting AI takeoff speeds 00:31:11 - How long it takes AI to cross the human skill range 00:42:47 - How often technologies have discontinuous progress 00:50:06 - Arguments for and against fast takeoff of AI 01:04:00 - Coherence arguments 01:12:15 - Arguments that AI might cause existential catastrophe, and counter-arguments 01:13:58 - The size of the super-human range of intelligence 01:17:22 - The dangers of agentic AI 01:25:45 - The difficulty of human-compatible goals 01:33:54 - The possibility of AI destroying everything 01:49:42 - The future of AI Impacts 01:52:17 - AI Impacts vs academia 02:00:25 - What AI x-risk researchers do wrong 02:01:43 - How to follow Katja's and AI Impacts' work The transcript "When Will AI Exceed Human Performance? Evidence from AI Experts" AI Impacts page of more complete survey results Likelihood of discontinuous progress around the development of AGI Discontinuous progress investigation The range of human intelligence
7/23/2021, 2 hours, 2 minutes, 58 seconds

9 - Finite Factored Sets with Scott Garrabrant

Being an agent can get loopy quickly. For instance, imagine that we're playing chess and I'm trying to decide what move to make. Your next move influences the outcome of the game, and my guess of that influences my move, which influences your next move, which influences the outcome of the game. How can we model these dependencies in a general way, without baking in primitive notions of 'belief' or 'agency'? Today, I talk with Scott Garrabrant about his recent work on finite factored sets that aims to answer this question. Topics we discuss: 00:00:43 - finite factored sets' relation to Pearlian causality and abstraction 00:16:00 - partitions and factors in finite factored sets 00:26:45 - orthogonality and time in finite factored sets 00:34:49 - using finite factored sets 00:37:53 - why not infinite factored sets? 00:45:28 - limits of, and follow-up work on, finite factored sets 01:00:59 - relevance to embedded agency and x-risk 01:10:40 - how Scott researches 01:28:34 - relation to Cartesian frames 01:37:36 - how to follow Scott's work Link to the transcript Link to a transcript of Scott's talk on finite factored sets Scott's LessWrong account Other work mentioned in the discussion: Causality, by Judea Pearl Scott's work on Cartesian frames
6/24/2021
1 hour, 38 minutes, 59 seconds

8 - Assistance Games with Dylan Hadfield-Menell

How should we think about the technical problem of building smarter-than-human AI that does what we want? When and how should AI systems defer to us? Should they have their own goals, and how should those goals be managed? In this episode, Dylan Hadfield-Menell talks about his work on assistance games that formalizes these questions. The first couple years of my PhD program included many long conversations with Dylan that helped shape how I view AI x-risk research, so it was great to have another one in the form of a recorded interview.

Link to the transcript
Link to the paper "Cooperative Inverse Reinforcement Learning"
Link to the paper "The Off-Switch Game"
Link to the paper "Inverse Reward Design"
Dylan's twitter account
Link to apply to the MIT EECS graduate program

Other work mentioned in the discussion:
The original paper on inverse optimal control
Justin Fu's research on, among other things, adversarial IRL
Preferences implicit in the state of the world
What are you optimizing for? Aligning recommender systems with human values
The Assistive Multi-Armed Bandit
Soares et al. on Corrigibility
Should Robots be Obedient?
Rodney Brooks on the Seven Deadly Sins of Predicting the Future of AI
Products in category theory
AXRP Episode 7 - Side Effects with Victoria Krakovna
Attainable Utility Preservation
Penalizing side effects using stepwise relative reachability
Simplifying Reward Design through Divide-and-Conquer
Active Inverse Reward Design
An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning
Incomplete Contracting and AI Alignment
Multi-Principal Assistance Games
Consequences of Misaligned AI
6/8/2021
2 hours, 23 minutes, 17 seconds

7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra

If you want to shape the development and forecast the consequences of powerful AI technology, it's important to know when it might appear. In this episode, I talk to Ajeya Cotra about her draft report "Forecasting Transformative AI from Biological Anchors" which aims to build a probabilistic model to answer this question. We talk about a variety of topics, including the structure of the model, what the most important parts are to get right, how the estimates should shape our behaviour, and Ajeya's current work at Open Philanthropy and perspective on the AI x-risk landscape. Unfortunately, there was a problem with the recording of our interview, so we weren't able to release it in audio form, but you can read a transcript of the whole conversation.

Link to the transcript
Link to the draft report "Forecasting Transformative AI from Biological Anchors"
5/28/2021
1 minute, 3 seconds

7 - Side Effects with Victoria Krakovna

One way AI might pose an existential threat is by taking drastic actions to maximize its achievement of some objective function, such as taking control of the power supply or the world's computers. This might suggest a mitigation strategy of minimizing the degree to which AI systems have large effects on the world that are not absolutely necessary for achieving their objective. In this episode, Victoria Krakovna talks about her research on quantifying and minimizing side effects. Topics discussed include how one goes about defining side effects and the difficulties in doing so, her work using relative reachability and the ability to achieve future tasks as side effects measures, and what she thinks the open problems and difficulties are.

Link to the transcript
Link to the paper "Penalizing Side Effects Using Stepwise Relative Reachability"
Link to the paper "Avoiding Side Effects by Considering Future Tasks"
Victoria Krakovna's website
Victoria Krakovna's Alignment Forum profile

Work mentioned in the episode:
Rohin Shah on the difficulty of finding a value-agnostic impact measure
Stuart Armstrong's bucket of water example
Attainable Utility Preservation
Low Impact Artificial Intelligences
AI Safety Gridworlds
Test Cases for Impact Regularisation Methods
SafeLife
Avoiding Side Effects in Complex Environments
5/14/2021
1 hour, 19 minutes, 29 seconds

6 - Debate and Imitative Generalization with Beth Barnes

One proposal to train AIs that can be useful is to have ML models debate each other about the answer to a human-provided question, where the human judges which side has won. In this episode, I talk with Beth Barnes about her thoughts on the pros and cons of this strategy, what she learned from seeing how humans behaved in debate protocols, and how a technique called imitative generalization can augment debate. Those who are already quite familiar with the basic proposal might want to skip past the explanation of debate to 13:00, "what problems does it solve and does it not solve".

Link to Beth's posts on the Alignment Forum
Link to the transcript
4/8/2021
1 hour, 58 minutes, 48 seconds

5 - Infra-Bayesianism with Vanessa Kosoy

The theory of sequential decision-making has a problem: how can we deal with situations where we have some hypotheses about the environment we're acting in, but its exact form might be outside the range of possibilities we can possibly consider? Relatedly, how do we deal with situations where the environment can simulate what we'll do in the future, and put us in better or worse situations now depending on what we'll do then? Today's episode features Vanessa Kosoy talking about infra-Bayesianism, the mathematical framework she developed with Alex Appel that modifies Bayesian decision theory to succeed in these types of situations.

Link to the sequence of posts - Infra-Bayesianism
Link to the transcript
Vanessa Kosoy's Alignment Forum profile
3/10/2021
1 hour, 23 minutes, 51 seconds

4 - Risks from Learned Optimization with Evan Hubinger

In machine learning, typically optimization is done to produce a model that performs well according to some metric. Today's episode features Evan Hubinger talking about what happens when the learned model itself is doing optimization in order to perform well, how the goals of the learned model could differ from the goals we used to select the learned model, and what would happen if they did differ.

Link to the paper - Risks from Learned Optimization in Advanced Machine Learning Systems
Link to the transcript
Evan Hubinger's Alignment Forum profile
2/17/2021
2 hours, 13 minutes, 32 seconds

3 - Negotiable Reinforcement Learning with Andrew Critch

In this episode, I talk with Andrew Critch about negotiable reinforcement learning: what happens when two people (or organizations, or what have you) who have different beliefs and preferences jointly build some agent that will take actions in the real world. In the paper we discuss, it's proven that the only way to make such an agent Pareto optimal - that is, to ensure there is no different agent that both people would prefer to use instead - is to have it preferentially optimize the preferences of whichever person's beliefs were more accurate. We discuss his motivations for working on the problem and what he thinks about it.

Link to the paper - Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making
Link to the transcript
Critch's Google Scholar profile
12/11/2020
58 minutes, 14 seconds

2 - Learning Human Biases with Rohin Shah

One approach to creating useful AI systems is to watch humans doing a task, infer what they're trying to do, and then try to do that well. The simplest way to infer what the humans are trying to do is to assume there's one goal that they share, and that they're optimally achieving the goal. This has the problem that humans aren't actually optimal at achieving the goals they pursue. We could instead code in the exact way in which humans behave suboptimally, except that we don't know that either. In this episode, I talk with Rohin Shah about his paper about learning the ways in which humans are suboptimal at the same time as learning what goals they pursue: why it's hard, how he tried to do it, how well he did, and why it matters.

Link to the paper - On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference
Link to the transcript
The Alignment Newsletter
Rohin's contributions to the AI alignment forum
Rohin's website
12/11/2020
1 hour, 8 minutes, 51 seconds

1 - Adversarial Policies with Adam Gleave

In this episode, Adam Gleave and I talk about adversarial policies. Basically, in current reinforcement learning, people train agents that act in some kind of environment, sometimes an environment that contains other agents. For instance, you might train agents that play sumo with each other, with the objective of making them generally good at sumo. Adam's research looks at the case where all you're trying to do is make an agent that defeats one specific other agent: how easy is it, and what happens? He discovers that often, you can do it pretty easily, and your agent can behave in a very silly-seeming way that nevertheless happens to exploit some 'bug' in the opponent. We talk about the experiments he ran, the results, and what they say about how we do reinforcement learning.

Link to the paper - Adversarial Policies: Attacking Deep Reinforcement Learning
Link to the transcript
Adam's website
Adam's twitter account
12/11/2020
58 minutes, 41 seconds