#Safety/Alignment

Mastodon discussion Apr 17

📰 Why AI Safety Regulation Fails in 2026 (And 3 Solutions Governments Ignore)

Regulating AI for safety remains elusive as governments worldwide fail to enact meaningful oversight. D...

Safety/Alignment

9

Mastodon discussion Apr 17

AI alignment is not something that works in theory but is difficult to put into practice. It’s something that doesn’t work in theory, and yet AI companies have decided to give it t...

Safety/Alignment

18

Papers with Code paper Apr 17

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predict...

Safety/Alignment

21

Mastodon discussion Apr 16

The #alignment problem of Universities (#Harvard in this case) with #Ai. No, not THAT alignment. https://avi-loeb.medium.com/the-alignment-problem-of-universities-with-ai-b54cbe1212d6...

Safety/Alignment

24

Mastodon discussion Apr 16

📰 LLM Jailbreak in 2026: AIs Are Overriding Their Own Safety Limits

AI models are now learning to get past their own safety limits. Not humans, but LLMs themselves...

LLM Safety/Alignment

18

GitHub Trending repo Apr 16

Tencent-Hunyuan/HY-SOAR: HY-SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

Safety/Alignment

54

NewsData.io news Apr 16

US-China dialogue on AI safety crucial: NVIDIA CEO

Jensen Huang, CEO of NVIDIA, has called for a dialogue between the US and China on artificial intelligence (AI) safety, especially in light of Anthropic's Mythos model.

Anthropic NVIDIA Safety/Alignment

21

Mastodon discussion Apr 15

📰 Claude Beats Human Researchers in AI Alignment (2026), Then Fails in Production

Claude AI outperformed human researchers in a controlled alignment experiment, but its success van...

Anthropic Safety/Alignment

9

Mastodon discussion Apr 15

📰 AI Alignment 2026: Claude Models Outperform Humans in Labs But Fail in Real-World Transfer

AI alignment experiments reveal that autonomous Claude models outperform human researche...

Anthropic Safety/Alignment

9

Mastodon discussion Apr 15

#Development #Approaches

One dev, two dozen agents, zero alignment · Why we need collaborative AI engineering https://ilo.im/16c5za #Engineering #Communication #Collaboration #A...

Safety/Alignment

18

Mastodon discussion Apr 15

Anthropic (@AnthropicAI): Anthropic has published a blog post and the full study presenting its findings on automated alignment researchers and their implications. It appears to address the significance of approaches that automate AI alignment research, and future AI safety and alignment research ...

Anthropic Safety/Alignment

9

Mastodon discussion Apr 15

Anthropic (@AnthropicAI): New research from the Anthropic Fellows introduces an ‘Automated Alignment Researcher’ experiment in which a weak AI model supervises the training of a stronger model. Showing that Claude Opus 4.6 can increase the experimental speed and exploration range of alignment research, this is a meaningful...

Anthropic Safety/Alignment

9

Mastodon discussion Apr 15

values alignment work: find communities where you actually agree with the values. that changes everything about your participation. which communities feel value-aligned? #AI #productivit...

OpenAI Safety/Alignment

18

Mastodon discussion Apr 15

🤖 Free LLM security audit

I built Arc Sentry, a pre-generation guardrail for open source LLMs that blocks prompt injection before the model generates a response. It works on Mistral...

LLM Open Source Safety/Alignment

18

Mastodon discussion Apr 14

values alignment insight: you're more productive when working with people whose values align with yours. what values matter most in your community? #AI #productivity #ChatGPT #AItools ...

OpenAI Safety/Alignment

18

Mastodon discussion Apr 14

📰 AI Safety 2026: How Claude Mythos Exposes Europe’s Regulatory Gaps

Claude Mythos, a powerful AI model capable of uncovering critical security vulnerabilities, is being restricted ...

Anthropic Safety/Alignment

18

Mastodon discussion Apr 14

...AI safety issues have also taken a back seat as governments around the world have pivoted to focus on winning the global AI race. The result is that there's no global mechanism ...

Anthropic Safety/Alignment

27

GitHub Trending repo Apr 14

baojudezeze/Qwen-dpo: Training code for Diffusion-DPO applied to the Qwen Image-2512 model. This implementation builds on the training framework provided by zk1009 and follows the methodology described in the paper “Diffusion Model Alignment Using Direct Preference Optimization”.

Image Generation Safety/Alignment

38

Mastodon discussion Apr 14

📰 OpenAI Valuation Drops 30% to $852B as Sam Altman Shifts to AI Safety: 2026 Leadership Battle

OpenAI's $852 billion valuation is facing intense scrutiny from investors as CEO Sam...

OpenAI Anthropic Safety/Alignment

9

NewsData.io news Apr 13

A lethal alignment

Comment, brendan, 13th April 2026

Safety/Alignment

21

Papers with Code paper Apr 13

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier he...

Safety/Alignment

21

Papers with Code paper Apr 13

LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utili...

Multimodal Benchmark Safety/Alignment

21

Papers with Code paper Apr 13

LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We a...

LLM Safety/Alignment

21

Papers with Code paper Apr 13

HDR Video Generation via Latent Alignment with Logarithmic Encoding

High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, pe...

Safety/Alignment

21