Discussion about this post

David Rein:

Thanks for the post! One objection I think is potentially important concerns the relative rate of improvement in alignment versus other capabilities. While I agree that we'll be able to use protocols like Debate/IDA/RRM to help us align AI that is helping with alignment work, my concern is that the alignment work will "lag" behind the capabilities. If alignment is always lagging capabilities, then once your system is powerful enough, you won't be able to control it well. I'm curious how you think about the relative rate of progress in alignment vs. capabilities.

Roman V. Yampolskiy:

Hey Jan,

My Uncontrollability paper is long and addresses four different types of control. "Disobey" applies only to direct control (giving orders), which is not alignment, and everyone agrees it will not work, so I don't think we disagree on this point.

The paper also explicitly says, in regard to Rice's theorem, that "AI safety researchers [36] correctly argue that we do not have to deal with an arbitrary AI, as if gifted to us by aliens, but rather we can design a particular AI with the safety properties we want." So once again I think we are in agreement.
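For readers who want the background, the standard statement of the theorem being referenced (general context, not specific to either paper) is: for any non-trivial semantic property $P$ of partial computable functions, the index set

$\{\, e \in \mathbb{N} : \varphi_e \in P \,\}$ is undecidable.

That is, no algorithm can decide the property for an arbitrary program handed to us; it does not by itself rule out constructing a particular program together with a proof that it satisfies the property.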

I also read your blog post on formal verification and have a published paper on some of the challenges you are describing: https://iopscience.iop.org/article/10.1088/1402-4896/aa7ca8/meta. It looks to me like we are starting from very similar initial conditions and correctly identifying numerous challenges, but for some reason we arrive at very different predictions regarding our ability to solve all such problems (see https://dl.acm.org/doi/10.1145/3603371 for a recent survey), especially in the next four years.

I honestly hope I am wrong and you are right, but so far I am struggling to find any evidence of sufficient progress.

Best,

Roman

