AIs could distort their results
Some AIs make choices or learn based on reinforcements given by a “reward” in a process called reinforcement learning where software decides how to maximize such reward. However, this reinforcement could lead to dangerous results.
The pathologist William Thompson originally considered what is now known as the reinforcement learning problem in 1933. Given two untested therapies and a population of patients, he wondered how to cure the most patients. For Thompson, choosing a course of therapy was the action, and a patient cured was the reward.
The reinforcement learning problem more broadly concerns how to arrange your behaviors to optimally gain rewards over the long run. The difficulty is that you are first unaware of how your actions affect rewards, but over time you become aware.
As explained in this article, Computer scientists began attempting to create algorithms to address reinforcement learning issues in a variety of contexts as soon as computers were invented. The idea is that if the artificial “reinforcement learning agent” only receives rewards when it follows our instructions, the actions it learns to take that maximize rewards would help us achieve our goals.
However, when these systems become more powerful, they are likely to begin acting against the interests of people. Not because they would receive the incorrect rewards at the incorrect times from wicked or dumb reinforcement learning operators but because any sufficiently powerful reinforcement learning system, assuming it meets a few reasonable assumptions, is likely to fail. Let’s start with a very basic reinforcement learning system to see why.
Imagine we have a box that gives us a score between 0 and 1 which is the output of the algorithm, and a camera as an input to provide this number to a reinforcement learning agent, and we ask the agent to choose activities that will increase the number. The agent must be aware of how its activities impact its rewards in order to choose actions that will maximize those rewards.
Once it starts, the agent should notice that previous rewards have always matched the numbers of the output. It should also be aware that the numbers from the input matched the previous rewards. So, will future rewards equal the amount from the input or output?
An experiment would be placing a test item between these two options to make the agent recognize the difference between the past and next reward. Then the agent will focus on the input.
But why would a reinforcement learning algorithm put us at risk?
The agent will always work to make it more likely that the input will capture a 1. Therefore, the agent would force the way the reward can be achieved rather than pursuing the intended goal for which the algorithm is used.
It would sacrifice the goal for the reward rather than aiming for the goal through the reward. Therefore, the algorithm may sacrifice resources and/or goals only to increase its reward.