Multi-domain operations, the Army's future operating concept, will require autonomous agents with learning components to operate alongside the warfighter. New Army research reduces the unpredictability of learned policies in order to make current training methods more practically applicable to physical systems, especially ground robots.
These learning components will allow autonomous agents to reason and adapt to changing battlefield conditions, said Army researcher Alec Koppel of the U.S. Army Combat Capabilities Development Command, now known as DEVCOM, Army Research Laboratory.
The mechanism for this adaptation and re-planning is reinforcement-learning-based policies. Making these policies efficiently obtainable is key to making the MDO operating concept a reality, he said.
According to Koppel, policy gradient methods in reinforcement learning are the foundation of scalable algorithms for continuous spaces, but existing techniques cannot incorporate broader decision-making goals such as risk sensitivity, safety constraints, exploration, and the use of prior information.
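The distinction can be sketched concretely (a hypothetical illustration, not the researchers' algorithm): classical reinforcement learning maximizes cumulative reward, which is a *linear* function of the policy's discounted state-action occupancy measure, whereas a "general utility" can be any concave function of that measure, such as its entropy, which rewards exploration. The toy MDP, its transition probabilities, and the finite-difference gradient ascent below are all invented for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented for illustration.
gamma = 0.9
P = np.zeros((2, 2, 2))                   # P[s, a, s'] transition probabilities
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]
P[1, 1] = [0.1, 0.9]
r = np.array([[1.0, 0.0], [0.0, 2.0]])    # r[s, a] rewards
mu0 = np.array([1.0, 0.0])                # initial state distribution

def occupancy(theta):
    """Discounted state-action occupancy measure of a softmax policy."""
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)              # induced state chain
    d = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu0)
    return d[:, None] * pi                             # lambda[s, a], sums to 1

def linear_utility(lam):
    # Classical objective: expected cumulative reward, linear in lambda.
    return (lam * r).sum()

def entropy_utility(lam):
    # A general (concave) utility: occupancy entropy, encouraging exploration.
    return -(lam * np.log(lam + 1e-12)).sum()

def ascend(utility, steps=200, lr=1.0, eps=1e-5):
    """Gradient ascent on U(lambda(theta)) via forward finite differences."""
    theta = np.zeros((2, 2))
    for _ in range(steps):
        base = utility(occupancy(theta))
        g = np.zeros_like(theta)
        for i in np.ndindex(*theta.shape):
            t = theta.copy(); t[i] += eps
            g[i] = (utility(occupancy(t)) - base) / eps
        theta += lr * g
    return theta

for name, U in [("reward ", linear_utility), ("entropy", entropy_utility)]:
    lam = occupancy(ascend(U))
    print(name, "occupancy:", np.round(lam.ravel(), 3))
```

The reward-maximizing policy concentrates the occupancy measure on high-reward state-action pairs, while the entropy utility spreads it out; the point is that both are instances of one policy search problem over utilities of the occupancy measure.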
Designing autonomous behaviors when the relationship between dynamics and goals is complex can be addressed with reinforcement learning, which has recently gained attention for solving previously intractable tasks such as Go, chess, and video games like Atari and StarCraft II, Koppel said.
Prevailing practice, unfortunately, demands astronomical sample complexity, such as thousands of years of simulated gameplay, he said. This sample complexity renders many common training mechanisms inapplicable to the data-starved settings required by MDO contexts such as the Next Generation Combat Vehicle, or NGCV.
“To facilitate reinforcement learning for MDO and NGCV, training mechanisms must improve sample efficiency and reliability in continuous spaces,” Koppel said. “By generalizing existing policy search schemes to general utilities, we take a step towards breaking the sample-efficiency barriers of prevailing practice in reinforcement learning.”
Koppel and his research team developed new policy search schemes for general utilities, whose sample complexity is also established. They observed that the resulting policy search schemes reduce the volatility of reward accumulation, yield efficient exploration of unknown domains, and provide a mechanism for incorporating prior experience.
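One way to see the reduced-volatility claim in miniature (a hypothetical sketch using a standard mean-variance objective, not the schemes developed by the team): penalizing the variance of the return inside the utility steers a score-function (REINFORCE) gradient estimator toward actions with steadier payoffs. The two-armed bandit, its payoff parameters, and the naive per-sample variance penalty below are invented for illustration; the estimator is a common but biased approximation.

```python
import numpy as np

# Hypothetical two-armed bandit: arm 0 pays more on average but is volatile,
# arm 1 pays slightly less but is steady.
MEANS = np.array([1.0, 0.9])
STDS = np.array([2.0, 0.1])

def train(beta, steps=3000, lr=0.05, batch=64, seed=0):
    """REINFORCE on the utility U = mean(return) - beta * var(return)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                        # softmax action preferences
    for _ in range(steps):
        p = np.exp(theta) / np.exp(theta).sum()
        a = rng.choice(2, size=batch, p=p)     # sample a batch of actions
        ret = rng.normal(MEANS[a], STDS[a])    # noisy one-step returns
        # Per-sample utility: reward minus a (naive) variance penalty.
        u = ret - beta * (ret - ret.mean()) ** 2
        grad = np.zeros(2)
        for k in range(2):
            # Score-function gradient with a mean baseline:
            # d/d(theta_k) log pi(a) = 1[a == k] - p[k] for a softmax policy.
            grad[k] = np.mean(((a == k) - p[k]) * (u - u.mean()))
        theta += lr * grad
    return np.exp(theta) / np.exp(theta).sum()

print("risk-neutral policy:", np.round(train(beta=0.0), 2))  # tends to arm 0
print("risk-averse  policy:", np.round(train(beta=1.0), 2))  # tends to arm 1
```

With no variance penalty the learner chases the higher-mean but volatile arm; with the penalty it settles on the reliable one, i.e., the accumulated reward becomes less volatile at a small cost in expectation.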
“This research contributes an augmentation of the classical Policy Gradient Theorem in reinforcement learning,” Koppel said. “It presents new policy search schemes for general utilities, whose sample complexity is also established.”
Notably, in the context of ground robots, he said, data is costly to obtain.
“Reducing the volatility of reward accumulation, ensuring that an unknown domain is explored efficiently, and incorporating prior experience all contribute towards breaking the sample-efficiency barriers of prevailing practice in reinforcement learning by alleviating the amount of random sampling required,” he said.
The future of this research is bright, and Koppel has dedicated his efforts to making his findings applicable to innovative battlefield technology for Soldiers.
“I am optimistic that reinforcement-learning-equipped autonomous robots will be able to assist the warfighter in exploration, reconnaissance and risk assessment on the future battlefield,” Koppel said. “Making that vision a reality is central to how I choose which research problems to pursue.”
The next step in this research is to incorporate the broader decision-making goals enabled by general utilities into reinforcement learning in multi-agent settings, and to investigate how interactions between reinforcement learning agents give rise to synergistic and antagonistic reasoning among teams.
According to Koppel, the technology that results from this research will be capable of reasoning under uncertainty in team scenarios.
Reference: “Variational Policy Gradient Method for Reinforcement Learning with General Utilities” by Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari and Mengdi Wang, NeurIPS Proceedings.
This research, conducted in collaboration with Princeton University, the University of Alberta and Google DeepMind, was featured at NeurIPS 2020, one of the premier conferences for the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects.