Some AI systems can directly interact with the outside world, physical or virtual, without human users in between. Such a sensorimotor mechanism is also a necessary front end for a language interface.
Perception forms various levels of abstraction from sensation and integrates them with the system's knowledge, so as to guide the system's actions toward its goals in a changing environment.
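To make this layered picture concrete, here is a minimal sketch of such a pipeline: raw sensation is abstracted into a feature, the feature is integrated with stored knowledge into a percept, and the percept selects a goal-directed action. All names and values here are hypothetical illustrations, not part of any particular system.

```python
# Minimal sketch of perception as layered abstraction (all names hypothetical).

def sense():
    """Raw sensation: noisy brightness readings from a light sensor array."""
    return [0.9, 0.8, 0.1, 0.2]            # left half bright, right half dark

def abstract(signal):
    """First abstraction level: summarize raw values into one feature."""
    left, right = signal[:2], signal[2:]
    return sum(left) / 2 - sum(right) / 2  # positive => light is on the left

def perceive(feature, knowledge):
    """Integrate the feature with the system's knowledge into a percept."""
    return "light_left" if feature > knowledge["threshold"] else "light_right"

def act(percept):
    """Map the percept to an action that serves the system's goal."""
    return {"light_left": "turn_left", "light_right": "turn_right"}[percept]

knowledge = {"threshold": 0.0}
action = act(perceive(abstract(sense()), knowledge))
```

In a real system each of these stages would itself be layered and learned; the point of the sketch is only the flow from sensation through abstraction to action.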
AI research on perception has focused on vision and sound. A major challenge is to choose proper features for each level of abstraction. Initially, the features at each level were selected by the designers; in computer vision, an influential approach of this kind was proposed by David Marr.
Deep learning fundamentally changed this into an approach of "feature learning", in which the features are generated and selected by a learning algorithm according to their contribution to the overall task. In a Convolutional Neural Network (CNN), convolution kernels are applied to the input to generate feature maps, which are further abstracted by subsequent layers. Trained end-to-end using backpropagation, CNNs recognize patterns well on ImageNet-scale data.
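The basic operation behind a feature map can be shown in a few lines. The following sketch applies a single hand-set 2x2 kernel to a toy image; in a CNN the kernel values would be learned by backpropagation rather than written by hand, and many kernels would run in parallel to produce a stack of feature maps.

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = [[0.0] * (w - kw + 1) for _ in range(h - kh + 1)]
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            s = 0.0
            for di in range(kh):
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = s
    return out

def relu(fmap):
    """Element-wise ReLU, the usual nonlinearity after a convolution."""
    return [[max(0.0, v) for v in row] for row in fmap]

# Toy image: left half dark (0), right half bright (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
# A hand-set vertical-edge detector; a CNN would learn such kernels.
kernel = [[-1, 1],
          [-1, 1]]
fmap = relu(conv2d(image, kernel))   # responds only at the dark/bright edge
```

The resulting feature map is nonzero only along the vertical edge between the dark and bright halves, which is exactly the kind of local feature a first convolutional layer tends to learn.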
A brain-inspired approach to vision is Hierarchical Temporal Memory (HTM), which combines ideas including sparse distributed memory, Bayesian networks, and spatio-temporal clustering.
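A core ingredient of HTM is the sparse distributed representation (SDR): a pattern is a small set of active bits out of a large space, and similarity is measured by the number of shared active bits. The sketch below illustrates this with illustrative parameters (2048 bits, 40 active, roughly the 2% sparsity often cited for HTM); the exact numbers are assumptions for the example.

```python
import random

SIZE, ACTIVE = 2048, 40   # illustrative HTM-style parameters: ~2% bits active

def random_sdr(seed):
    """A sparse distributed representation: the set of active bit indices."""
    return set(random.Random(seed).sample(range(SIZE), ACTIVE))

def overlap(a, b):
    """Similarity between two SDRs = number of shared active bits."""
    return len(a & b)

a = random_sdr(1)
b = random_sdr(2)

# Corrupt a few bits of `a`: overlap degrades gracefully, a key SDR property,
# while two unrelated random SDRs share almost no active bits.
kept = set(sorted(a)[:ACTIVE - 5])
noisy_a = kept | set(random.Random(3).sample(range(SIZE), 5))
```

Because the space is large and the patterns sparse, unrelated representations barely collide, while a noisy copy of a pattern remains easily recognizable, which is what makes such codes robust for memory and recognition.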
In the processing of spoken language, deep learning has also greatly improved the quality of speech-text mapping in both directions (speech recognition and speech synthesis), which adds one more stage to natural language processing.
AI has also been applied experimentally to music and art, both in perception and in composition/creation. Recently, AI-Generated Content (AIGC) has attracted wide interest and attention.
From AI's perspective, the key issue in robots and agents is not their programmed or controlled behavior but their learned behavior, especially in changing environments. Various ideas have been explored: