Perception presumes sensation, in which sensors of various types each convert a certain type of simple signal into data within the system. Putting these data together and making sense of them is the job of the perception mechanism.
Perception can be seen as a special type of categorization (or classification, pattern recognition) where the inputs are sensory data, and the outputs are categorical judgments and conceptual relations.
The difficulty of the task comes from the need for multiple levels of abstraction, where the relations among data items are many-to-many, uncertain, and changing over time.
Strictly speaking, we never "see things as they are", and the perception process of an intelligent system is often (and should be) influenced by internal and external factors besides the signals themselves. Furthermore, perception is not a purely passive process driven by the input.
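The influence of factors other than the signal can be illustrated with a small sketch. It assumes a simple Bayesian combination of bottom-up evidence with a context-dependent prior, which is only one possible way to model such influence and is not taken from the text. The function name `perceive`, the category labels, and all numbers are hypothetical.

```python
def perceive(likelihood, context_prior):
    """Combine signal evidence with a context prior; return (best label, posterior)."""
    unnormalized = {c: likelihood[c] * context_prior[c] for c in likelihood}
    total = sum(unnormalized.values())
    posterior = {c: v / total for c, v in unnormalized.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

# Hypothetical ambiguous glyph: the signal alone favours neither reading.
glyph_likelihood = {"letter B": 0.5, "number 13": 0.5}

# Hypothetical context priors (made-up numbers for illustration).
among_letters = {"letter B": 0.9, "number 13": 0.1}   # e.g. seen between "A" and "C"
among_numbers = {"letter B": 0.1, "number 13": 0.9}   # e.g. seen between "12" and "14"

print(perceive(glyph_likelihood, among_letters)[0])   # prints: letter B
print(perceive(glyph_likelihood, among_numbers)[0])   # prints: number 13
```

The input data are identical in both calls; only the context differs, and so does the resulting categorical judgment.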
In AI, the study of perception has mostly focused on reproducing human perception, especially the perception of aural and visual signals. This focus is not a necessity, however, since the perception mechanism of a computer system does not have to be identical to that of a human being.
There are several approaches toward speech recognition:
The acoustic-phonetic approach postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language and that these units are broadly characterized by a set of acoustic properties. Even though the acoustic properties of phonetic units are highly variable, both with speakers and with neighboring sounds, it is assumed in the acoustic-phonetic approach that the rules governing the variability are straightforward and can be readily learned.
The pattern-matching approach represents a speech pattern in the form of a mathematical model. The unknown speech (the speech to be recognized) is compared directly with each possible pattern learned in the training stage in order to determine the identity of the unknown.
The artificial intelligence approach attempts to do speech recognition using various AI techniques, such as knowledge-based systems or neural networks.
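The comparison step of the pattern-matching approach can be sketched as follows. This is a minimal illustration, assuming that each learned pattern is stored as a template sequence of one-dimensional feature values and that dynamic time warping (DTW) is used as the distance. Both choices are assumptions made for the example, not the method prescribed above. The names `dtw_distance`, `recognize`, and the toy templates are hypothetical.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(unknown, templates):
    """Return the label of the stored template closest to the unknown sequence."""
    return min(templates, key=lambda label: dtw_distance(unknown, templates[label]))

# Hypothetical templates "learned in the training stage" (made-up values).
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 2.0, 4.0],
}

# The same contour as "yes", but stretched in time.
unknown = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]
print(recognize(unknown, templates))  # prints: yes
```

DTW is used here because it tolerates differences in speaking rate. A real recognizer would compare multi-dimensional acoustic feature vectors rather than single numbers.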
A related topic is speech synthesis, that is, the translation of text into speech. After text analysis pre-processes the input (digit sequences, abbreviations, etc.), the pronunciations of most ordinary words and proper names are determined by dictionary-based methods. Finally, other methods handle post-processing (prosodic phrasing, word accentuation, sentence intonation) and the actual speech synthesis. A major remaining problem is naturalness, especially adjustments related to context and meaning (emotion, stress, tone, ...). Fully solving this problem probably requires understanding the meaning of the message and the purpose of the speech.
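The first two stages of such a pipeline, text pre-processing and dictionary-based pronunciation lookup, can be sketched as follows. This is a toy illustration under stated assumptions: the tables `ABBREVIATIONS`, `DIGIT_NAMES`, and `LEXICON`, the function names, and the phoneme strings are all hypothetical, hand-written entries and are not taken from any real lexicon. Prosodic post-processing and waveform generation are not covered.

```python
import re

# Hypothetical toy tables; a real system would use far larger resources.
ABBREVIATIONS = {"dr.": "doctor", "no.": "number"}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
LEXICON = {
    "doctor": "D AA K T ER",
    "number": "N AH M B ER",
    "two": "T UW",
    "five": "F AY V",
}

def normalize(text):
    """Pre-process text: expand abbreviations and read digit sequences digit by digit."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token):
            words.extend(DIGIT_NAMES[int(ch)] for ch in token)
        else:
            # Strip punctuation from ordinary words.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]

def pronounce(words):
    """Dictionary-based lookup; unknown words are flagged rather than guessed."""
    return [LEXICON.get(w, f"<unknown:{w}>") for w in words]

words = normalize("Dr. No. 25")
print(words)             # prints: ['doctor', 'number', 'two', 'five']
print(pronounce(words))
```

Words missing from the dictionary are marked explicitly. A fuller system would need a fallback, such as letter-to-sound rules, for those cases.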
Music perception and composition are also studied in AI. For example, there are musical works produced by a computer program, some of them in the styles of various classical composers.
Computational vision studies often follow three primary stages:
Vision is not a purely input-driven process. Eye movement has an important impact on human visual perception. An active vision system is one that interacts with its environment by altering its viewpoint rather than passively observing it, and that operates on sequences of images rather than on a single frame. There is also some research on using the eye gaze of a computer user in the interface to help control the application.
"One of the most important properties of high-level perception is that it is extremely flexible. A given set of input data may be perceived in a number of different ways, depending on the context and the state of the perceiver. Due to this flexibility, it is a mistake to regard perception as a process that associates a fixed representation with a particular situation. Both contextual factors and top-down cognitive influences make the process far less rigid than this." [more on this topic].