At this stage it is evident that the first few steps toward building a complete package are already available, and in a variety of forms. However, there is a large chasm between these pieces and the first practical joining of them. It is for this reason that my project begins with a study of existing programs, continues with the most viable solution to their shortcomings, glances at their position in the overall picture and, for the moment, stops at the most efficient prototyping method for the next few steps.
There are literally hundreds of web sites touting speech recognition or natural language processing. In the interest of not making this a commercial venture just yet, I've passed over those asking for money or requiring personal information before showing what they're up to. Most of the sites willing to discuss GNU software are just hype; there is a practice among developers of staying GNU right up until their project resembles something valuable, then going commercial. So I'm sure there are some blazing prototypes out there which, bugs aside, shame the commercial products available. In order to at least benchmark or predict the resource load of a future package, I've included trial packages and freeware in my preliminary assessments.
At the most basic level, most modern operating systems offer the ability to attach sounds to status changes or to program execution and termination. There are also many voice recognition packages that competently work the other way, attaching the recognition of a word to the performance of an action. Unfortunately, these products could use considerable improvement. They require a training period for each individual who will use the product just to get individual words recognized correctly, and each command must still be attached manually. This level of performance sits on the margin of practical accuracy. Even "Microsoft Themes" slow the latest machines noticeably, if not to the point of burden.
With these options one problem persists: they demand too much of the resources available. These burdens are not excessive when only an office application is being run, or when these products go to sleep while a multimedia application runs, but when one attempts to make them as attentive as the keyboard and mouse, performance is compromised. This suggests that the way to perform these tasks at speed is to go to a lower level of the machine: the hardware level.
The partial hard-wiring of components, that is, producing specialized chips such as video chips, or math co-processors before them, only deals with one bottleneck at a time; bigger problems exist at bigger scopes. A complete interface would require more than these two tasks. It would require that the mapping of new commands to actions be ongoing and as simple as saying "Do this when I say that." It would also require that the machine handle ambiguity well, either by making decisions or by asking for clarification. These are just basic steps compared with discerning intention and being creative with verbal output. When one considers these additions, the project grows exponentially in complexity and resource demand.
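To make the "Do this when I say that" idea concrete, here is a minimal sketch of an ongoing command-to-action mapping that can be extended while the interface runs. All of the phrases, names and actions are hypothetical placeholders of my own, not taken from any existing product.

# Minimal sketch: a command map that can grow while the interface runs.
# Every phrase and action here is a hypothetical placeholder.

commands = {
    "open mail": lambda: print("launching mail client..."),
    "lock screen": lambda: print("locking the screen..."),
}

def teach(phrase, action):
    """'Do this when I say that' -- bind a new spoken phrase to an action."""
    commands[phrase] = action

def handle(phrase):
    """Run the action for a recognized phrase, or ask for clarification."""
    action = commands.get(phrase)
    if action is None:
        print(f"I don't know '{phrase}' yet -- did you mean one of:",
              ", ".join(commands))
        return
    action()

teach("play music", lambda: print("starting the music player..."))
handle("play music")
handle("play musik")   # ambiguity: the interface asks instead of guessing

The point of the sketch is only that the mapping stays open-ended and that the ambiguous case ends in a question rather than a silent guess.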
With all of these additions requiring more and more of the machine, one would think a practical interface isn't possible. However, there is an existing standard for handling exactly these problems: the architecture of the modern PC is built around offloading them. When the need for better video was recognized, better video cards were produced, with not only specialized chips but dedicated RAM and multiple chips forming a complex architecture of their own. Every component of the PC has evolved independently to meet new requirements, and when new jobs were invented, new cards followed.
The amount of "horsepower" that even rudimentary modules require suggests that hardware is the only solution. Prototyping, however, need not be solely the domain of the board makers. There is one card that, with programming alone, can act like any card: the network card. By dedicating a whole PC to one task or module and passing its behavior to the main PC through a network card, one can simulate what a new interface card would do.
Prototyping in this manner requires multiple PCs, an understanding of parallel processing, and a healthy knowledge of network communications.
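As a rough illustration of that prototyping method, the sketch below shows how a dedicated PC running one module could push its output to the main PC through nothing more than a socket over the network card. The address, port and plain-text message format are my own assumptions for illustration.

# Sketch of a dedicated "module" PC reporting to the main PC over the network.
# Host, port, and the plain-text message format are illustrative assumptions.
import socket

MAIN_PC = ("192.168.0.10", 5000)   # hypothetical address of the main PC

def report(result):
    """Send one module's output (e.g. a recognized word) to the main PC."""
    with socket.create_connection(MAIN_PC) as conn:
        conn.sendall(result.encode("utf-8") + b"\n")

def listen(port=5000):
    """On the main PC, a tiny listener stands in for the future interface card."""
    with socket.create_server(("", port)) as server:
        conn, _ = server.accept()
        with conn:
            print("module says:", conn.recv(1024).decode("utf-8").strip())

In use, the dedicated node would call report("hello") after each interpretation, while the main PC runs listen(); the card of the future would simply collapse this round trip into hardware.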
Following the human model, an interface would use redundant modules. For instance, multiple modules would be launched simultaneously to interpret a word, and the most complex would be given priority until a timer expires. This sequence would cascade until the default device is forced to "make the call," that device being an embedded system like the ones in cellular phones. I believe a similar cascade is used by operating systems for hardware abstraction, the BIOS being the default device; the difference is the scale of the timer.
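A minimal sketch of such a cascade follows, assuming three hypothetical interpreters of increasing complexity: all are started at once, the most complex answer is preferred, and when the deadline passes the cascade falls through until the cheapest "default device" makes the call.

# Sketch of a redundant-module cascade: prefer the most complex interpreter's
# answer, but fall back down the cascade when the timer expires.
# The three interpreters are stand-ins, not real recognition engines.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def embedded_guess(audio):      # the "default device" -- always fast
    return "cheap guess"

def mid_interpreter(audio):
    time.sleep(0.05)
    return "better guess"

def deep_interpreter(audio):
    time.sleep(1.0)             # too slow for this utterance
    return "best guess"

def interpret(audio, deadline=0.2):
    # Launch every module simultaneously; the most complex has highest precedence.
    cascade = [deep_interpreter, mid_interpreter, embedded_guess]
    pool = ThreadPoolExecutor(max_workers=len(cascade))
    futures = [pool.submit(module, audio) for module in cascade]
    answer = embedded_guess(audio)           # the default device's call
    for future in futures:                   # highest precedence first
        try:
            answer = future.result(timeout=deadline)
            break
        except TimeoutError:
            continue                         # timer expired, cascade down
    pool.shutdown(wait=False)
    return answer

print(interpret(b"..."))    # -> "better guess" within the deadline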
Voice recognition requires the interpretation of one point in the flow to hint at what the next point might be. At the smallest grain of this flow are the individual "phonemes." These are the simplest sounds that can be differentiated, and they are linked to the mechanics of the mouth and vocal cords. For this reason one can rule out, for example, certain groupings of consonants; once one phoneme is recognized, the range of potential successors is reduced. We ourselves will catch a whole sentence and, if it doesn't make sense, will re-evaluate each phoneme, asking ourselves "Am I hearing this right?" Cadence also comes into play at this stage and at larger grains.
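A toy sketch of that successor pruning is shown below. The table of which phonemes may follow which is tiny and hand-made purely for illustration; it is not a real phonotactic model of English.

# Toy sketch of phoneme-successor pruning.  The table is a small, hand-made
# illustration, not a real phonotactic model of English.
FOLLOWS = {
    "s": {"t", "p", "k", "a", "i"},   # "st", "sp", "sk", "sa", "si" ...
    "t": {"r", "a", "i", "o"},        # "tl" is ruled out word-initially
    "k": {"r", "l", "a", "o"},
}

def plausible_successors(phoneme, candidates):
    """Keep only the candidate phonemes that can follow the one just heard."""
    allowed = FOLLOWS.get(phoneme, set(candidates))
    return [c for c in candidates if c in allowed]

# After hearing "t", an ambiguous sound scored as maybe "l" or "r"
# is narrowed to "r", because the grouping "tl" is implausible here.
print(plausible_successors("t", ["l", "r"]))   # -> ['r']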
On a larger scale, many phrases could be mistaken for one another if just the right pattern of phonemes were mispronounced, sometimes involving the same substitution one hears when a person lisps. A recent cellular phone commercial has a woman saying, "I just asked how are the kids, and she flowers the kids." Handling this requires that large chunks of conversation be cached, or at least translated into context flags.
Finally, as the structure of at least the sentence stage of sound interpretation is built up from smaller grains, if it doesn't make sense (or doesn't pass the NLP stage) it must go through iterations of re-evaluation before it can be qualified. To keep real-time pace at this stage, it may be necessary to have multiple agents iterating simultaneously until a few plausible phrases can be cached and handed to the NLP stage to deal with.
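A rough sketch of that idea follows, reusing the "flowers the kids" example. The confusion pairs and the trivial plausibility check standing in for the NLP stage are illustrative assumptions; the point is only that several candidate readings are generated and the few plausible ones are cached for the next stage.

# Sketch: re-evaluate a heard sentence by swapping easily-confused words,
# keep the few hypotheses that "make sense", and hand those to the NLP stage.
# The confusion pairs and the plausibility check are illustrative assumptions.
from itertools import product

CONFUSABLE = {"flowers": ["flowers", "how are"], "no": ["no", "know"]}
KNOWN_PHRASES = {"how are the kids", "i know the answer"}

def hypotheses(heard):
    options = [CONFUSABLE.get(word, [word]) for word in heard.split()]
    return {" ".join(combo) for combo in product(*options)}

def plausible(sentence):
    return sentence in KNOWN_PHRASES        # stand-in for the real NLP check

cache = [h for h in hypotheses("flowers the kids") if plausible(h)]
print(cache)    # -> ['how are the kids']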
Many of the pieces this complexity calls for have already been developed, a fact lost on the developers I've researched up to this point. The aforementioned embedded voice recognition is just one of these. Another is sound filtering. An expensive microphone can help the software rule out noises coming from sources other than the person speaking. Sound-studio mixing boards have band equalization so that sounds one instrument wouldn't make are filtered out of its mic. The timbre of one's voice can allow extraneous noises to be ignored, or can even be used to discern the person's gender or age and so help the NLP stage. When extraneous noise falls within the speaker's range, countless iterations may be avoided by 3-D sound reception.
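A small sketch of the band-equalization idea is given below, assuming SciPy is available and a 16 kHz mono signal; the band edges are typical telephony values chosen for illustration. Everything outside the rough 300-3400 Hz band where most speech energy sits is filtered out before recognition even begins.

# Sketch of band-limiting the microphone signal to the speech band so that
# sounds a voice wouldn't make are removed before recognition starts.
# Sample rate and band edges are typical telephony values, used for illustration.
import numpy as np
from scipy.signal import butter, lfilter

FS = 16_000                      # samples per second
LOW, HIGH = 300.0, 3400.0        # rough speech band in Hz

def speech_band(signal):
    nyquist = FS / 2.0
    b, a = butter(4, [LOW / nyquist, HIGH / nyquist], btype="band")
    return lfilter(b, a, signal)

# Example: a 100 Hz hum mixed with a 1 kHz tone -- the hum is attenuated.
t = np.arange(FS) / FS
noisy = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
clean = speech_band(noisy)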
At the NLP stage the simplest interpretation is usually correct, so the order of precedence in the cascade should be reversed from that of the sound stage. However, more complex algorithms must still be launched simultaneously in order to keep pace when the simple analysis fails, necessitating yet another large structure for what is currently being developed as a relatively small, stand-alone product.
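In code, that reversed precedence might look like the sketch below; the three analyzers are hypothetical stand-ins for real NLP components. The cheap interpretation is accepted whenever it succeeds, and the heavier analyses are consulted only when it fails.

# Sketch of the reversed cascade at the NLP stage: the simplest analysis is
# trusted first, and heavier analyses are only consulted when it fails.
# The three analyzers are hypothetical stand-ins for real NLP components.
def keyword_match(phrase):
    return {"open mail": "launch_mail"}.get(phrase)      # cheapest

def grammar_parse(phrase):
    return "launch_mail" if phrase.startswith("open") else None

def full_semantic_analysis(phrase):
    return "ask_for_clarification"                        # most expensive

def interpret(phrase):
    for analyze in (keyword_match, grammar_parse, full_semantic_analysis):
        result = analyze(phrase)
        if result is not None:       # simplest successful reading wins
            return result

print(interpret("open mail"))            # resolved by the cheapest analyzer
print(interpret("could you open it"))    # falls through to deeper analysis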
At the prototyping phase, all this complexity adds up to what might be dozens of machines feeding into three or four, which report to one, which in turn reports to the main PC through a single network card. That card would ultimately be replaced by the specialized interface card of the future. This form of prototyping system is similar to a model of parallel processing commonly used today in the form of intranets, the only difference being that the multiple clients and users are replaced by a tree structure, with the user's PC sitting on top of the head node.
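Written out as data, the topology might look like the sketch below; every host name is a made-up placeholder, and the numbers of nodes are only suggestive of the "dozens feeding into a few" shape described above.

# Sketch of the prototype cluster's tree, written out as plain data.
# Every host name here is a made-up placeholder.
CLUSTER = {
    "main-pc": {                       # the user's PC, future interface card
        "head-node": {
            "sound-collector": ["phoneme-1", "phoneme-2", "filter-1"],
            "phrase-collector": ["phrase-1", "phrase-2"],
            "nlp-collector": ["nlp-simple", "nlp-deep"],
        }
    }
}

def report_path(tree, target, path=()):
    """Return the chain a node's output travels through on its way up."""
    for node, children in tree.items():
        if node == target:
            return path + (node,)
        kids = children if isinstance(children, dict) else {c: {} for c in children}
        found = report_path(kids, target, path + (node,))
        if found:
            return found
    return None

print(report_path(CLUSTER, "phoneme-2"))
# -> ('main-pc', 'head-node', 'sound-collector', 'phoneme-2')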
As amazingly complex as the modern video chip and CISC CPU are today, it might be quite some time before such complexity is worked out for a new generation of interface card. We as developers should not shy away from the necessity of networked clusters. If this is what is required, the public will accept it just as they accept the many components currently required; even Apple's one-piece machine has a separate mouse, keyboard, printer, microphone . . . the list goes on. One can take solace in the fact that network kits of expansion cards, cables and hubs are getting cheaper and easier for the novice to set up. When CPUs were simply not powerful enough for CAD programs and video editing, multi-CPU motherboards were commonplace. Ultimately, there are many options for dealing with overwhelming demand that have been proven through practical use.
The model of a tree of dedicated nodes differs from the neural network model of parallel processing, which has been the latest front-runner for approaching artificial intelligence. It is my contention that this kind of specialization is how living creatures accomplish intelligence. This is supported by the fact that the brain can be observed to be active in specific regions in response to specific stimuli, ranging from fundamental tasks such as sound and visual interpretation and physical coordination to what we consider the measure of intelligence: analytical and creative thought. To have the complete tree swap over to a task as complex as speech recognition when nothing is being spoken may deserve its own catch phrase: "I'm sorry, I wasn't listening. Could you repeat that?"
Artificial intelligence is debated by camps championing the various models, e.g., neural networks, the genetic model, and so on. Clearly each model has its strengths; however, this argument over which one will become the staple misses the point. The fact that we can document the physical makeup of the brain, which literally is made up of a neural network at one level and is modularized at other levels, suggests that true artificial intelligence can be realized once we recognize just how complex the architecture of biological intelligence is.
Eventually we may produce embedded systems that stun us with their complexity. We may even have PCs powerful enough to handle such a daunting task with one very big program. But until this happens, we must concede that a complete interface is going to be much larger than the bits and pieces we've been working on to this point. Therefore, these modules must be designed for fast inter-module communication and, until hardware is developed to meet the demand, clustering must be considered when developing a new interface.