At this stage it is evident that the first few steps toward building a complete package are already available, and in a variety of forms. However, there is a large chasm between these pieces and the first practical joining of them. It is for this reason that my project begins with a study of existing programs, continues with the most viable solution to their shortcomings, glances at their position in the overall picture and, for the moment, stops at the most efficient prototyping method for the next few steps.
There are literally hundreds of web sites touting speech recognition or natural language processing. In the interest of not making this a commercial venture just yet, I've passed over those asking for money or requiring personal information before showing what they're up to. Most of the sites willing to discuss GNU software are just hype; there is a practice among developers of staying GNU right up until their project resembles something valuable, then going commercial. So I'm sure there are some blazing prototypes out there which, bugs aside, shame the commercial products available. In order to at least benchmark or predict the resource load of a future package, I've included trial packages and freeware in my preliminary assessments.
At the most basic level, most modern operating systems offer the ability to attach sounds to status changes or to program execution and termination. There are also many voice recognition packages that competently work the other way, attaching the recognition of a word to the performance of an action. Unfortunately, these products could use considerable improvement. They require a training period for each individual who will use the product just to get individual words recognized correctly, and each command must still be attached manually. This level of performance sits on the margin of practical accuracy. Even "Microsoft Themes" slow the latest machines noticeably, if not to the point of burden.
With these options one problem persists: they demand too much of the resources available. These burdens are not excessive when only an office application is being run, or when these products go to sleep while a multimedia application runs, but when one attempts to make them as attentive as the keyboard and mouse, performance is compromised. This suggests that the way to perform these tasks at speed is to go to a lower level of the machine: the hardware level.
The partial hard-wiring of components, that is, producing specialized chips such as video chips, or math co-processors before them, only deals with one bottleneck at a time; bigger problems exist at bigger scopes. A complete interface would require more than these two tasks. It would require that the mapping of new commands to actions be ongoing and as simple as saying "Do this when I say that." It would also require that the machine handle ambiguity well, either by making decisions or by asking for clarification. These are just basic steps compared with discerning intention and being creative with verbal output. When one considers these additions, the project grows exponentially in complexity and resource demand.
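To make the "Do this when I say that" idea concrete, here is a minimal sketch of an ongoing command-to-action mapping that can be extended while the interface runs. All of the phrases, names and actions are hypothetical placeholders of my own, not taken from any existing product.

# Minimal sketch: a command map that can grow while the interface runs.
# Every phrase and action here is a hypothetical placeholder.

commands = {
    "open mail": lambda: print("launching mail client..."),
    "lock screen": lambda: print("locking the screen..."),
}

def teach(phrase, action):
    """'Do this when I say that' -- bind a new spoken phrase to an action."""
    commands[phrase] = action

def handle(phrase):
    """Run the action for a recognized phrase, or ask for clarification."""
    action = commands.get(phrase)
    if action is None:
        print(f"I don't know '{phrase}' yet -- did you mean one of:",
              ", ".join(commands))
        return
    action()

teach("play music", lambda: print("starting the music player..."))
handle("play music")
handle("play musik")   # ambiguity: the interface asks instead of guessing

The point of the sketch is only that the mapping stays open-ended and that the ambiguous case ends in a question rather than a silent guess.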
With all of these additions requiring more and more of the machine, one would think a practical interface isn't possible. However, there is an existing standard for handling exactly these problems: the architecture of the modern PC is built around offloading them. When the need for better video was recognized, better video cards were produced, with not only specialized chips but dedicated RAM and multiple chips forming a complex architecture of their own. Every component of the PC has evolved independently to meet new requirements, and when new jobs were invented, new cards followed.
The amount of "horsepower" that even rudimentary modules require suggests that hardware is the only solution. Prototyping, however, need not be solely the domain of the board makers. There is one card that, with programming alone, can act like any card: the network card. By dedicating a whole PC to one task or module and passing its behavior to the main PC through a network card, one can simulate what a new interface card would do.
Prototyping in this manner requires multiple PCs, an understanding of parallel processing, and a healthy knowledge of network communications.
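As a rough illustration of that prototyping method, the sketch below shows how a dedicated PC running one module could push its output to the main PC through nothing more than a socket over the network card. The address, port and plain-text message format are my own assumptions for illustration.

# Sketch of a dedicated "module" PC reporting to the main PC over the network.
# Host, port, and the plain-text message format are illustrative assumptions.
import socket

MAIN_PC = ("192.168.0.10", 5000)   # hypothetical address of the main PC

def report(result):
    """Send one module's output (e.g. a recognized word) to the main PC."""
    with socket.create_connection(MAIN_PC) as conn:
        conn.sendall(result.encode("utf-8") + b"\n")

def listen(port=5000):
    """On the main PC, a tiny listener stands in for the future interface card."""
    with socket.create_server(("", port)) as server:
        conn, _ = server.accept()
        with conn:
            print("module says:", conn.recv(1024).decode("utf-8").strip())

In use, the dedicated node would call report("hello") after each interpretation, while the main PC runs listen(); the card of the future would simply collapse this round trip into hardware.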
Following the human model, an interface would use redundant modules. For instance, multiple modules would be launched simultaneously to interpret a word, and the most complex would be given priority until a timer expires. This sequence would cascade until the default device is forced to "make the call," that device being an embedded system like the ones in cellular phones. I believe a similar cascade is used by operating systems for hardware abstraction, the BIOS being the default device; the difference is the scale of the timer.
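A minimal sketch of such a cascade follows, assuming three hypothetical interpreters of increasing complexity: all are started at once, the most complex answer is preferred, and when the deadline passes the cascade falls through until the cheapest "default device" makes the call.

# Sketch of a redundant-module cascade: prefer the most complex interpreter's
# answer, but fall back down the cascade when the timer expires.
# The three interpreters are stand-ins, not real recognition engines.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def embedded_guess(audio):      # the "default device" -- always fast
    return "cheap guess"

def mid_interpreter(audio):
    time.sleep(0.05)
    return "better guess"

def deep_interpreter(audio):
    time.sleep(1.0)             # too slow for this utterance
    return "best guess"

def interpret(audio, deadline=0.2):
    # Launch every module simultaneously; the most complex has highest precedence.
    cascade = [deep_interpreter, mid_interpreter, embedded_guess]
    pool = ThreadPoolExecutor(max_workers=len(cascade))
    futures = [pool.submit(module, audio) for module in cascade]
    answer = embedded_guess(audio)           # the default device's call
    for future in futures:                   # highest precedence first
        try:
            answer = future.result(timeout=deadline)
            break
        except TimeoutError:
            continue                         # timer expired, cascade down
    pool.shutdown(wait=False)
    return answer

print(interpret(b"..."))    # -> "better guess" within the deadline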
Voice recognition requires the interpretation of one point in the flow to hint at what the next point might be. At the smallest grain of this flow are the individual "phonemes." These are the simplest sounds that can be differentiated, and they are linked to the mechanics of the mouth and vocal cords. For this reason one can rule out, for example, certain groupings of consonants; once one phoneme is recognized, the range of potential successors is reduced. We ourselves will catch a whole sentence and, if it doesn't make sense, will re-evaluate each phoneme, asking ourselves "Am I hearing this right?" Cadence also comes into play at this stage and at larger grains.
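A toy sketch of that successor pruning is shown below. The table of which phonemes may follow which is tiny and hand-made purely for illustration; it is not a real phonotactic model of English.

# Toy sketch of phoneme-successor pruning.  The table is a small, hand-made
# illustration, not a real phonotactic model of English.
FOLLOWS = {
    "s": {"t", "p", "k", "a", "i"},   # "st", "sp", "sk", "sa", "si" ...
    "t": {"r", "a", "i", "o"},        # "tl" is ruled out word-initially
    "k": {"r", "l", "a", "o"},
}

def plausible_successors(phoneme, candidates):
    """Keep only the candidate phonemes that can follow the one just heard."""
    allowed = FOLLOWS.get(phoneme, set(candidates))
    return [c for c in candidates if c in allowed]

# After hearing "t", an ambiguous sound scored as maybe "l" or "r"
# is narrowed to "r", because the grouping "tl" is implausible here.
print(plausible_successors("t", ["l", "r"]))   # -> ['r']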
On a larger scale, many phrases could be mistaken for one another if just the right pattern of phonemes were mispronounced, sometimes involving the same substitution one hears when a person lisps. A recent cellular phone commercial has a woman saying, "I just asked how are the kids, and she flowers the kids." Handling this requires that large chunks of conversation be cached, or at least translated into context flags.
Finally, as the structure of at least the sentence stage of sound interpretation is built up from smaller grains, if it doesn't make sense (or doesn't pass the NLP stage) it must go through iterations of re-evaluation before it can be qualified. To keep real-time pace at this stage, it may be necessary to have multiple agents iterating simultaneously until a few plausible phrases can be cached and handed to the NLP stage to deal with.
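A rough sketch of that idea follows, reusing the "flowers the kids" example. The confusion pairs and the trivial plausibility check standing in for the NLP stage are illustrative assumptions; the point is only that several candidate readings are generated and the few plausible ones are cached for the next stage.

# Sketch: re-evaluate a heard sentence by swapping easily-confused words,
# keep the few hypotheses that "make sense", and hand those to the NLP stage.
# The confusion pairs and the plausibility check are illustrative assumptions.
from itertools import product

CONFUSABLE = {"flowers": ["flowers", "how are"], "no": ["no", "know"]}
KNOWN_PHRASES = {"how are the kids", "i know the answer"}

def hypotheses(heard):
    options = [CONFUSABLE.get(word, [word]) for word in heard.split()]
    return {" ".join(combo) for combo in product(*options)}

def plausible(sentence):
    return sentence in KNOWN_PHRASES        # stand-in for the real NLP check

cache = [h for h in hypotheses("flowers the kids") if plausible(h)]
print(cache)    # -> ['how are the kids']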
Many of the pieces this complexity calls for have already been developed, a fact lost on the developers I've researched up to this point. The aforementioned embedded voice recognition is just one of these. Another is sound filtering. An expensive microphone can help the software rule out noises coming from sources other than the person speaking. Sound-studio mixing boards have band equalization so that sounds one instrument wouldn't make are filtered out of its mic. The timbre of one's voice can allow extraneous noises to be ignored, or can even be used to discern the person's gender or age and so help the NLP stage. When extraneous noise falls within the speaker's range, countless iterations may be avoided by 3-D sound reception.
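A small sketch of the band-equalization idea is given below, assuming SciPy is available and a 16 kHz mono signal; the band edges are typical telephony values chosen for illustration. Everything outside the rough 300-3400 Hz band where most speech energy sits is filtered out before recognition even begins.

# Sketch of band-limiting the microphone signal to the speech band so that
# sounds a voice wouldn't make are removed before recognition starts.
# Sample rate and band edges are typical telephony values, used for illustration.
import numpy as np
from scipy.signal import butter, lfilter

FS = 16_000                      # samples per second
LOW, HIGH = 300.0, 3400.0        # rough speech band in Hz

def speech_band(signal):
    nyquist = FS / 2.0
    b, a = butter(4, [LOW / nyquist, HIGH / nyquist], btype="band")
    return lfilter(b, a, signal)

# Example: a 100 Hz hum mixed with a 1 kHz tone -- the hum is attenuated.
t = np.arange(FS) / FS
noisy = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
clean = speech_band(noisy)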
At the NLP stage the simplest interpretation is usually correct, so the order of precedence in the cascade should be reversed from that of the sound stage. However, more complex algorithms must still be launched simultaneously in order to keep pace when the simple analysis fails, necessitating yet another large structure for what is currently being developed as a relatively small, stand-alone product.
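In code, that reversed precedence might look like the sketch below; the three analyzers are hypothetical stand-ins for real NLP components. The cheap interpretation is accepted whenever it succeeds, and the heavier analyses are consulted only when it fails.

# Sketch of the reversed cascade at the NLP stage: the simplest analysis is
# trusted first, and heavier analyses are only consulted when it fails.
# The three analyzers are hypothetical stand-ins for real NLP components.
def keyword_match(phrase):
    return {"open mail": "launch_mail"}.get(phrase)      # cheapest

def grammar_parse(phrase):
    return "launch_mail" if phrase.startswith("open") else None

def full_semantic_analysis(phrase):
    return "ask_for_clarification"                        # most expensive

def interpret(phrase):
    for analyze in (keyword_match, grammar_parse, full_semantic_analysis):
        result = analyze(phrase)
        if result is not None:       # simplest successful reading wins
            return result

print(interpret("open mail"))            # resolved by the cheapest analyzer
print(interpret("could you open it"))    # falls through to deeper analysis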
At the prototyping phase, all this complexity adds up to what might be dozens of machines feeding into three or four, which report to one, which in turn reports to the main PC through a single network card. That card would ultimately be replaced by the specialized interface card of the future. This form of prototyping system is similar to a model of parallel processing commonly used today in the form of intranets, the only difference being that the multiple clients and users are replaced by a tree structure, with the user's PC sitting on top of the head node.
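Written out as data, the topology might look like the sketch below; every host name is a made-up placeholder, and the numbers of nodes are only suggestive of the "dozens feeding into a few" shape described above.

# Sketch of the prototype cluster's tree, written out as plain data.
# Every host name here is a made-up placeholder.
CLUSTER = {
    "main-pc": {                       # the user's PC, future interface card
        "head-node": {
            "sound-collector": ["phoneme-1", "phoneme-2", "filter-1"],
            "phrase-collector": ["phrase-1", "phrase-2"],
            "nlp-collector": ["nlp-simple", "nlp-deep"],
        }
    }
}

def report_path(tree, target, path=()):
    """Return the chain a node's output travels through on its way up."""
    for node, children in tree.items():
        if node == target:
            return path + (node,)
        kids = children if isinstance(children, dict) else {c: {} for c in children}
        found = report_path(kids, target, path + (node,))
        if found:
            return found
    return None

print(report_path(CLUSTER, "phoneme-2"))
# -> ('main-pc', 'head-node', 'sound-collector', 'phoneme-2')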
As amazingly complex as the modern video chip and CISC CPU are today, it might be quite some time before such complexity is worked out for a new generation of interface card. We as developers should not shy away from the necessity of networked clusters. If this is what is required, the public will accept it just as they accept the many components currently required; even Apple's one-piece machine has a separate mouse, keyboard, printer, microphone . . . the list goes on. One can take solace in the fact that network kits of expansion cards, cables and hubs are getting cheaper and easier for the novice to set up. When CPUs were simply not powerful enough for CAD programs and video editing, multi-CPU motherboards were commonplace. Ultimately, there are many options for dealing with overwhelming demand that have been proven through practical use.
The model of a tree of dedicated nodes differs from the neural network model of parallel processing, which has been the latest front-runner for approaching artificial intelligence. It is my contention that this kind of specialization is how living creatures accomplish intelligence. This is supported by the fact that the brain can be observed to be active in specific regions in response to specific stimuli, ranging from fundamental tasks such as sound and visual interpretation and physical coordination to what we consider the measure of intelligence: analytical and creative thought. To have the complete tree swap over to a task as complex as speech recognition when nothing is being spoken may deserve its own catch phrase: "I'm sorry, I wasn't listening. Could you repeat that?"
Artificial intelligence is debated by camps championing the various models, e.g., neural networks, the genetic model, and so on. Clearly each model has its strengths; however, this argument over which one will become the staple misses the point. The fact that we can document the physical makeup of the brain, which literally is made up of a neural network at one level and is modularized at other levels, suggests that true artificial intelligence can be realized once we recognize just how complex the architecture of biological intelligence is.
Eventually we may produce embedded systems that stun us with their complexity. We may even have PCs powerful enough to handle such a daunting task with one very big program. But until this happens, we must concede that a complete interface is going to be much larger than the bits and pieces we've been working on to this point. Therefore, these modules must be designed for fast inter-module communication and, until hardware is developed to meet the demand, clustering must be considered when developing a new interface.