Andrew Hussie
CIS0203 Project Final

Natural Language Processing and Basic Reasoning

NLP Applet - must have a current JDK installed to function

Natural language processing (NLP) is the ability of a computer to examine statements in a human language such as English and derive meaningful information from them in a way that corresponds to how a human understands them. In short, it is getting machines to "understand" our languages. NLP is one of the older topics in AI. It was thought (and still is) that the division between humans and computers could be narrowed by programming computers to accept language which is natural to people as commands, rather than unwieldy, non-intuitive coded instructions. Another early application was to have a computer translate from one language to another, such as from French to German. A more recently pursued application of NLP is in the field of information retrieval/acquisition: combing through large amounts of text searching for elements of a certain meaning, rather than just key words or phrases. This has become useful in fields related to the internet, such as data mining, web/article searches, or generating summaries or guides to large amounts of text which might be instructional, scientific, etc.
NLP Applications:
---Easier machine interface
---Language translation
---Data mining
---More intelligent searches
---"Chatter Bots" - www.alicebot.org
---Unified systems (CYC, HAL...)

There has been limited success in these fields; that is, no solutions have been perfect or fool-proof. The translations generated by language translators tend to look "funny", and have trouble consistently constructing natural-looking or natural-sounding statements in the new language (an example is Babel Fish: http://world.altavista.com/tr). Programs which are intended more to extract meaning from text have trouble grasping particular semantics, stumble over ambiguous words and phrases, over-interpret things they shouldn't (like figures of speech), or sometimes miss the point entirely. The solutions and programs which exist do not fall short for a lack of trying; many people have devoted much time to this problem. There is no perfect solution today because the problem is widely recognized as being very difficult.

It is not hard to see why so many have had such trouble with NLP. The problem is mainly due to the qualities of human language itself. Human language has the appearance of being a well-defined system with easy rules to follow, but we assume this because we speak it so easily and naturally. On close inspection, our languages are not quite as well defined, or as easy to define well, as one may think. For instance, in any language, much of our intended meaning relies on context: the setting, events, people, and objects very often have tremendous impact on meaning. There is a great deal of ambiguity in any language, such as words which refer to unstated concepts or objects, like pronouns (it, he, they, ...), words which are spelled the same but mean different things (duck and duck), or words which mean different things in different contexts (to, of). There are figures of speech which can distort the meaning greatly if not handled properly, and there are idioms which are far more pesky and subtle than figures of speech, the list of which is usually quite long (add up, hand out, save face, ...). There is even more subtlety involved in natural language when you consider that the tone with which a statement is spoken can change its meaning, or that the meaning may have nothing to do with the actual words: the meaning may be implied somehow, or there may be irony.

The human mind seems to be able to handle this loose structure of inferences and associations, but a computer has much more trouble with it. A human may find it easier to learn and handle this volume of words, rules, and exceptions because he or she has a vast reservoir of diverse and textured experiences to relate to the language and derive meaning within that context. A computer, however, has only what the programmer gives it, and can derive meaning only so far as it relates to its own internal methods and operations. That is largely what makes the task of determining "meaning" with a computer so challenging. A system like chess, on the other hand, is much better defined than the English language, so a computer may handle tasks relating to that system with relative ease.
Difficulties with NLP:
---Deriving "meaning" with respect to a computer
- - - - -executing commands or operations
- - - - -inferring intent of speaker
- - - - -deriving statements of fact and logic
---Inherent problems with human language
- - - - -not as well defined as one may think
- - - - -many exceptions to defined rules (like plurality: house/houses, but mouse/mice, etc.)
- - - - -much ambiguity - pronouns, "that", "this", words spelled the same, words used differently (of, to)
- - - - -figures of speech - idioms (add up, hand out, shape up, turn in...)
- - - - -context - meaning can change depending on setting
- - - - -tone and circumstance - tone of voice changes meaning, or meaning is very different from actual words (irony)
- - - - -human languages can vary greatly

 

NLP Programming Project

I've included a programming project to experiment with various approaches to NLP for the English language. I began the project by outlining a basic solution to the whole problem of interpreting natural language without looking too thoroughly into the standard, documented approaches to the problem. This way, I felt I could "get to know" the various aspects of the problem first-hand, and learn why certain parts of it are considered so difficult. Later I examined in detail what others had done, compared it to what I had already figured out, and used that information to improve my solutions.

I first outlined loosely what this program should be able to do. The very concept of NLP implies that natural language is examined for meaning, but the meaning must be with respect to something, e.g. some search criteria or some set of executable processes. It cannot have the type of meaning that humans generally derive from text. So I chose one of the more straightforward ways of representing meaning derived from text: simple statements of logic and reasoning which pertain to the subjects of the text. Subjects of text could include any identified noun or nominal-type expression along with anything that qualifies them, and relevant information could be any description or action pertaining to them. For instance, the dog may be a relevant object, and is brown or ran away could be pertinent descriptions. The program could keep track of all encountered combinations as statements of truth, and then, using simple forms of reasoning, deduce other statements. The methods for logical analysis need not be too intensive, because they are not the most important aspect of this project; they are just meant to provide an interesting way to apply the meaning which is derived from statements.
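To make this concrete, here is a minimal Java sketch of what such a "container" of truths might look like. The class and method names are my own inventions for illustration, not the applet's actual classes:

  import java.util.*;

  public class Statements {
      // e.g. "the dog" -> {"is brown", "ran away"}
      private final Map<String, Set<String>> truths = new HashMap<>();
      // e.g. "dog" -> "animal", for simple "a dog is an animal" chains
      private final Map<String, String> isA = new HashMap<>();

      public void assertTruth(String subject, String description) {
          truths.computeIfAbsent(subject, k -> new HashSet<>()).add(description);
      }

      public void assertIsA(String kind, String superKind) {
          isA.put(kind, superKind);
      }

      // True if stated directly, or deduced through the "is a" chain.
      public boolean holds(String subject, String description) {
          for (String s = subject; s != null; s = isA.get(s))
              if (truths.getOrDefault(s, Collections.<String>emptySet()).contains(description))
                  return true;
          return false;
      }

      public static void main(String[] args) {
          Statements st = new Statements();
          st.assertTruth("dog", "is brown");      // an encountered combination
          st.assertIsA("dog", "animal");
          st.assertTruth("animal", "can run");
          System.out.println(st.holds("dog", "can run")); // true, by deduction
      }
  }

Even one simple rule of inheritance like this lets the program state things it was never directly told, which is the kind of interesting application of meaning described above.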

Syntax:
Once I established the purpose, as well as the "container" for the meaning of statements to be derived, I started laying out the very basic needs of NLP for this project. The program must have a system for evaluating the structure of statements, as well as the roles of the individual words within that structure. This is where a syntax for the language must be identified and converted into useful rules for the production and analysis of statements. A syntax describes how certain types of phrases can be assembled from atomic word types, and how further phrases may be assembled using appropriate combinations of other phrases. There are a number of documented ways English has been broken down in this fashion, and virtually all who have constructed NLP programs have either adopted some existing syntax or come up with their own to suit specific purposes. The syntax for a natural language is generally quite complicated and extensive, with many word types, rules, and exceptions. I began by outlining a very simple syntax for English, with a few atomic word types like nouns, verbs, adjectives, and prepositions, and simple higher-level phrase types like noun phrases and verb phrases. Over time the syntax expanded with more types and more complicated structures as I researched English grammar conventions and what others had done with NLP. Below is a graphic documenting the working syntax for this program, along with each atomic word type.


...

Once a proper syntax for the language is established, the program must have methods for parsing or interpreting sentences to figure out what the diagram for each sentence is. A fairly simple algorithm employed by some NLP programs is to assume a group of words is a certain type of phrase, then open up that node to see if its components fit the model for that phrase type, and keep doing so until there is negative or positive proof. For instance, if we assume that the input text is a sentence, we may open that node to view the possibilities for its structure:
  SENT -> NP VP  
  SENT -> VP NP  
  SENT -> {{T} SENT} T SENT  
  SENT -> SENT {{C} SENT}  
  SENT -> {I} SENT  

Then we examine the first possibility and assume the sentence follows that structure. So we open the node of the first element:
  NP -> {A ^ Q} MP {PP ^ (VQ VP)}  
  NP -> Q {NP {VP}}  
  NP -> NP {{C} NP}  
We continue to examine structures this way until the assumption fails and we must examine the next possibility, or it succeeds and we proceed with the next necessary analysis. The atomic types are not opened in this fashion; they are simply tested to verify that the word fits that type (although we could say that verbs could be nodes, to see if they fit the infinitive form).
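As an illustration, here is a small self-contained Java sketch of this kind of top-down, backtracking analysis. The rule table is a toy fragment in the spirit of the productions above (not the full working syntax), it operates on already-typed words, and all names are my own rather than the applet's actual classes:

  import java.util.*;

  public class TopDownRecognizer {
      // phrase type -> its possible structures (a toy fragment of a syntax)
      static final Map<String, String[][]> RULES = new HashMap<>();
      static {
          RULES.put("SENT", new String[][]{{"NP", "VP"}, {"VP", "NP"}});
          RULES.put("NP",   new String[][]{{"A", "N"}, {"N"}});
          RULES.put("VP",   new String[][]{{"V", "NP"}, {"V"}});
      }

      // Can `symbol` produce tokens[from..to)? Atomic types match one token.
      static boolean derives(String symbol, String[] tokens, int from, int to) {
          if (!RULES.containsKey(symbol))             // atomic word type
              return to - from == 1 && tokens[from].equals(symbol);
          for (String[] structure : RULES.get(symbol))  // open the node
              if (derivesSeq(structure, 0, tokens, from, to))
                  return true;                        // positive proof
          return false;                               // negative proof: backtrack
      }

      static boolean derivesSeq(String[] rhs, int i, String[] tokens, int from, int to) {
          if (i == rhs.length) return from == to;
          for (int mid = from + 1; mid <= to; mid++)  // try each split point
              if (derives(rhs[i], tokens, from, mid)
                      && derivesSeq(rhs, i + 1, tokens, mid, to))
                  return true;
          return false;
      }

      public static void main(String[] args) {
          String[] tagged = {"A", "N", "V", "N"};     // e.g. "the dog chased cats"
          System.out.println(derives("SENT", tagged, 0, tagged.length)); // true
      }
  }

The recursion mirrors the node-opening process exactly: assume a structure, test its elements, and fall through to the next possibility when a structure cannot be made to fit.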

I've adapted an algorithm like this for parsing sentences, although it is not necessarily strictly left-to-right analysis. Sometimes it is more convenient or more definitive to look for a particular type of word or phrase anywhere in the sentence, to establish context and narrow down the possibilities more quickly. For instance, when evaluating something which may be an NP, it is possible that there will be a PP in it, and consequently a preposition somewhere in it. If there is, it is certain according to our syntax that the noun will occur just before it. This is more convenient than looking at every word starting from the left and figuring out whether the words are qualifiers, modifiers, or nouns, which is especially difficult when working with a very limited vocabulary and imperfect knowledge.
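A sketch of that shortcut, with a toy preposition list and hypothetical names rather than the program's actual code:

  import java.util.*;

  public class NpHeuristic {
      static final Set<String> PREPOSITIONS =
          new HashSet<>(Arrays.asList("of", "in", "on", "to", "with"));

      // Index of the probable head noun, or -1 if no preposition is found.
      static int probableNounIndex(String[] words) {
          for (int i = 1; i < words.length; i++)
              if (PREPOSITIONS.contains(words[i]))
                  return i - 1;   // the noun sits just before it, per the syntax
          return -1;
      }

      public static void main(String[] args) {
          String[] np = {"the", "big", "dog", "in", "the", "yard"};
          System.out.println(np[probableNounIndex(np)]); // dog
      }
  }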

Vocabulary:
This brings up an important issue of NLP. A program must have means of remembering the words it encounters. It should also remember what purpose those words served, such as what type they are, so that sentences later encountered which contain those words may be identified more easily. Also, the stored information about words may be used when later trying to extract meaning from statements, whatever form that meaning takes.

It must be decided how individual words are going to be kept track of in memory. A very rigorous, intensive approach would be to keep track of words at their root level, and have them be referenced and assembled from their basic parts: prefixes, suffixes, common roots, and the like. This gets into the field of morphology, the study of word origins and structure by basic meanings, which has been a big part of the NLP field. However, I have not dealt with this issue in this program. It adds a tremendous level of complexity to the project, particularly since morphological rules can be deceiving to a program. For instance, imagine telling a machine why "slowly" and "noisily" are adverbs but "imply" and "family" aren't, or why "hiking" is a verb form but "viking" isn't, or why "decompose", "denounce", and "deflate" have a negative element yet "detect", "detour", and "decide" do not. With a morphological mode of memory and word construction, you can see there will end up being nearly as many exceptions as rules. If executed properly and thoroughly, though, it can be quite valuable and shed a great deal of light on the meanings of words. However, such an expansive task would not be appropriate for this project.

This program has a more straightforward mode of vocabulary. Any distinct chain of letters it finds, it perceives as a word and stores if it is unknown. That means very similar words, or words with similar roots (like run, runs, running, runner), have no inherent relation to each other in memory. This is less efficient (or less "intelligent", you could say), but it avoids the difficulties of a system of roots.
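A minimal sketch of this mode of vocabulary, with illustrative names of my own:

  import java.util.*;

  public class Vocabulary {
      // Word stands in for whatever per-word data the program keeps.
      static class Word {
          final String text;
          Word(String t) { text = t; }
      }

      private final Map<String, Word> words = new HashMap<>();

      // Store the chain of letters if unknown; either way, return its entry.
      public Word lookup(String token) {
          return words.computeIfAbsent(token.toLowerCase(), Word::new);
      }

      public static void main(String[] args) {
          Vocabulary v = new Vocabulary();
          for (String t : "run runs running runner run".split(" "))
              v.lookup(t);
          // "run", "runs", "running", "runner" are four unrelated entries:
          System.out.println(v.lookup("run") == v.lookup("runs")); // false
      }
  }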

Learning/Adaptation:
Not only must a program keep track of words, it must keep track of some kind of information relating to those words. The most rudimentary kind of information required is what type of word it is (or most likely is). As stated, one of the great difficulties of English and NLP is that many words can be several different types. For instance, virtually all verbs can be nouns (to walk, or go for a walk), and many nouns can be verbs as well. Also, as stated in the working syntax, any noun can be a modifier as well. The words in memory should have some capacity to be multiple types, and some kind of built-in hierarchy of which type occurs when. One way is to have context be the deciding factor for which type the word is at a given moment. Another way is to rank the possible types of a word according to the frequency of their occurrence. I've employed some degree of both of these options in the program. The words in memory are simply assigned integer values for each possible type, representing the number of times the word has been recognized as that type (such as N=20, M=4, Q=0, ...); the value is incremented each time that type occurs again. That way, words are easier to recognize based on the highly ranked types carried with them, along with some contextual aid when parsing a sentence.
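Here is a rough sketch of that ranking scheme; the names and single-character type codes are illustrative, not the program's actual ones:

  import java.util.*;

  public class TypedWord {
      // e.g. 'N' -> 20, 'M' -> 4: times recognized as each type
      private final Map<Character, Integer> typeCounts = new HashMap<>();

      // Increment the count each time the word occurs as this type.
      public void sawAs(char type) {
          typeCounts.merge(type, 1, Integer::sum);
      }

      // The most likely type is simply the highest-ranked one so far.
      public char mostLikelyType() {
          return Collections.max(typeCounts.entrySet(),
                  Map.Entry.comparingByValue()).getKey();
      }

      public static void main(String[] args) {
          TypedWord duck = new TypedWord();
          duck.sawAs('N'); duck.sawAs('N'); duck.sawAs('V');
          System.out.println(duck.mostLikelyType()); // N
      }
  }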

But this is not very useful when the program is starting from scratch, since it has no words in its vocabulary to recognize. There are a couple of things in the program to help out in the beginning. First, there is a list of predefined words in the vocabulary to begin with. They are words which are extremely common and largely essential to communicating in English, such as state-of-being verbs (is, was, be, ...), pronouns, basic qualifiers (that, this, ...), most prepositions, all helping verbs and conjunctions, many transitions, and so on. They are ranked as the types they should be, with high values. However, they can theoretically be "overridden" in time by using them incorrectly often enough.

Another way the program is aided in the early stages of its vocabulary and development is by offering the user intermittent queries about the nature of words or word groups as it analyzes sentences. It may find itself at a juncture where it is essential to know the type of an unknown word in order to parse a sentence correctly, and thus assign the correct types to all the other words. It may ask questions like "Is this word a verb?", and act appropriately depending on the answer. There are certain settings that will make it act more or less autonomously, such as always treating words following "to" as infinitive verbs in certain cases, and other such conditions. The querying can be turned off altogether to make the program always use its "best judgement" on the statements. In terms of having a useful program, it is desirable to get it to "know" enough that it can interpret reasonably accurately without asking any questions. This will indicate that it has "learned" enough about English that it may begin extracting meaning from statements more consistently. Of course, this state is a good distance from the program's initial state, and that is even assuming the program's learning and adaptation algorithms are effective.
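A trivial sketch of such a query, assuming console I/O in place of the applet's actual interface:

  import java.util.Scanner;

  public class TypeQuery {
      // Fallback when a word's type is essential but unknown: ask the user.
      static boolean askIsVerb(String word) {
          System.out.print("Is \"" + word + "\" a verb? (y/n) ");
          return new Scanner(System.in).nextLine().trim().startsWith("y");
      }

      public static void main(String[] args) {
          if (askIsVerb("flarn"))
              System.out.println("Recorded as a verb.");
      }
  }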

Another way the program could be adaptive (though this hasn't been applied) is to keep track of and document newly encountered phrase and sentence structures. English sentences can be very complicated structurally, and the syntax used in this program does not cover the full range of possible structures. The program could have some type of dynamic syntax, where structures could be added or removed; when it encounters something unknown, it could document the structure and look for it in the future. This way, it could be much more flexible in its interpretations.
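Though not implemented here, a dynamic syntax might be as simple as making the rule table mutable, as in this hypothetical sketch:

  import java.util.*;

  public class DynamicGrammar {
      private final Map<String, List<String[]>> rules = new HashMap<>();

      // Document a newly encountered structure and look for it in the future.
      public void addStructure(String phraseType, String... expansion) {
          rules.computeIfAbsent(phraseType, k -> new ArrayList<>()).add(expansion);
      }

      public List<String[]> structuresFor(String phraseType) {
          return rules.getOrDefault(phraseType, Collections.emptyList());
      }

      public static void main(String[] args) {
          DynamicGrammar g = new DynamicGrammar();
          g.addStructure("SENT", "NP", "VP");
          g.addStructure("SENT", "VP", "NP");   // a newly observed structure
          System.out.println(g.structuresFor("SENT").size()); // 2
      }
  }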

Interpretation and Logic:
This program is not yet at the stage where it constructs logical statements from parsed sentences, but it is primed to do so. Stored with each word in the vocabulary, along with the data representing its types, are boolean variables which indicate qualities of logic that the word possesses. These are qualities which could indicate negativity (not, no), universality (all, every), existentiality (some, a, one), and the like. Others are additive (and, also), alternative (or), conditional (if), implicative (then), etc. Some are not even qualities pertaining to logic in the classic sense, but qualities which might pertain to structure or meaning, such as referential (words which indicate something unspoken, like that, it, he, ...), possessive (his, mine), or general (words for which there are clearly more than one instance, like cars, dogs, trees, ...).

Some of these qualities take on special logical meaning if grouped, such as negative and universal (not all, none), negative and additive (nor), negative and general (the Eiffel Tower), and the like.

Data like this can provide much additional insight into the meaning of statements during interpretation. In the case of this program, such qualities can allow it to make more useful statements of logic, more powerful and detailed deductions from those statements, and better interpretations of statements in the future, assuming it has the necessary methods of implementation.
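As an illustration, the boolean qualities could be gathered into a set per word, something like the following sketch. The quality names follow the text above; the representation and entries are my own:

  import java.util.*;

  public class LogicQualities {
      enum Quality { NEGATIVE, UNIVERSAL, EXISTENTIAL, ADDITIVE, ALTERNATIVE,
                     CONDITIONAL, IMPLICATIVE, REFERENTIAL, POSSESSIVE, GENERAL }

      static final Map<String, EnumSet<Quality>> QUALITIES = new HashMap<>();
      static {
          QUALITIES.put("not",  EnumSet.of(Quality.NEGATIVE));
          QUALITIES.put("all",  EnumSet.of(Quality.UNIVERSAL));
          QUALITIES.put("some", EnumSet.of(Quality.EXISTENTIAL));
          QUALITIES.put("nor",  EnumSet.of(Quality.NEGATIVE, Quality.ADDITIVE));
      }

      public static void main(String[] args) {
          // Grouped qualities take on special meaning: "not" + "all" = none.
          EnumSet<Quality> group = EnumSet.copyOf(QUALITIES.get("not"));
          group.addAll(QUALITIES.get("all"));
          System.out.println(group.containsAll(
              EnumSet.of(Quality.NEGATIVE, Quality.UNIVERSAL))); // true
      }
  }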

AIML
AIML (Artificial Intelligence Markup Language) is part of an interesting alternative approach to NLP which I encountered. It was developed and used by the ALICE foundation (creator of the Alicebot mentioned earlier). Its philosophy is very different from the traditional philosophies of NLP which I've already summarized in this report. For instance, it does not rely on examining syntax and diagramming sentences to evaluate their meaning, nor does it have any protocol for deriving meaning from word definitions or predefined traits of words. It doesn't even have a vocabulary in the strictest sense of the word. It relies on the concept of stimulus-response, which is, simply put, prerecorded auto-responses to user input. For instance, the user types "Hi", and it matches that input with one of a variety of responses, like "hello" or "hi there".

A segment of AIML may look like this:
<aiml>
<category>
<pattern>what color is the sky?</pattern>
<template>blue.</template>
</category>
</aiml>

The AIML tag <category> marks off input/response pairs (which are represented by the <pattern> and <template> tags). The real power of AIML begins with its recursive capabilities: its ability to map certain inputs onto other inputs before evaluation. For instance, the segment:
<pattern>Are you interested in *</pattern>
<template><srai>Do you like *</srai></template>
replaces the input "Are you interested in *" with something that means the same, "Do you like *", so the two statements call the same response.
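To show the flavor of this stimulus-response matching in the language of this project, here is a toy Java sketch. It is not the actual ALICE engine, and it is far cruder than real AIML pattern matching (only a trailing "*" is handled), but it captures the pattern/template/srai idea:

  import java.util.*;

  public class MiniBot {
      // pattern -> template; a trailing "*" in a pattern captures the rest.
      static final Map<String, String> CATEGORIES = new LinkedHashMap<>();
      static {
          CATEGORIES.put("hi", "Hello there.");
          CATEGORIES.put("do you like *", "I do like *.");
          // srai-style: rewrite the input and match again.
          CATEGORIES.put("are you interested in *", "<srai>do you like *</srai>");
      }

      static String respond(String input) {
          for (Map.Entry<String, String> c : CATEGORIES.entrySet()) {
              String p = c.getKey(), star = null;
              if (p.endsWith("*") && input.startsWith(p.substring(0, p.length() - 1)))
                  star = input.substring(p.length() - 1);   // what "*" matched
              else if (!p.equals(input))
                  continue;
              String out = c.getValue().replace("*", star == null ? "" : star);
              if (out.startsWith("<srai>"))                 // recursive re-evaluation
                  return respond(out.substring(6, out.length() - 7));
              return out;
          }
          return "I have no answer for that.";
      }

      public static void main(String[] args) {
          System.out.println(respond("are you interested in chess"));
          // redirected to "do you like chess" -> "I do like chess."
      }
  }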

This is a very simple approach which is very light on any kind of linguistic or cognitive theory, but thanks to its recursive and generalized statements it can actually generate some fairly sophisticated responses with the right AIML programming behind it. The chatterbot ALICE, developed with AIML, has won the Loebner Prize for the past two years; the prize is a contest to see which machine can come closest to passing the Turing test. Apparently, ALICE was able to "fool" human judges more often than other NLP systems.

This type of application is particularly useful for informational guides, interactive knowledge bases or tutorials and the like.

The Program so far:
In the form of a Java applet, I took a very straightforward object-oriented approach. The latest JDK must be installed to run it in a web page. Here is a graphic outlining the structure of the classes used to this point.

 

Sources:

Karen Spärck Jones, University of Cambridge
http://www.cs.columbia.edu/%7Eradev/acl/karen/

George Miller
http://www.cisp.org/imp/march_2001/miller/03_01miller.htm

http://cogsci.ucsd.edu/%7Ebatali/108b/lectures/natlang.txt

http://www.alicebot.org/