As Jim Cowie and Yorick Wilks said in one article, “Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts”. We should add that Information Extraction is a technology based on analyzing natural language: when a fact about a topic is taken from a document, it is automatically entered into a database. Computational Linguistics techniques play an important role in IE, because IE is, in a way, interested in the structure of the text, unlike IR, which treats texts as “bags of words”.

When the user enters a word or sentence, he only gets the specific information he is interested in (after a process of text analysis). So, instead of documents, which is what Information Retrieval offers, we get just the information we need. That information has probably been taken from a collection of documents, but it has been summarized.

IE is getting more and more important, for the amount of information available on the internet grows every day. People can get to that information more easily thanks to marking up the data with XML tags, among other things. And not only individual “people” turn to IE: groups also use it to summarize medical documents or to build medical and biomedical ontologies.

These are the most common subtasks in IE:

  • Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
  • Coreference: identification of chains of noun phrases that refer to the same object. Anaphora resolution, for example, is a type of coreference.
  • Terminology extraction: finding the relevant terms for a given corpus.
  • Relationship Extraction: identification of relations between entities (for example, which person works for which organization).

IE hasn’t reached the market yet, but it could become a great help to industries of all kinds. Yorick Wilks and Jim Cowie give an example: “finance companies want to know facts of the following sort and on a large scale: what company take-overs happened in a given time span; they want widely scattered text information reduced to a simple data base”.
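The first of those subtasks, Named Entity Recognition, can be sketched in a few lines. This is only a toy illustration: the organization names are made-up examples kept in a fixed list, and a four-digit-year pattern stands in for temporal expressions, whereas real IE systems use trained statistical models. The idea of tagging text spans with entity types is the same, though.

```python
import re

# Hypothetical gazetteer of organization names (made-up examples)
ORGANIZATIONS = ("Acme Corp", "Globex")

def extract_entities(text):
    """Return (span, type) pairs found in the text."""
    entities = []
    for org in ORGANIZATIONS:
        if org in text:
            entities.append((org, "ORG"))
    # four-digit years stand in for temporal expressions
    for match in re.finditer(r"\b(?:19|20)\d{2}\b", text):
        entities.append((match.group(), "DATE"))
    return entities

print(extract_entities("Acme Corp was taken over by Globex in 2004."))
# → [('Acme Corp', 'ORG'), ('Globex', 'ORG'), ('2004', 'DATE')]
```

A relationship-extraction step would then look for patterns connecting the tagged entities, such as “X was taken over by Y”, which is exactly the kind of take-over fact the quotation above mentions.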


Speech Recognition is a branch of Artificial Intelligence that enables spoken communication between humans and computers. However, there are difficulties in getting a more or less acceptable interpretation of the message, because combining information from different sources (such as the acoustic, phonetic, semantic or pragmatic ones) is ambiguous, and some mistakes are unavoidable in the process.

Nearly all speech recognizers use libraries of speech sounds. Creating these dictionaries is important, because the system has to recognize the words the user utters. To make recognition easier, there is recognition of vowels and of consonants, and also noise masking (some mobile phones, for example, work when we “talk” to them, and if we are on the street, something must make the sound clear). But even with these advantages, mistakes cannot be avoided. Most speech recognition algorithms rely only on the sound of the individual words, and not on their context, so they don’t understand speech; they recognize words. Here is an example of what could happen:

The child wore a spider ring on Halloween.

He was an American spy during the war.

The sound of “spider ring” and “spy during” is exactly the same. We hear the correct words depending on the context, and that is something we do unconsciously.
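A language model over word sequences is one way a recognizer can use context to break this tie. The sketch below is only an illustration with made-up bigram counts from a hypothetical tiny corpus; a real system estimates a statistical model from huge amounts of text. Given the Halloween sentence, the “spider ring” reading scores higher than the “spy during” one.

```python
# Made-up bigram counts standing in for a statistical language model
bigram_counts = {
    ("the", "child"): 2,
    ("child", "wore"): 2,
    ("wore", "a"): 10,
    ("a", "spider"): 4,
    ("spider", "ring"): 3,
    ("spy", "during"): 5,
}

def context_score(words):
    # sum of bigram counts along the sentence: a crude stand-in for
    # the probability a language model would assign to the sequence
    return sum(bigram_counts.get(pair, 0) for pair in zip(words, words[1:]))

heard_a = ["the", "child", "wore", "a", "spider", "ring"]
heard_b = ["the", "child", "wore", "a", "spy", "during"]
print(context_score(heard_a), context_score(heard_b))
# → 21 19
```

Because both candidates sound identical, only this kind of contextual score lets the recognizer prefer the reading that makes sense in the sentence.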

There are many applications for this system, but I think the most interesting one is that people with disabilities benefit from it. Some are unable to use their hands; others are deaf and use deaf telephony (voicemail to text, relay services or captioned telephone); and others have learning disabilities. There’s no doubt that our lives will be easier in a few years’ time, when these systems get better.

When we search for information on the net we can obtain it from different places and in different ways, for there are loads of data available on the internet.

Question answering, also known as QA, is a way of getting that information; this kind of system should be able to answer our questions (asked in natural language) by searching in pre-structured databases or in documents written in natural language.

As Dell Zhang and Wee Sun Lee wrote in one article, “it is important for an online question answering system to be practical, because it is time-consuming to download and analyze the original web documents”. A question answering system is another kind of information retrieval system, but what QA systems do is supply just the information we need, not a list of possibilities as search engines usually do. To obtain the answers, QA systems combine several NLP techniques, because the answer depends on the type of question.

And as I have said, depending on the question, the methods used to find the answers are different. There are two methods: shallow and deep. The first one finds fragments of documents, filters them based on the presence of the required answer, and then orders the answers based on different criteria, such as word order. If the way the question is formulated is not enough (or, for example, some of the questions are classified with an incorrect type), the second method is used: “More sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer”.
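The shallow method can be sketched very simply: split the collection into passages, keep those that share words with the question, and rank them by overlap. The passages and the question below are made-up examples, and real systems use much better filtering and ranking criteria, but the pipeline shape is the one described above.

```python
import re

def shallow_qa(question, passages):
    """Rank candidate passages by word overlap with the question."""
    q_words = set(re.findall(r"\w+", question.lower()))
    scored = []
    for passage in passages:
        p_words = set(re.findall(r"\w+", passage.lower()))
        overlap = len(q_words & p_words)
        if overlap:  # filter: keep only passages sharing words
            scored.append((overlap, passage))
    # rank: most overlapping passage first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored]

passages = [
    "Madrid is the capital of Spain.",
    "The weather in summer is hot.",
    "Spain joined the EU in 1986.",
]
print(shallow_qa("What is the capital of Spain?", passages)[0])
# → Madrid is the capital of Spain.
```

The deep method would then apply syntactic and semantic analysis to the top passages to extract or construct the actual answer, rather than returning the passage itself.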

So, there have been many advances in these kinds of information retrieval systems, but processing natural language with computers is quite difficult, and it can be hard to get the data we are looking for when using that kind of language with systems that still have to improve a lot.

Here is the list of 10 research topics in major sites on Human Language Technologies I have chosen:

  1. Machine Translation
  2. Question answering systems
  3. Machine Learning in NLP
  4. Development of linguistic resources and tools
  5. Speech Recognition and Synthesis
  6. Intelligent systems for natural language interaction
  7. Information retrieval, question answering, and information extraction
  8. Monolingual and multilingual text generation
  9. Lexical semantics and word sense disambiguation
  10. Human factors in MT and user interfaces

I’ll write one article for each topic that I have put in bold.

Topics taken from:

There are some important researchers in the field of Human Language Technologies (HLT). One of those researchers is Martin Kay. As he says, his main interests are translation (by people and machines) and computational linguistic algorithms, especially in the fields of morphology and syntax. He is well known for his work in computational linguistics; what’s more, he started working in one of the earliest centres of Computational Linguistics research: the Cambridge Language Research Unit. He is now Professor of Linguistics at Stanford University, and his contributions to Human Language Technologies in subjects such as chart parsing and functional unification grammar have to be mentioned, as well as the fact that he has been regarded as a leading authority on machine translation.

Another important researcher is Yorick Wilks, a British computer scientist who is a Professor of Computer Science at the University of Sheffield. There he directs the Institute for Language, Speech and Hearing. He wrote an algorithmic method “for assigning the ‘most coherent’ interpretation to a sentence in terms of having the maximum number of internal preferences of its parts (normally verbs or adjectives) satisfied”. In the 1990s he got interested in modeling human–computer dialogue, and at present he is the Director of the EU-funded Companions Project on creating long-term computer companions for people.

Hans Uszkoreit is another researcher who has to be mentioned. He is Professor of Computational Linguistics at Saarland University and head of the DFKI Language Technology Lab, and he serves as Scientific Director at that German Research Center for Artificial Intelligence. During his career he has been affiliated with several centers, and he is a member of many associations, such as the European Academy of Science and the International Committee on Computational Linguistics.

Human Language Technologies (HLT), also known as Language Technologies or Natural Language Processing (NLP), are closely connected to computer science and linguistics. HLT enables people to interact with machines with more ease. Here is an example of how HLT can help people: “This can benefit a wide range of people – from illiterate farmers in remote villages who want to obtain relevant medical information over a cellphone, to scientists in state-of-the-art laboratories who want to focus on problem-solving with computers.”

As Hans Uszkoreit wrote in one of his publications, there is a communication problem in the interaction between humans and machines. Machine language and human language are not the same, since a machine’s domain of language is very restricted. But with NLP, the data used by computers becomes readable for humans; NLP designs communication mechanisms that work through programs which simulate communication.

But, although there have been many advances in this field, we can still find some difficulties when we communicate with a computer. When we enter a sentence, it is likely that some words have more than one meaning; and if we don’t pay attention to the structure of the sentence, it can become ambiguous for the computer, which may not understand what we intended to say. But, as the researcher previously mentioned said, “the whole world of multimedia information can only be structured, indexed and navigated through language”, so it is just a question of years and development before HLT works without any problem.


RSS stands for Rich Site Summary or Really Simple Syndication. It is a “format for syndicating news and the content of news”. If you look for some information in the RSS format, it is likely that the information you get is more or less what you wanted, and you get it quickly and up to date as well.

Its structure is made up of items, and each item has a title, a summary of a text and a link to the original source on the web where the whole text is located. RSS files contain a summary of what has been published on the original website, but they carry not only news: changes to a website can also be shown, or “the revision history of a book”.
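Because RSS is plain XML, the item structure just described is easy to read programmatically. Below is a minimal, made-up RSS 2.0 document with a single item, parsed with Python’s standard library; the feed title, summary and link are invented examples, but the title / description / link layout is the one the format defines.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical RSS 2.0 feed with one item
rss = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>A post about HLT</title>
      <description>A short summary of the post.</description>
      <link>http://example.com/hlt-post</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# each <item> carries a title, a summary (<description>) and a link
for item in root.iter("item"):
    print(item.findtext("title"), "->", item.findtext("link"))
```

An aggregator does essentially this for every feed you subscribe to, comparing the items it finds with the ones it has already shown you.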

You can both obtain and offer information with RSS, since those files contain metadata about the information sources; but to share information, some software is needed. The files a site offers are its feeds, and the program that reads them is an aggregator, which can be installed on the user’s computer, although some browsers include one, and another option is to register on the website of an aggregator.

So if you like a site and you know that you’ll visit it quite often, you should subscribe to its feed, so that you are informed when the site is brought up to date, when new information has been included, etc. And if you are the one who wants to offer information, you have to create your own feed and update it quite often to keep it interesting for the rest of the users.