Archive for May, 2009

In the third questionnaire, one of the topics we covered was machine translation. We were asked to write a short curriculum vitae in Spanish and then translate it with three different online translators (Google Translator, Lucy Translator and Reverso Translator). The results were more or less satisfactory, but there were some serious mistakes in the Spanish-English translation. Here is an example:

Entre mis aficiones, además de los idiomas, se encuentra la música. Estudié solfeo durante 8 años en el Conservatorio Municipal de Música Bartolomé Ercilla de Durango […]
Among my interests, besides the languages, he|she finds the music. I studied sol-fa|solfeggio for 8 years in the Municipal Conservatoire|Conservatory of Music Bartolomé Ercilla de Durango […]

What’s more, one online translator warns us that an automatic translation will never have the same quality as a translation done by a person (and that the result will be worse if the language is colloquial). Nevertheless, it is useful and saves a lot of time.

Machine Translation (MT) is a sub-field of computational linguistics: the use of computers to translate a text from one natural language into another. At its most basic, MT substitutes the words of one natural language with those of another; more complex systems use corpus techniques, take linguistic typology into account and handle idioms, among other things.
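To see why plain word substitution is not enough, here is a minimal Python sketch of it. The tiny lexicon and the function name `translate_word_for_word` are made up for illustration; they do not come from any of the translators mentioned above.

```python
# Hypothetical toy Spanish-English lexicon (an assumption for this sketch);
# a real system would need a far larger dictionary.
LEXICON = {
    "la": "the",
    "casa": "house",
    "blanca": "white",
}


def translate_word_for_word(sentence, lexicon=LEXICON):
    """Replace each word with its dictionary entry, keeping the source word order."""
    tokens = sentence.lower().split()
    # Words missing from the lexicon are passed through unchanged.
    return " ".join(lexicon.get(token, token) for token in tokens)


result = translate_word_for_word("La casa blanca")
# result is "the house white": every word is right, but the order is still Spanish.
```

The output keeps the Spanish noun-adjective order, which is exactly the kind of problem that typology-aware and corpus-based systems try to solve.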

Users can interact with some translators to make the translations less ambiguous, since some of these systems let the user indicate which words are names. Other translators offer a list of suggestions: the user chooses the one that best fits what he was looking for and, if none of the possibilities is suitable, makes changes until he gets what he wants. The results of the TransType project showed that users spent less time and effort when translating this way.

To sum up, we should add something that Ana Fernández Guerra and Francisco Fernández wrote in the book “Machine Translation, Capabilities and Limitations”. We can make some statements about the activity of translating:

  1. The possibility of translation: if we are supposed to reproduce every single piece of text or linguistic structure with total exactness in another language, we will find it difficult.
  2. We should realize that we don’t translate from one language as a system to another language as a system, but from one text into another text.
  3. We should be cautious about some dogmatic statements.
  4. In the content (or message) of the text we must consider: meaning, designation and sense.


As Jim Cowie and Yorick Wilks said in one article, “Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts”. We have to add that Information Extraction is a technology based on analyzing natural language: when a fact about a topic is taken from a document, it is automatically entered into a database. Computational linguistics techniques play an important role in IE because IE is, in a way, interested in the structure of the text, unlike Information Retrieval (IR), which treats texts as “bags of words”.
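The “bag of words” idea can be illustrated with a small Python sketch. The two sentences and the helper names `bag_of_words` and `naive_triple` are invented for this example, and the “structure” step is only a crude split around one known verb, not a real parser.

```python
from collections import Counter


def bag_of_words(text):
    """Count the words and throw away their order, as a bag-of-words model does."""
    return Counter(text.lower().split())


def naive_triple(text, verb="bought"):
    """Very rough stand-in for structure: split the sentence around a known verb."""
    words = text.lower().split()
    i = words.index(verb)
    return (" ".join(words[:i]), verb, " ".join(words[i + 1:]))


a = "the bank bought the insurer"
b = "the insurer bought the bank"

# Both sentences contain the same words, so their bags are identical.
same_bag = bag_of_words(a) == bag_of_words(b)

# Keeping the order tells us who bought whom.
triple_a = naive_triple(a)
triple_b = naive_triple(b)
```

Here `same_bag` is true even though the two sentences state opposite facts, while the two triples differ. That difference is the part of the text IE cares about.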

When the user enters a word or sentence, he only gets the specific information he is interested in (after a process of text analysis). So, instead of documents, which is what Information Retrieval offers, we get just the information we need. That information has probably been taken from a collection of documents, but it has been summarized.

IE is getting more and more important, for the amount of information available on the internet grows every day. People can get to that information more easily thanks to, among other things, data marked up with XML tags. And it is not only individuals who turn to IE: groups also use it to summarize medical documents or build medical and biomedical ontologies.

These are the most common subtasks in IE:

  • Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
  • Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference.
  • Terminology extraction: finding the relevant terms for a given corpus.
  • Relationship Extraction: identification of relations between entities, such as:

IE hasn’t reached the market yet, but it could become a great help to industries of all kinds. Here is an example from Yorick Wilks and Jim Cowie: “finance companies want to know facts of the following sort and on a large scale: what company take-overs happened in a given time span; they want widely scattered text information reduced to a simple data base”.
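The take-over scenario in that quotation can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the company names and sentences are invented, the regular expression only covers three verbs, and a real IE system would rely on much richer linguistic analysis than a single pattern.

```python
import re
import sqlite3

# Invented example sentences; the company names are not real data.
TEXTS = [
    "Acme Corp acquired Bolt Industries in 2007 after long talks.",
    "Analysts were surprised when Delta Bank took over Epsilon Trust in 2008.",
    "No deal was announced by Gamma Ltd.",
]

# A capitalized name is one or more capitalized words in a row.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"
TAKEOVER = re.compile(
    rf"(?P<buyer>{NAME}) (?:acquired|took over|bought) (?P<target>{NAME}) in (?P<year>\d{{4}})"
)


def extract_takeovers(texts):
    """Return (buyer, target, year) tuples for every sentence matching the pattern."""
    rows = []
    for text in texts:
        for m in TAKEOVER.finditer(text):
            rows.append((m.group("buyer"), m.group("target"), int(m.group("year"))))
    return rows


# Reduce the scattered text to a simple database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE takeovers (buyer TEXT, target TEXT, year INTEGER)")
conn.executemany("INSERT INTO takeovers VALUES (?, ?, ?)", extract_takeovers(TEXTS))

# Which take-overs happened in a given time span?
in_2008 = conn.execute(
    "SELECT buyer, target, year FROM takeovers WHERE year BETWEEN ? AND ? ORDER BY year",
    (2008, 2008),
).fetchall()
```

With these three sentences the table ends up with two rows, and the query for 2008 returns only the Delta Bank row. The third sentence yields nothing because it states no take-over.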
