Archive for May, 2011

Google Translate

In the following lines I will analyze one of the most successful translators of this century: Google Translate. It is a free online statistical machine translation service owned by Google Inc. that instantly translates between many languages (57), such as Polish, German, Dutch and Spanish. However, some languages are translated better than others: some are fully supported by Google Translate, while others are what the company calls “alpha languages”, meaning that their translations are of lower quality.

It is possible to translate long texts, but the system limits the number of paragraphs. Nevertheless, users who want to translate a whole website can use Google Chrome, a fast, free browser that automatically translates websites into many languages. Google Translate also offers other tools, such as Google Translated Search (the information you are looking for may not be available in your own language, so the system searches for the best results and translates them into your language) and the iPhone version, which allows voice input.

The aim of this enterprise is “to make information universally accessible, regardless of the language in which it is written”. That is why the service has kept improving since it started. Nowadays, many things can be done that could not be done at the beginning. For example, in the first version only English could be translated into some other languages; now translation also works the other way round. Moreover, it is possible to see the romanization of languages such as Chinese or Greek and, in the latest version, launched in January 2011, to see several possible translations for a specific word. Users themselves also help the translator to improve: they can raise the quality of translations by suggesting improvements or by uploading their translation memories to Google Translate’s Translator Toolkit. Furthermore, the service itself sometimes asks users for alternative translations of technical terms.

But how does this translator work? As mentioned above, Google Translate is a statistical machine translation (SMT) system, an approach completely different from traditional rule-based translation. Rule-based machine translation systems, used some years ago, applied the rules and grammar of the languages being translated. However, linguists knew that not all languages follow the same rules (e.g. the word order of some languages is subject-verb-object, while in others it is verb-subject-object), which is why the translations were not very good.

Then came statistical machine translation, in which the computer looks for patterns in millions of documents. These documents have already been translated by human beings, and thanks to them the computer can work out more or less what the translation should be. However, the translations are not always perfect, and their quality depends mainly on the number of documents the computer can analyze to find patterns. That is why Google Translate translates German, for example, better than Basque: it has more German documents than Basque documents. Franz Josef Och, who heads machine translation at Google, is in favour of statistical machine translation. The documents available to the system include United Nations documents.
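To make the idea of "looking for patterns" in already-translated documents more concrete, here is a minimal sketch of a simplified IBM Model 1, a classic statistical word-translation model. It is not Google's actual system; the three-sentence German-English corpus and the function names `train_model1` and `best_translation` are invented for illustration, and the model omits the usual NULL word.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (invented for illustration): German -> English.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]


def train_model1(pairs, iterations=10):
    """Estimate t(target word | source word) with expectation-maximization."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform guess: every target word is equally likely.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) pairings
        total = defaultdict(float)  # expected pairings per source word
        for src, tgt in pairs:
            for e in tgt:
                # Share each target word among the source words of its sentence.
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the probabilities from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)


t = train_model1(corpus)
print(best_translation(t, "haus"))
print(best_translation(t, "buch"))
```

Even with only three sentence pairs, the model learns that haus goes with house, because das also appears in a sentence without house. With more documents the estimates become more reliable, which matches the point above about German and Basque.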

Finally, this way of translating texts has advantages. For instance, the quality is better than in rule-based translation, the translations sound more natural, and resources are used more efficiently. However, there are also some disadvantages, including problems with sentence alignment, differing word orders, compound words, idioms and morphology.

Do not hesitate to watch the following video, which explains how SMT works. If you are interested in knowing more about the problems Google Translate has, you can read the portfolio in which I comment on the main ones here: http://wiki.littera.deusto.es/en/index.php/User:1adcaden/trans0910/Portfolio



COCA-Corpus of Contemporary American English

Nowadays, students of foreign languages, teachers and linguists have many tools available for learning new languages or improving their knowledge of the language they are studying. However, many people do not know that these tools exist and so cannot take advantage of them. Students can use translators, dictionaries, grammars… One tool that can be very useful for studying a language at a high level, and how that language is structured, is corpus linguistics. The following lines describe what corpus linguistics is and present one specific corpus that has become very popular: the Corpus of Contemporary American English (COCA), created by Mark Davies, a prominent professor of corpus linguistics at Brigham Young University.

First of all, what do we understand by corpus linguistics? Wikipedia defines it as follows:

 Corpus Linguistics is the study of language  as expressed in samples (corpora) or “real world” text. This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. Originally done by hand, corpora are now largely derived by an automated process.

At first sight, corpus linguistics may seem a better way to study a language than grammars, because corpus samples show how the language is really used by native speakers. However, this approach also has some disadvantages. For example, as Noam Chomsky argued, real language is riddled with performance-related errors, which calls for careful analysis of small speech samples; this is not what corpus linguistics does, because linguists work with large collections of examples. Nevertheless, the field has been improving and nowadays we have very good corpora, which include many samples and are very well structured. One corpus that has to be mentioned is COCA.

The Corpus of Contemporary American English is a free online corpus that contains 425 million words in 160,000 different texts drawn from a variety of sources and genres. It is the largest corpus of American English currently available. Moreover, 20 million words have been included for each year since 1990. More than 40,000 users visit the corpus each month. The genres are spoken language (85 million words) from 150 TV and radio programmes; fiction (81 million words) from short stories and plays; popular magazines (86 million words); newspapers (81 million words); and academic journals (81 million words). Furthermore, users can search for the frequency of a word in each genre, which helps us to know, for example, whether a word is used in academic writing or not. It is also possible to compare how the use of certain words has changed from 1990 to the present, and to exclude a specific genre when we think it is not going to be useful.
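Because the genres are not all the same size, comparing a word across genres only makes sense once the raw counts are normalized. The sketch below shows that arithmetic, using the genre sizes quoted above; the raw counts in `hypothetical_counts` are invented for illustration, and the function names are my own, not part of the COCA interface.

```python
# Genre sizes in millions of words, as quoted in the post.
GENRE_SIZE_MILLIONS = {
    "spoken": 85,
    "fiction": 81,
    "magazine": 86,
    "newspaper": 81,
    "academic": 81,
}


def per_million(raw_counts):
    """Convert raw counts per genre into occurrences per million words."""
    return {
        genre: raw_counts.get(genre, 0) / size
        for genre, size in GENRE_SIZE_MILLIONS.items()
    }


def most_typical_genre(raw_counts):
    """Return the genre in which the word is relatively most frequent."""
    rates = per_million(raw_counts)
    return max(rates, key=rates.get)


# Hypothetical raw counts for some formal word (not real COCA data).
hypothetical_counts = {
    "spoken": 170,
    "fiction": 81,
    "magazine": 430,
    "newspaper": 243,
    "academic": 1620,
}

for genre, rate in per_million(hypothetical_counts).items():
    print(f"{genre:10s} {rate:6.1f} per million")
print("Most typical genre:", most_typical_genre(hypothetical_counts))
```

With these invented counts the word occurs 20 times per million words in academic journals and only twice per million in spoken language, which is the kind of evidence that would suggest it belongs to academic writing.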

But why is this corpus so good? There are many reasons. For instance, its researchers have worked for many years to improve it, and their work is also connected to other important corpora such as the British National Corpus (BNC), the TIME Corpus and the Corpus of Historical American English (COHA). The corpus is also updated with new words from time to time; the latest update was in 2011. Users can search for many things within the interface: exact words (e.g. mysterious); parts of speech; lemmas, which cover all the forms of a word (e.g. the base form sing has forms such as sings, sang, sung and singing); and wildcards, an option the system offers when you do not know exactly how a word is written (e.g. un*ly, for which the system would return unlikely, unusually…). It is also possible to search for collocates within a ten-word window (e.g. all nouns somewhere near faint, all adjectives near woman, or all verbs near feelings).
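The wildcard and collocate searches described above can be imitated on a small scale. This is only a sketch under my own assumptions: the word list and the sentence are invented, `wildcard_search` and `collocates` are hypothetical helper names rather than COCA functions, and the window is taken to mean up to `window` words on each side of the search word. Filtering by part of speech (e.g. only nouns near faint) would need a tagged corpus and is left out.

```python
from collections import Counter
from fnmatch import fnmatchcase


def wildcard_search(pattern, vocabulary):
    """Return the words matching a pattern such as 'un*ly' (* = any letters)."""
    return sorted(w for w in vocabulary if fnmatchcase(w, pattern))


def collocates(tokens, node, window=10):
    """Count the words found within `window` words on either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented data for illustration.
vocabulary = ["unlikely", "unusually", "ugly", "under", "unruly", "likely"]
tokens = "she heard a faint sound and saw a faint light in the dark".split()

print(wildcard_search("un*ly", vocabulary))
print(collocates(tokens, "faint", window=2).most_common(3))
```

The wildcard search returns unlikely, unruly and unusually from the sample list, and the collocate search with a window of two words finds sound and light next to faint.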

Other good points are the possibility of comparing the collocates of two related words (e.g. banana and apple, or little and small; this shows the difference in meaning between the words and how each one is used), of finding the frequency and distribution of synonyms for nearly 60,000 words, and of creating our own lists of related words.

The following example illustrates how the interface works. In this case, we will analyze the collocates that precede the nouns apple and banana. The first chart shows the results for apple: it is very often preceded by an article such as the or an.

WORD 1 (W1): APPLE (3.95)

     WORD     W1     W2     W1/W2    SCORE
1    THE    1648    445       3.7      0.9
2    AN     1325      0   2,650.0    671.6

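Banana, however, has fewer cases: the precedes it 445 times, against 1,648 times for apple. It could be said that apple is preceded by determiners more often than banana is, although a banana still appears 602 times.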

WORD 2 (W2): BANANA (0.25)

     WORD     W2     W1     W2/W1    SCORE
1    A       602      8      75.3    296.9
2    THE     445   1648       0.3      1.1
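The figures in the two charts can be approximately reproduced with a little arithmetic. The sketch below rests on my own reading of the numbers, not on COCA's documentation: the ratio column appears to divide one count by the other, a count of 0 appears to be replaced by 0.5 (1325 / 0.5 = 2,650), and the score appears to be that ratio divided by the overall figure shown next to each word (3.95 for apple, 0.25 for banana). The function name `compare_collocate` and the parameter `zero_floor` are hypothetical.

```python
def compare_collocate(count_a, count_b, overall_ratio, zero_floor=0.5):
    """Return (ratio, score) for one collocate of word A compared with word B.

    Assumptions inferred from the charts, not from COCA documentation:
    a zero count is replaced by `zero_floor`, and the score is the ratio
    divided by the overall figure printed beside the word.
    """
    ratio = count_a / max(count_b, zero_floor)
    return ratio, ratio / overall_ratio


APPLE_OVERALL = 3.95            # figure shown next to APPLE
BANANA_OVERALL = 1 / 3.95       # shown rounded as 0.25 next to BANANA

rows = [
    ("THE before apple", 1648, 445, APPLE_OVERALL),
    ("AN before apple", 1325, 0, APPLE_OVERALL),
    ("A before banana", 602, 8, BANANA_OVERALL),
    ("THE before banana", 445, 1648, BANANA_OVERALL),
]

for label, count_a, count_b, overall in rows:
    ratio, score = compare_collocate(count_a, count_b, overall)
    print(f"{label:18s} ratio={ratio:9.2f} score={score:8.2f}")
```

Since the overall figures in the charts are rounded, the results only match the charts approximately. Under this reading, a score well above 1 (an with apple, a with banana) marks a collocate that is far more typical of one word than of the other, while a score close to 1 (the) marks a collocate that both words share in proportion to their frequency.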

Finally, it should be noted that if you use this interface many times, you will have to log in. Do not hesitate to use this corpus; attached here is a video by the Emerald Cultural Institute that shows very well how to use COCA.
