Posts Tagged 'Computer science'

COCA-Corpus of Contemporary American English

Nowadays, students of foreign languages, teachers and linguists have many tools available for learning a new language or improving their knowledge of the language they are studying. However, many people do not know that these tools exist and so cannot take advantage of them. Students can use translators, dictionaries, grammars… One tool that can be very useful when studying a language at an advanced level, and how that language is structured, is a corpus, the basic resource of corpus linguistics. The following lines describe what corpus linguistics is and present one specific corpus that has become very popular: the Corpus of Contemporary American English (COCA), created by Mark Davies, a prominent professor of corpus linguistics at Brigham Young University.

First of all, what do we understand by corpus linguistics? Wikipedia defines it as follows:

 Corpus Linguistics is the study of language  as expressed in samples (corpora) or “real world” text. This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. Originally done by hand, corpora are now largely derived by an automated process.

At first sight, corpus linguistics may seem a better way to study a language than grammars, because corpus samples show how the language is really used by native speakers. However, this approach also has some disadvantages. For example, as Noam Chomsky pointed out, real language is riddled with performance-related errors, which is why careful analysis of small speech samples is needed; corpus linguistics does not provide this, because linguists work with large collections of examples. Nevertheless, the field has been improving and nowadays there are very good, well-structured corpora that contain many samples. One corpus that has to be mentioned is COCA.

The Corpus of Contemporary American English is a free online corpus of 425 million words from 160,000 different texts, drawn from a variety of sources and genres. It is the largest corpus of American English currently available. It includes 20 million words for each year since 1990, and more than 40,000 users visit it each month. The genres are: spoken (85 million words) from 150 TV and radio programmes; fiction (81 million words) from short stories and plays; popular magazines (86 million words); newspapers (81 million words); and academic journals (81 million words). Users can search for the frequency of a word in each genre, which helps us to know, for example, whether a word is used in academic writing or not. It is also possible to compare how the use of certain words has changed from 1990 to the present, and to exclude a specific genre when we think it is not going to be useful.
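Because the genres differ slightly in size, raw counts are easier to compare once they are expressed per million words. The following Python sketch uses the genre sizes given above; the raw counts for the example word are invented purely for illustration and are not real COCA figures.

```python
# Genre sizes in millions of words, as listed in the paragraph above.
GENRE_SIZE_MILLIONS = {
    "spoken": 85,
    "fiction": 81,
    "magazine": 86,
    "newspaper": 81,
    "academic": 81,
}


def per_million(raw_counts):
    """Convert raw hit counts per genre into hits per million words."""
    return {
        genre: raw_counts.get(genre, 0) / size
        for genre, size in GENRE_SIZE_MILLIONS.items()
    }


# Hypothetical raw counts for one word (NOT real COCA data).
example_counts = {
    "spoken": 170,
    "fiction": 81,
    "magazine": 430,
    "newspaper": 243,
    "academic": 1620,
}

rates = per_million(example_counts)
# Print the genres from most to least frequent.
for genre, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{genre:<10} {rate:6.1f} per million words")
```

With these made-up counts, the word would be far more frequent in academic journals than in any other genre, which is the kind of conclusion the genre search is meant to support.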

But why is this corpus so good? There are many reasons. For instance, its researchers have been working for many years to improve it, and their work is also connected to other important corpora such as the British National Corpus, the Time Corpus and the Corpus of Historical American English (COHA). The corpus is also updated with new words from time to time; the last update was in 2011. Users can search for many things within the interface: exact words (e.g. mysterious); parts of speech; lemmas, that is, all the inflected forms of a word (e.g. the base form sing covers sings, sang, sung and singing); and wildcards, an option the system offers when you do not know exactly how a word is written (e.g. un*ly, for which the system returns unlikely, unusually…). It is also possible to search for collocates within a ten-word window (e.g. all nouns somewhere near faint, all adjectives near woman, or all verbs near feelings).
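The wildcard and collocate searches described above can be sketched in a few lines of Python. This is only an illustration of the idea on a tiny made-up text, not how the COCA interface is actually implemented; the function names are hypothetical, and the `window` parameter is assumed to count words on each side of the search word.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern):
    """Turn a pattern such as 'un*ly' into a compiled regex ('*' = any letters)."""
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + r"\w*".join(parts) + "$", re.IGNORECASE)


def wildcard_search(pattern, tokens):
    """Return the distinct tokens that match the wildcard pattern, sorted."""
    regex = wildcard_to_regex(pattern)
    return sorted({tok for tok in tokens if regex.match(tok)})


def collocates(node, tokens, window=10):
    """Count the words found within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# A tiny made-up text standing in for a corpus.
text = (
    "it was unlikely that the faint light would last and "
    "unusually the faint sound grew louder"
)
tokens = text.lower().split()

print(wildcard_search("un*ly", tokens))
print(collocates("faint", tokens, window=2).most_common(3))
```

A real corpus search would also filter the neighbouring words by part of speech (for example, only nouns near faint), which requires tagged text and is left out of this sketch.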

Other good points are the possibility of comparing the collocates of two related words (e.g. banana and apple, or little and small; this shows the difference in meaning between the words and how each one is used), of finding the frequency and distribution of synonyms for nearly 60,000 words, and of creating our own lists of related words.

The following example illustrates how the interface works. In this case, we will analyze the collocates that precede the nouns apple and banana. The first chart shows the results for apple. It can be seen that apple is very often preceded by an article such as the or an.

WORD 1 (W1): APPLE (3.95)

    WORD   W1     W2    W1/W2     SCORE
1   THE    1648   445   3.7       0.9
2   AN     1325   0     2,650.0   671.6

However, banana has fewer cases, as the second chart shows. It could be said that apple takes determiners more often than banana does.

WORD 2 (W2): BANANA (0.25)

    WORD   W2     W1     W2/W1   SCORE
1   A      602    8      75.3    296.9
2   THE    445    1648   0.3     1.1
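The columns in the two charts appear to follow a simple pattern: the ratio column divides one word's count by the other's, and SCORE divides that ratio by the overall figure shown in brackets after each word (3.95 for apple). This is an inference from the numbers themselves, not COCA's documented formula. The 0.5 used in place of a zero count is likewise an assumption, made because 1325 / 0.5 gives the 2,650.0 shown for an. A minimal Python sketch under those assumptions:

```python
def compare_counts(count_a, count_b, overall_ratio, zero_floor=0.5):
    """Return (ratio, score) for a collocate of word A compared with word B.

    Assumptions (inferred from the charts, not from COCA documentation):
    - a zero count for word B is replaced by `zero_floor` to avoid dividing by zero;
    - the score is the ratio divided by the overall A-to-B figure.
    """
    denominator = count_b if count_b > 0 else zero_floor
    ratio = count_a / denominator
    score = ratio / overall_ratio
    return ratio, score


# Counts taken from the apple chart; 3.95 is the figure shown next to APPLE.
for word, apple_count, banana_count in [("THE", 1648, 445), ("AN", 1325, 0)]:
    ratio, score = compare_counts(apple_count, banana_count, overall_ratio=3.95)
    print(f"{word:<4} ratio={ratio:,.1f} score={score:,.1f}")
```

With the rounded 3.95, the score for an comes out close to, but not exactly, the 671.6 in the chart, presumably because the interface works with an unrounded overall figure.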

Finally, if you use this interface frequently, you will have to log in. Do not hesitate to use this corpus; attached here is a video made by the Emerald Cultural Institute that shows very well how to use COCA.


Machine Translation. (Questionnaire 2 and Questionnaire 3)

There are other research topics apart from Speaker Recognition and Computational Semantics. One that I find really interesting is Machine Translation. The idea has its origins in the 17th century, when René Descartes (French philosopher, mathematician…) proposed a universal language in which the same idea would be represented by the same symbol across different languages.

Today, Machine Translation (MT) is a field of computational linguistics that investigates how to translate text or speech from one natural language into another using computer software. It is extremely useful in areas where formal language is used, such as legal or administrative documents. Nevertheless, when the text is colloquial or informal, the machine normally makes lots of mistakes because it translates the text word by word. We can see an example in the following picture.
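The word-by-word problem can also be sketched in a few lines of Python. The tiny Spanish-to-English word list below is invented for this illustration and does not come from any real translation system; it simply shows what happens when each word of a colloquial expression is replaced independently.

```python
# Hypothetical Spanish-to-English word list, invented for this illustration.
WORD_DICTIONARY = {
    "me": "me",
    "estás": "you are",
    "tomando": "taking",
    "el": "the",
    "pelo": "hair",
}


def translate_word_by_word(sentence, dictionary):
    """Replace each word independently; unknown words are left unchanged."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


# A colloquial Spanish expression meaning "you are pulling my leg".
literal = translate_word_by_word("Me estás tomando el pelo", WORD_DICTIONARY)
print(literal)
```

Every word has been translated "correctly", yet the result is not an English sentence and the idiomatic meaning is lost, which is why colloquial texts are so much harder for this approach than formal ones.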


Many computer scientists, linguists… have tried to improve machine translation, with more or less success. We could mention Hans Uszkoreit, whom I have talked about in a previous article: he worked on a machine translation project while he was at Austin. Moreover, several projects have been carried out all over the world, for example one at the National Centre for Language Technology in Ireland and the Norwegian-English Machine Translation project in Norway.


Computational Semantics and Speaker Recognition. (Q2)

Nowadays, computer scientists are concerned with many topics, but some are discussed more than others. Computational Semantics is of great importance and many projects have been carried out in this field. But what does Computational Semantics mean? First of all, to understand it you have to know terms like semantics, linguistics and natural language. Wikipedia defines it like this:

“Computational Semantics is the study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions. It consequently plays an important role in natural language processing and computational linguistics.”

In other words, the aim of this multidisciplinary field is to find techniques for automatically constructing semantic representations of human language expressions. These representations, in turn, make it possible to perform inference.

Another important research topic is Speaker Recognition, in which the computer tries to recognize who is speaking. It mainly relies on the features of speech that differ from person to person, such as anatomy (size of the mouth, shape of the throat…) and learned behavioural patterns (speaking style, voice pitch…). The name is similar to Speech Recognition, but the two should not be confused: Speech Recognition means recognizing what is being said.

Finally, one particular project sticks in my mind with regard to speaker recognition. It is called Secure Access Front-End, and its scientists are trying to improve the security of access to services without making those services more complex.


Hans Uszkoreit (Q1)

Hans Uszkoreit was born in Rostock (Germany) in 1950. He studied Linguistics and Computer Science at the Technical University of Berlin. While he was studying there, he worked as an editor and writer for the magazine Zitty, which he had co-founded. He was then awarded a Fulbright Grant and continued his studies at the University of Texas at Austin. The Fulbright Grant was created by J. William Fulbright (senator for Arkansas at the time of the Second World War) with the aim of helping people from Europe and the United States understand one another better. It was also a good way to encourage tolerance and understanding between countries.

Returning to Hans Uszkoreit’s stay at Austin, he not only studied there but also worked on a machine translation project at the Linguistics Research Center. In 1984 he received his Ph.D. in linguistics from the University of Texas. Since then, he has been involved in many activities and is a member of a wide range of associations (European Academy of Sciences, European Network of Language and Speech …)


Nowadays, he is a Professor of Computational Linguistics at Saarland University and also serves as Scientific Director at the German Research Center for Artificial Intelligence (DFKI), where he heads the DFKI Language Technology Lab. Moreover, he has written many publications, as well as poems in English and in German. In conclusion, Hans Uszkoreit is an excellent professor and researcher, so people should take him into account when talking about linguistics and computer science.

Yorick Wilks (Q1)

If you are involved in the world of computer science, you should know about Yorick Wilks. He is a computer scientist who was born in the United Kingdom in 1939. He is married and has four children. He first attended Torquay Boys’ Grammar School, a prestigious single-sex school in Devon. He then went to the University of Cambridge, where he received his M.A. and his PhD in 1968.


But why is Yorick Wilks so important in this field? The answer is quite simple. First of all, he has carried out many projects on the understanding of natural language content by computers. Moreover, he has written several publications throughout his life, and his career is extraordinary. Nowadays, he is a professor at the University of Sheffield, where he directs the Institute for Language, Speech and Hearing.

Finally, there is an excellent article about him on Wikipedia, where one can read his biography and a list of his main publications, among other things. Do not hesitate to visit it.


Research Centres for Human Language Technologies. (Q1)

There are lots of research centres for Human Language Technologies all over the world. First, I am going to explain what Human Language Technology means. It is also called Language Technology or Natural Language Processing. Broadly, it is a field of computer science whose aim is the interaction between machines and people.

The research centres do a great job with their projects, and I would like to mention some that have caught my attention. To begin with, there is one in South Africa called the Meraka Institute. This institute claims that Human Language Technology can help everyone, from illiterate farmers to scientists. It is carrying out a project involving many researchers who study how to develop this technology to benefit the people of southern Africa.

HLT Group- Meraka Institute.


Another important research centre is the German Research Center for Artificial Intelligence. It works to improve language technology with the help of novel computational techniques. Basically, it conducts research in three different areas: Information and Knowledge Management, Document Production and Natural Communication.

Finally, this last research centre is carrying out many projects at the moment, such as DILIA (The Intelligent Library Assistant) and ConQA (Controlled Semantic-based Question Answering), apart from its commercial activities, such as the indexing of German and English texts using a software package.



