Friday, February 27, 2015

Frequency table of characters in a string in R

R's string manipulating functions includes splitting a string. One can think that parsing a string or extracting its characters into an array and generating the frequency information may be used in a language detection system.

Here is a example on a text that is captured from the Oracle - History of Java site. The code below defines a large string. The string is then parsed into its characters. After calculating the frequencies of each single character (including numbers, commas and dots) a histogram is saved in a file.






# Defining string 
s <- "Since 1995, Java has changed our world and our expectations. Today, with technology such a part of our daily lives, we take it for granted that we can be connected and access applications and content anywhere, anytime. Because of Java, we expect digital devices to be smarter, more functional, and way more entertaining. In the early 90s, extending the power of network computing to the activities of everyday life was a radical vision. In 1991, a small group of Sun engineers called the \"Green Team\" believed that the next wave in computing was the union of digital consumer devices and computers. Led by James Gosling, the team worked around the clock and created the programming language that would revolutionize our world – Java. The Green Team demonstrated their new language with an interactive, handheld home-entertainment controller that was originally targeted at the digital cable television industry. Unfortunately, the concept was much too advanced for the them at the time. But it was just right for the Internet, which was just starting to take off. In 1995, the team announced that the Netscape Navigator Internet browser would incorporate Java technology.Today, Java not only permeates the Internet, but also is the invisible force behind many of the applications and devices that power our day-to-day lives. From mobile phones to handheld devices, games and navigation systems to e-business solutions, Java is everywhere!"

# First converting to lower case
# then splitting by characters.
# strsplit return a list, we are unlisting to a vector.
chars <- unlist(strsplit(tolower(s), ""))

# Generating frequency table
freqs <- table(chars)

# Generating plot into a file
png("Graph.png")
hist(freqs,include.lowest=TRUE, breaks=46,freq=TRUE,labels=rownames(freqs))
dev.off()



The generated output is


We translate the text used in our example to Spanish using Google Translate site. The code is shown below:

# Defining string 
s <- "Desde 1995, Java ha cambiado nuestro mundo y nuestras expectativas. Hoy en día, con la tecnología de una parte de nuestra vida cotidia    na tal, damos por sentado que se puede conectar y acceder a las aplicaciones y contenido en cualquier lugar ya cualquier hora. Debido a Java    , esperamos que los dispositivos digitales para ser más inteligente, más funcional, y de manera más entretenida. A principios de los años 90    , que se extiende el poder de la computación en red para las actividades de la vida cotidiana era una visión radical. En 1991, un pequeño gr    upo de ingenieros de Sun llamado \"Green Team\" cree que la próxima ola de la informática fue la unión de los dispositivos digitales de cons    umo y ordenadores. Dirigido por James Gosling, el equipo trabajó durante todo el día y creó el lenguaje de programación que revolucionaría e    l mundo - Java. El Equipo Verde demostró su nuevo idioma con una mano controlador interactivo, el entretenimiento en casa que fue dirigido o    riginalmente a la industria de la televisión digital por cable. Por desgracia, el concepto fue demasiado avanzado para el ellos en el moment    o. Pero fue justo para Internet, que estaba empezando a despegar. En 1995, el equipo anunció que el navegador de Internet Netscape Navigator     incorporaría Java technology.Today, Java no sólo impregna el Internet, pero también es la fuerza invisible detrás de muchas de las aplicaci    ones y dispositivos que alimentan nuestra vida del día a día. Desde teléfonos móviles para dispositivos de mano, juegos y sistemas de navega    ción para e-business soluciones, Java está en todas partes!"

# First converting to lower case
# then splitting by characters.
# strsplit return a list, we are unlisting to a vector.
chars <- unlist(strsplit(tolower(s), ""))

# Generating frequency table
freqs <- table(chars)

# Generating plot into a file
png("Graph.png")
hist(freqs,include.lowest=TRUE, breaks=46,freq=TRUE,labels=rownames(freqs))
dev.off()


The generated plot for the Spanish translation is:


Note that some characters such as space, dots and commas can be replaced from the string s using a code similar to this:

news <- gsub(pattern=c(" "), "", x=s)

The code above removes all spaces from the string s and a new string variable news holds the modified string. Original variable remains same. 


Have a nice read!



No comments:

Post a Comment

Thanks