Monday, August 19, 2013

Linear Regression Revisited



If she loves you more each and every day, by linear regression she hated you before you met.






- Your theory is wrong!
- Out, liar! 


Sunday, August 18, 2013

Econometrics Beat: Dave Giles' Blog: Large and Small Regression Coefficients

Econometrics Beat: Dave Giles' Blog: Large and Small Regression Coefficients: Here's a trap that newbies to regression analysis have been known to fall into. It's to do with comparing the numerical values of t...

Generating LaTeX Tables in R


jbytecode

August 18, 2013

Many R users prepare reports, documents or research papers using the LaTeX typesetting system. Since typing tabular data in LaTeX by hand takes too much time, several packages were developed to convert R datasets to LaTeX tables automatically. In this short entry, we show the usage and output of the xtable function in the xtable package. First install the package if it is not already installed:

> install.packages("xtable")

After installation, the package is loaded once in the current session:

> require("xtable")

Now, suppose that we have a data frame and we want to export it as a LaTeX table. Let’s create a random dataset with three variables, x, y and z.

> x<-round(runif(10,0,100)) 
> y<-round(runif(10,0,100)) 
> z<-round(runif(10,0,100)) 
> data <- as.data.frame ( cbind(x,y,z) ) 
> data 
    x   y  z 
1  26  86 44 
2  81  13 22 
3  39  27 57 
4  32  57 56 
5  50  15 31 
6  34 100 98 
7  54  46 24 
8  42  55 42 
9  62  91 77 
10  5  73 25

Calling the function xtable simply on the data that we have just created produces some output:

> xtable(data)
% latex table generated in R 2.15.3 by xtable 1.7-1 package  
% Sun Aug 18 23:40:23 2013  
\begin{table}[ht]  
\centering  
\begin{tabular}{rrrr}  
  \hline  
 & x & y & z \\  
  \hline  
1 & 26.00 & 86.00 & 44.00 \\  
  2 & 81.00 & 13.00 & 22.00 \\  
  3 & 39.00 & 27.00 & 57.00 \\  
  4 & 32.00 & 57.00 & 56.00 \\  
  5 & 50.00 & 15.00 & 31.00 \\  
  6 & 34.00 & 100.00 & 98.00 \\  
  7 & 54.00 & 46.00 & 24.00 \\  
  8 & 42.00 & 55.00 & 42.00 \\  
  9 & 62.00 & 91.00 & 77.00 \\  
  10 & 5.00 & 73.00 & 25.00 \\  
   \hline  
\end{tabular}  
\end{table}

The generated output can easily be pasted into a LaTeX file. The rendered table looks like this:






        x       y      z
 1  26.00   86.00  44.00
 2  81.00   13.00  22.00
 3  39.00   27.00  57.00
 4  32.00   57.00  56.00
 5  50.00   15.00  31.00
 6  34.00  100.00  98.00
 7  54.00   46.00  24.00
 8  42.00   55.00  42.00
 9  62.00   91.00  77.00
10   5.00   73.00  25.00





This is an example of an xtable function call with default parameters. In the next example, we add a caption.

> xtable(data,caption="Our random dataset")






        x       y      z
 1  26.00   86.00  44.00
 2  81.00   13.00  22.00
 3  39.00   27.00  57.00
 4  32.00   57.00  56.00
 5  50.00   15.00  31.00
 6  34.00  100.00  98.00
 7  54.00   46.00  24.00
 8  42.00   55.00  42.00
 9  62.00   91.00  77.00
10   5.00   73.00  25.00





Table 1: Our random dataset

Let’s put a label on it:

> xtable(data,caption="Our random dataset", 
                          label="This is a label")






        x       y      z
 1  26.00   86.00  44.00
 2  81.00   13.00  22.00
 3  39.00   27.00  57.00
 4  32.00   57.00  56.00
 5  50.00   15.00  31.00
 6  34.00  100.00  98.00
 7  54.00   46.00  24.00
 8  42.00   55.00  42.00
 9  62.00   91.00  77.00
10   5.00   73.00  25.00





Table 2: Our random dataset

And this function call shows the numbers with three fractional digits.

> xtable(data,caption="Our random dataset", 
                label="This is a label", 
                digits=3)






         x        y       z
 1  26.000   86.000  44.000
 2  81.000   13.000  22.000
 3  39.000   27.000  57.000
 4  32.000   57.000  56.000
 5  50.000   15.000  31.000
 6  34.000  100.000  98.000
 7  54.000   46.000  24.000
 8  42.000   55.000  42.000
 9  62.000   91.000  77.000
10   5.000   73.000  25.000





Table 3: Our random dataset

Sorting Multi-Column Datasets in R



August 18, 2013

In this entry, we present the most straightforward way to sort multi-column datasets.

Suppose that we have a vector in three-dimensional space. This vector can be defined as

> a <- c(3,1,2)

in R. The built-in function sort sorts a given vector in ascending order by default. An ordered version of vector a is calculated as shown below:

> sorted_a <- sort(a)

The result is

[1] 1 2 3

For reverse ordering, the decreasing parameter must be set to TRUE.

> sort(a,decreasing=TRUE) 
[1] 3 2 1

Another related function, order, returns the indices of the elements in sorted order.

> a <- c(3,1,2) 
> order (a) 
[1] 2 3 1

The example above shows that if the vector a is sorted in ascending order, the second element of a is placed at index 1, the third element at index 2, and the first element at index 3. The result of order can therefore be used to sort data:

> a <- c(3,1,2) 
> o <- order (a) 
> a[o] 
[1] 1 2 3

Suppose that we need to sort a matrix by a chosen column. The solution is easy: get the ordering indices of that column and use the same method as in the example above.

> x <- round(runif(5, 0, 100)) 
> y <- round(runif(5, 0, 100)) 
> z <- round(runif(5, 0, 100)) 
> data <- cbind(x,y,z) 
> data 
      x  y  z 
[1,] 48 35 75 
[2,] 40 21 43 
[3,] 58 69  1 
[4,] 49 38  2 
[5,] 43 66 46

Now, get the indices of sorted z:

> o <- order(z) 
> data[o,] 
      x  y  z 
[1,] 58 69  1 
[2,] 49 38  2 
[3,] 40 21 43 
[4,] 43 66 46 
[5,] 48 35 75

The rows of the dataset data are now sorted by the column z.

Saturday, August 17, 2013

A User Document For RCaller

A new research paper that serves as RCaller documentation is freely available at http://www.sciencedomain.org/abstract.php?iid=550&id=6&aid=4838#.U5YSoPmSy1Y




RCaller: A library for calling R from Java

by M.Hakan Satman

August 17, 2013



Abstract

RCaller is an open-source, compact, and easy-to-use library for calling R from Java. It offers an elegant solution for the task, and its simplicity is key for non-programmers or programmers who are not familiar with the internal structure of R. Since R is not only a statistical software package but also an enormous collection of statistical functions, accessing its functions and packages is of tremendous value. In this short paper, we give a brief introduction to the most widely used methods for calling R from Java and highlight some properties of RCaller with short examples. User feedback has shown that RCaller is an important tool in many cases where performance is not a central concern.

1 Introduction


R [R Development Core Team(2011)] is an open source and freely distributed statistics software package for which hundreds of external packages are available. The core functionality of R is written mostly in C and wrapped by R functions which simplify parameter passing. Since R manages the exhaustive dynamic library loading tasks in a clever way, calling an external compiled function is as easy as calling an R function in R. However, integration with JVM (Java Virtual Machine) languages is painful.
The R package rJava [Urbanek(2011a)] provides a useful mechanism for instantiating Java objects, accessing class elements and passing R objects to Java methods in R. This library is convenient for the R packages that rely on external functionality written in Java rather than C, C++ or Fortran.
The library JRI, which is now a part of the package rJava, uses JNI (Java Native Interface) to call R from Java [Urbanek(2009)]. Although JNI is the most common way of accessing native libraries in Java, JRI requires that several system and environment variables are correctly set before any run, which can be difficult for inexperienced users, especially those who are not computer scientists.
The package Rserve [Urbanek(2011b)] uses TCP sockets and acts as a TCP server. A client establishes a connection to Rserve, sends R commands, and receives the results. This way of calling R from other platforms is more general because the handshaking and the protocol initialization are fully platform independent.
Renjin (http://code.google.com/p/renjin) is another interesting project that addresses the problem. It solves the problem of calling R from Java by re-implementing the R interpreter in Java! The project therefore includes the tasks of writing the interpreter and implementing the internals. Renjin is intended to be 100% compatible with the original. However, it is under development and needs help. Nevertheless, an online demo is available and is updated whenever the source code is updated.
Finally, RCaller [RCaller Development Team(2011)] is an LGPL’d library which is very easy to use. It does not do much but wraps the operations well. It requires no configuration beyond installing an R package (Runiversal) and locating the Rscript binary distributed with R. Although it is known to be relatively inefficient compared to other options, its latest release features significant performance improvements.

2 Calling R Functions


Calling R code from other languages is not trivial. R includes a huge collection of math and statistics libraries with nearly 700 internal functions and hundreds of external packages. No comparable library exists in Java. Although libraries such as the Apache Commons Math [Commons Math Developers(2010)] do provide many classes for those calculations, their scope is quite limited compared to R. For example, it is not easy to find a Java library that calculates quantiles and probabilities of non-central distributions. [Harner et al.(2009)Harner, Luo, and Tan] affirm that using R’s functionality from Java saves the user from writing duplicate code in statistical software.
RCaller is another open source library for performing R operations from within Java applications in a wrapped way. RCaller prepares R code using the user input. The user input is generally a Java array, a plain Java object or the R code itself. It then creates an external R process by running the Rscript executable. It passes the generated R code to that process and receives the output as XML documents. While the process is alive, the standard output and standard error streams are handled by an event-driven mechanism. The returned XML document is then parsed and the returned R objects are extracted to Java arrays.
The short example given below creates two double vectors, passes them to R, and returns the residuals calculated from a linear regression estimation.
RCaller caller = new RCaller();
RCode code = new RCode();
double[] xvector = new double[]{1,3,5,3,2,4};
double[] yvector = new double[]{6,7,5,6,5,6};

caller.setRscriptExecutable("/usr/bin/Rscript");

code.addDoubleArray("X", xvector);
code.addDoubleArray("Y", yvector);
code.addRCode("ols <- lm ( Y ~ X )");

caller.setRCode(code);

caller.runAndReturnResult("ols");

double[] residuals =
   caller.getParser().
     getAsDoubleArray("residuals");  

The lm function returns an R list with a class of lm whose elements are accessible with the $ operator. The method runAndReturnResult() takes the name of an R list which contains the desired results. Finally, the method getAsDoubleArray() returns a double vector with values filled from the vector residuals of the list ols.
RCaller uses the R package Runiversal [Satman(2010)] to convert R lists to XML documents within the R process. This package includes the function makexml(), which takes an R list as input and returns the XML document as a string. Although some R functions return their results in other types and classes of data, those results can be returned to the JVM indirectly. Suppose that obj is an S4 object with members member1 and member2. These members are accessible with the @ operator, as in obj@member1 and obj@member2. They can be returned to Java by constructing a new list such as result<-list(m1=obj@member1, m2=obj@member2).
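To make the last step of this pipeline concrete, the following is a minimal sketch of how a named numeric vector could be extracted from an XML string with the JDK's DOM parser. The XML layout (variable elements with a name attribute and v children) and the class name XmlVectorReader are assumptions made for illustration; they are not taken from the Runiversal or RCaller sources.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of RCaller. It reads a named numeric vector
// from an XML string laid out as
//   <root><variable name="..."><v>1.5</v>...</variable></root>
// The layout itself is an assumption made for this sketch.
class XmlVectorReader {

    static double[] getAsDoubleArray(String xml, String name) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList variables = doc.getElementsByTagName("variable");
            for (int i = 0; i < variables.getLength(); i++) {
                Element variable = (Element) variables.item(i);
                if (!name.equals(variable.getAttribute("name"))) {
                    continue; // not the variable we were asked for
                }
                NodeList values = variable.getElementsByTagName("v");
                double[] result = new double[values.getLength()];
                for (int j = 0; j < values.getLength(); j++) {
                    result[j] = Double.parseDouble(
                        values.item(j).getTextContent().trim());
                }
                return result;
            }
            throw new IllegalArgumentException("No variable named " + name);
        } catch (IllegalArgumentException e) {
            throw e; // unknown name or malformed number: report as-is
        } catch (Exception e) {
            // Parser configuration, I/O and SAX errors all end up here.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

With this layout, a call such as XmlVectorReader.getAsDoubleArray(xml, "residuals") would play the same role as getParser().getAsDoubleArray("residuals") in the regression example.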

3 Handling Plots


Although the graphics drivers and the internals are implemented in C, most of the graphics functions and packages are written in the R language, which makes the graphics library of R distinctive. RCaller starts a plot with the method startPlot(), which returns a java.io.File reference to the file that will hold the generated plot. The method getPlot() returns an instance of the javax.swing.ImageIcon class which contains the generated image in a fully isolated way. A Java example is shown below:
RCaller caller = new RCaller();
RCode code = new RCode();
File plotFile = null;
ImageIcon plotImage = null;

caller.
setRscriptExecutable("/usr/bin/Rscript");

code.R_require("lattice");

try{
 plotFile = code.startPlot();
 code.addRCode(
      "xyplot(rnorm(100)~1:100, type='l')");
}catch (IOException err){
 System.out.println("Can not create plot");
}

caller.setRCode(code);
caller.runOnly();

plotImage = code.getPlot(plotFile);
code.showPlot(plotFile);

The method runOnly() is quite different from the method runAndReturnResult(). Because the user only wants a plot to be generated, there is nothing returned by R in the example above. Note that more than one plot can be generated in a single run.
Handling R plots with a java.io.File reference is also convenient in web projects. Generated content can easily be sent to clients using output streams opened from the file reference. However, RCaller uses the temp directory and does not delete the generated files automatically. Over time this may cause an OS-level "too many files" error, which cannot be caught by a Java program. Cleaning the generated output with a scheduled task solves this problem.
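One way such a scheduled cleanup could look is sketched below using only the JDK. The class name PlotFileCleaner, the file-name suffix and the age threshold are all assumptions made for illustration; the sketch is not part of RCaller, and the naming pattern of the files that RCaller actually writes should be checked before using anything like it.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of RCaller: removes old plot files
// from a directory, either once or on a fixed schedule.
class PlotFileCleaner {

    // Deletes regular files in dir whose names end with suffix and that were
    // last modified more than maxAgeMillis ago. Returns the number deleted.
    static int deleteOldFiles(File dir, String suffix, long maxAgeMillis) {
        File[] files = dir.listFiles();
        if (files == null) {
            return 0; // not a directory, or not readable
        }
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        int deleted = 0;
        for (File f : files) {
            if (f.isFile() && f.getName().endsWith(suffix)
                    && f.lastModified() < cutoff && f.delete()) {
                deleted++;
            }
        }
        return deleted;
    }

    // Runs the cleanup every periodMillis on a single daemon thread, so the
    // scheduler never keeps the JVM alive on its own.
    static ScheduledExecutorService schedule(File dir, String suffix,
                                             long maxAgeMillis, long periodMillis) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "plot-file-cleaner");
                t.setDaemon(true);
                return t;
            });
        scheduler.scheduleAtFixedRate(
            () -> deleteOldFiles(dir, suffix, maxAgeMillis),
            periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Filtering by suffix and age keeps the task from deleting plots that are still being served or unrelated files that share the temp directory. A web application would typically call schedule() once at startup and shutdown() on the returned scheduler when it stops.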

4 Live Connection


Each time the method runAndReturnResult() is called, an Rscript instance is created to perform the operations. This is the main source of the inefficiency of RCaller. When R commands are called repeatedly, a better approach is to use the method runAndReturnResultOnline(). This method creates an R instance and keeps it running in the background. This approach avoids the time required to create an external process, initialize the interpreter, and load packages in subsequent calls.
The example given below returns the determinants of a given matrix and its inverse in sequence, that is, it uses a single external instance to perform more than one operation.
double[][] matrix =
    new double[][]{{5,4,5},{6,1,0},{9,-1,2}};
caller.setRExecutable("/usr/bin/R");
caller.setRCode(code);

code.clear();
code.addDoubleMatrix("x", matrix);
code.addRCode("result<-list(d=det(x))");
caller.runAndReturnResultOnline("result");

System.out.println(
"Determinant is " +
  caller.getParser().
   getAsDoubleArray("d")[0]
   );

code.addRCode("result<-list(t=det(solve(x)))");
caller.runAndReturnResultOnline("result");

System.out.println(
"Determinant of inverse is " +
  caller.getParser().
   getAsDoubleArray("t")[0]
   );

This use of RCaller is fast and convenient for repeated commands. Since R is not thread-safe, its functions cannot be called by more than one thread at a time. Therefore, each thread must create its own R process to perform calculations simultaneously in Java.
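The one-process-per-thread rule can be enforced with a ThreadLocal, as in the minimal sketch below. The class PerThreadSession is hypothetical and deliberately generic: the session type S stands in for whatever object wraps a live R process (for example a configured RCaller instance), and the sketch does not call RCaller itself.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of RCaller: hands each thread its own
// instance of some session type S (for example, an object wrapping one
// live R process), so that no session is ever shared between threads.
class PerThreadSession<S> {

    private final ThreadLocal<S> sessions;

    PerThreadSession(Supplier<S> factory) {
        // The factory is invoked lazily, once per thread, on first access.
        this.sessions = ThreadLocal.withInitial(factory);
    }

    // Returns the calling thread's own session, creating it if needed.
    S get() {
        return sessions.get();
    }
}
```

In real use the factory would create and configure the object that owns the R process. Shutting those processes down when worker threads finish is left out of the sketch and would need to be handled separately.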

5 Monitoring the Output


RCaller receives the desired content as XML documents. The content is a list of the variables of interest which are manually created by the user or returned automatically by a function. Apart from the generated content, R produces some output to the standard output (stdout) and the standard error (stderr) devices. RCaller offers two options to handle these outputs. The first one is to save them in a text file. The other is to redirect all of the content to the standard output device. The example given below shows a conditional redirection of the outputs generated by R.
if(console){
 caller.redirectROutputToConsole();
}else{
 caller.redirectROutputToFile(
     "output.txt" /* filename */,
     true  /* append? */);
}

6 Conclusion


In addition to being statistical software, R is an extendable library with its internal functions and external packages. Since the R interpreter was written mostly in C, linking to custom C/C++ programs is relatively simple. Unfortunately, calling R functions from Java is not straightforward. The prominent methods use JNI and TCP sockets to solve this problem. In addition, Renjin offers a different perspective on this issue. It is a re-implementation of R in Java which is intended to be 100% compatible with the original. However, it is under development and needs help. Finally, RCaller is an alternative way of calling R from Java. It is packaged in a single jar and it does not require setup beyond the one-time installation of the R package Runiversal. It supports loading external packages, calling functions, handling plots and debugging the output generated by R. It is not the most efficient method compared to the alternatives, but users report that the performance improvements in the latest revision and its simplicity of use make it an important tool in many applications.

References


[Commons Math Developers(2010)]   Commons Math Developers. Apache Commons Math, Release 2.1. Available from http://commons.apache.org/math/download_math.cgi, Apr. 2010. URL http://commons.apache.org/math.
[Harner et al.(2009)Harner, Luo, and Tan]   E. Harner, D. Luo, and J. Tan. JavaStat: A Java/R-based statistical computing environment. Computational Statistics, 24(2):295–302, May 2009.
[R Development Core Team(2011)]   R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. URL http://www.R-project.org/. ISBN 3-900051-07-0.
[RCaller Development Team(2011)]   RCaller Development Team. RCaller: A library for calling R from Java, 2011. URL http://code.google.com/p/rcaller.
[Satman(2010)]   M. H. Satman. Runiversal: A Package for converting R objects to Java variables and XML., 2010. URL http://CRAN.R-project.org/package=Runiversal. R package version 1.0.1.
[Urbanek(2009)]   S. Urbanek. How to talk to strangers: ways to leverage connectivity between R, Java and Objective C. Computational Statistics, 24(2):303–311, May 2009.
[Urbanek(2011a)]   S. Urbanek. rJava: Low-level R to Java interface, 2011a. URL http://CRAN.R-project.org/package=rJava. R package version 0.9-2.
[Urbanek(2011b)]   S. Urbanek. Rserve: Binary R server, 2011b. URL http://CRAN.R-project.org/package=Rserve. R package version 0.6-5.




