Showing posts with label rcpp. Show all posts
Showing posts with label rcpp. Show all posts

Thursday, March 19, 2015

Why is R awesome?

For someone it is a magic, somebody hates its notation (maybe you!),  it has some weird rules and maybe it is just a programming language like others (That is also my opinion). As the other programming languages, R has its good and bad properties but I can say it is the best candidate as a toolbox of a statistician or researchers who work on data analysis.

In this blog post, I collect 8 (from 0 to 7) nice properties of R. As a lecturer and researcher, I experienced that many students are more capable to understand some statistical concepts when I try to show and get them work using Monte Carlo simulations.  In R, we are able to write compact codes to demonstrate these concepts which would be difficult to implement in an other programming language. R is not a simple toy, so we are always capable to enhance our knowledge, programming skills and get capabilities of writing better codes by introducing external codes that are written in real programming languages (an old joke of real man which uses C).


So, if it is, why is R awesome ?



0. Syntax of Algol Family

R has a weird assign operator but the remaining part is similar to Algol family languages such as C, C++, Java and C#.  R has a similar facility of operator overloading (yes, it is not exactly the operator overloading), in other terms, single or compound character of symbols can be assigned to function names like this:


> '%_%' <- function(a,b){
+    return(exp(a+b))
+ }
> 5 %_% 2
[1] 1096.633


1. Vectors are primitive data types

Yes, vectors are also primitives with an opening and a closing bracket in other members of Algol. In C/C++ they are arrays of primitives and objects in Java. Contrary this, binary operators are directly applicable on the vectors and matrices in R.  For example estimation of least squares coefficients is a single line expression in R as:


> assign("x",cbind(1,1:30))
> assign("y",3+3*x[,2]+rnorm(30))
> solve(t(x) %*% x) %*% t(x) %*% y
         [,1]
[1,] 2.858916
[2,] 3.003787

This example shows the differences between a scaler and a vector:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
> assign("x", c(1,2,3))
> assign("a", 5)
> typeof(x)
[1] "double"
> typeof(a)
[1] "double"
> class(x)
[1] "numeric"
> class(a)
[1] "numeric"

No difference!


2. Theorems get alive in minutes

Suppose that X is a random variable that follows an Exponential Distribution with ratio = 5.
Sum or mean of randomly selected samples with size of N follows a normal distribution.  This is an explanation of the Central Limit Theorem with an example. Theorems are theorems. But you may see a fast demonstration (and probably a proof for educational purposes only) and try to write a rapid application. A process of writing a code like this takes minutes if you use R.


> assign("nsamp", 5000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
> 
> for (i in 1:nsamp){
+     sums[i] <- sum(rexp(n = n, rate = theta)) 
+ }
> hist(sums)




3. There is always a second plan for faster code

Now suppose that we are drawing 50,000 samples randomly using the code above. What would be the computation time?


> assign("nsamp", 50000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
> 
> s <- system.time(
+     for (i in 1:nsamp){
+         sums[i] <- sum(rexp(n = n, rate = theta)) 
+     }
+ )
> 
> print(s)
   user  system elapsed 
  0.582   0.000   0.572 




Drawing 50,000 samples with size 100 takes 0.582 seconds. Is it now fast enough? Lets try to write it in C++ !


#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector CalculateRandomSums(int m, int n) {
   NumericVector result(m);
   int i;
   for (i = 0; i < m; i++){
     result[i] = sum(rexp(n, 5.0));
   }
   return(result);
}


After compiling the code within Rcpp, we can call the function CalculateRandomSums() from R.


> s <- system.time(
+ vect <- calculaterandomsums(50000,100)
> print(s)
   user  system elapsed 
  0.185   0.000   0.184 

Now our R code is 3.145946 times slower than the code written in C++.


4. Interaction with C/C++/Fortran is enjoyable

Since a huge amount of R is written in C, migration of old C libraries is easy by writing wrapper methods using SEXP data types. Rcpp masks these routines in a clever way. Fortran code is also
linkable. Interaction with other languages makes use of old libraries in R and enables the possibility of writing faster new libraries.  It is also possible to create instances of R in C and C++ applications.
For an enjoyable example, have a look at the section 3. There is always a second plan for faster code.
The R package eive includes a small portion of C++ code and it is a compact example of calling C++ functions from within R. Accessing C++ objects from R is also possible thank to Rcpp. Click here to see the explanation and an example.


5. Interaction with Java

Calling Java from R (rJava) and calling R from Java (JRI, RCaller) are all possible. Renjin has a different concept as it is the R interpreter written in Java (Another possibility of calling R from Java , huh?).  A detailed comparison of these method is given in this documentation and this.


6. Sophisticated variable scoping

In R, functions have their own variable scopes and accessing variables at the top level is possible. Addition to this, variable scoping is handled by standard R lists (specially they are called environments) and in any side of code user based environments can be created. For detailed information visit Environment in R.


7. Optional Object Oriented Programming (O-OOP) 

R functions take values of variables as parameters rather than their addresses. If a vector with size of 10,0000 is passed through a function, R first copies this vector then passes it to the function. After body of the function is performed, the copied parameter is then labeled as free for later garbage collecting. As C/C++ programmers know, passing objects with their addresses rather than their values is a good solution for using less memory and spending less computation time. Reference classes in R are passed to functions with their addresses in a way similar to passing C++ references and Java objects to functions and methods:



 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
Person <- setRefClass(
    Class = "Person",
    fields = c("name","surname","email"),
    methods = list(
        initialize = function(name, surname, email){
            .self$name <- name
            .self$surname <- surname
            .self$email <- email
        },
        
        setName = function(name){
            .self$name <- name
        },
        
        setSurname = function(surname){
            .self$surname <- surname
        },
        
        setEMail = function (email){
            .self$email <- email
        },
        
        toString = function (){
            return(paste(name, " ", surname, " ", email))
        }   
    ) # End of methods
) # End of class



p <- Person$new("John","Brown","brown@server.org")
print(p$toString())

The output is

[1] "John   Brown   brown@server.org"

Java and C++ programmers probably like this notation!


Have a nice read!


Monday, March 16, 2015

Compact Genetic Algorithms with R


Compact Genetic Algorithm (CGA) is a member of Genetic Algorithms (GAs) and also Estimation of Distribution Algorithms (EDAs). Since it is based on a single chromosome rather than a population of chromosomes, it is compact.

For detailed information, research papers [1] and [2] present a complete and a brief documentations, respectively.

In this blog post, we give an example of use of compact genetic algorithms on ONEMAX function. ONEMAX function takes n-bits as parameters and returns the number of ones as integer. Since it is only one local optimum when all of the bits equal to 1, it is called ONEMAX.

First of all, we load the R package eive which includes the wrapped C++ function cga.

> require("eive")

The other step is to define the ONEMAX function.

> ONEMAX <- function (x){
+     return(-sum(x))
+ }

Now we write the main part, optimization with cga:

> result <- cga(chsize = 10 , popsize = 100 , evalFunc = ONEMAX)
> result
 [1] 1 1 1 1 1 1 1 1 1 1

The result is a vector in which the bits are all equal to 1.

The most important issue in this example is speed, because the algorithm is implemented in C++ and wrapped using Rcpp to be called within R.

Here is the example of 1000 bits and the time consumed by the cga function call:

> system.time(
+     result <- cga(chsize = 1000,popsize = 100,evalFunc = ONEMAX)
+ )
   user  system elapsed 
  0.443   0.000   0.433 
> ONEMAX(result)
[1] -994

This result seems to be considerably fast and 994 of 1000 bits are found as 1 by the function in 0.433 seconds. Lets increase the population size from 100 to 200:

> system.time(
+     result <- cga(chsize = 1000,popsize = 200,evalFunc = ONEMAX)
+ )
   user  system elapsed 
  0.891   0.000   0.866 
> print (ONEMAX(result))
[1] -1000

Now, after setting the population size from 100 to 200, function doubles the time consumed to 0.866 seconds. But this time, 1000 of 1000 bits are 1, and the optimal solution is reached.

Have a nice read !



[1] Harik, Georges R., Fernando G. Lobo, and David E. Goldberg. "The compact genetic algorithm." Evolutionary Computation, IEEE Transactions on 3.4 (1999): 287-297.

[2] Satman, M. Hakan, and Erkin Diyarbakirlioglu. "Reducing errors-in-variables bias in linear regression using compact genetic algorithms." Journal of Statistical Computation and Simulation ahead-of-print (2014): 1-20.


Accessing C++ objects from R using Rcpp

Rcpp (Seemless R and C++ integration) package for R provides an easy way of combining C++ and R code. Since R is an interpreter, a bulk of code would probably run at least 2 times slower than its counterpart written in C++. Speed is the most concerning issue many times, however, the main purpose of using C++ would be using an old native library with R.

In this post blog, we give an example of accessing a C++ class from within R using Rcpp. This C++ class is defined with name MyClass and has two private double typed variables. This class also has getter and setter methods for its private fields.

MyClass is defined as the code shown below:


#include <Rcpp.h>

using namespace Rcpp;
using namespace std;


class MyClass {
  private:
    double a,b;
  
  public:
    MyClass(double a, double b);
    ~MyClass();
    void setA(double a);
    void setB(double b);
    double getA();
    double getB();
};



MyClass has its private double typed variables a and b, a constructor, a destructor, getter and setter methods for a and b, respectively. The implementation of MyClass is given below:




MyClass::MyClass(double a, double b){
    this->a = a;
    this->b = b;
}

MyClass::~MyClass(){
    cout << "Destructor called" << std::endl;
}

void MyClass::setA (double a){
    this->a = a;
}

void MyClass::setB (double b){
    this->b = b;
}

double MyClass::getA(){
    return(this->a);
}

double MyClass::getB(){
    return(this->b);
}


MyClass is defined nearly minimal. Since it is a C++ class it is not directly accessable from R. In this example, we write some wrapper methods to create instances of MyClass and return their addresses in memory to perform later function calls. In other terms, in R side, we register address of C++ objects to access them. 


// [[Rcpp::export]]
long class_create(double a, double b){
    MyClass *m =  new MyClass(a,b);
    class_print((long) m);
    return((long)m);
}

The method class_create is a C++ method and it has a special comment which will be used by Rcpp before compiling. After compiling process, class_create wrapper R function will be created to call its C++ counterpart. This function create an instance of class_create with given double typed values and returns the address of created object in type long integer.  Here is the other wrapper functions:


// [[Rcpp::export]]
void class_print(long addr){
    MyClass *m = (MyClass*)addr;
    cout << "a = " << m->getA() << " b = " << m->getB() << "\n";   
}


// [[Rcpp::export]]
void class_destroy(long addr){
    MyClass *m = (MyClass*) addr;
    delete m;
}

// [[Rcpp::export]]
void class_set_a(long addr, double a){
    MyClass *m = (MyClass*) addr;
    m->setA(a);
}

// [[Rcpp::export]]
void class_set_b(long addr, double b){
    MyClass *m = (MyClass*) addr;
    m->setB(b);
}

// [[Rcpp::export]]
double class_get_a(long addr){
    MyClass *m = (MyClass*) addr;
    return(m->getA());
}

// [[Rcpp::export]]
double class_get_b(long addr){
    MyClass *m = (MyClass*) addr;
    return(m->getB());
}

Suppose the whole code is written in a file classcall.cpp.  In R side, this code can be compiled and tested as shown below:

> require("Rcpp")
Loading required package: Rcpp
> Rcpp::sourceCpp('rprojects/classcall.cpp')
> myobj <- class_create(3.14, 7.8)
a = 3.14 b = 7.8
> myobj
[1] 104078752
> class_set_a(myobj,100)
> class_set_b(myobj,500)
> class_print(myobj)
a = 100 b = 500

> class_get_a(myobj)
[1] 100
> class_get_b(myobj)
[1] 500
> class_destroy(myobj)
Destructor called


Have a nice read!