Sunday, September 6, 2015

Deleting a page of a pdf file in Ubuntu

Suppose you have a PDF file with many pages and you want to delete a single page (or a list of pages) from it in Ubuntu. First install the pdftk package using Ubuntu's package manager, or simply type

sudo apt-get install pdftk

in the command line.

Now suppose that you want the 4th page to be excluded from the PDF. You can type

pdftk old.pdf cat 1-3 5-end output new.pdf

where old.pdf is the original PDF and new.pdf is the new file with the 4th page excluded.
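
The same command handles a list of pages as well. For example, assuming you want pages 4 and 7 removed (an illustrative case, not from the original post), you can keep the ranges around them:

pdftk old.pdf cat 1-3 5-6 8-end output new.pdf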

Hope this helps.

Wednesday, September 2, 2015

Parallel Numeric Integration using Python Multiprocessing Library

Parallel algorithms divide a problem across several cores. Some problems, such as matrix computations and numerical function integration, can easily be split into smaller independent parts. For integration this follows from the additivity of the integral: the integral of a function over a < x < b plus the integral over b < x < c equals the integral over a < x < c, so the partial numerical integrals can simply be summed.

In Python, the Thread class can be extended by a user-defined class to build multi-threaded applications. However, because of the Global Interpreter Lock, a multi-threaded solution can take even longer, no matter how many cores the CPU has. The solution is to use the multiprocessing library, which runs the workers in separate processes.

Suppose that we have the standard normal density function coded in Python:


 import math

 def normal(x):
   # standard normal probability density function
   return (1 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * math.pow(x, 2))
   
   


Now we define an integrator function that takes as parameters the function, the bounds of the domain, and a Queue object for storing the result:


 def Integrator(func, a, b, lstresults):
   # approximate the integral of func over [a, b] with a left Riemann sum
   epsilon = 0.00001
   mysum = 0.0
   i = float(a)
   while i < b:
     mysum += func(i) * epsilon
     i += epsilon
   # put this worker's partial result into the shared Queue
   lstresults.put(mysum)
   
   


The Integrator function simply integrates the given function from a to b. It uses a single core even if the computer has more. However, the operation can be divided into several smaller parts. We now define a MultipleIntegrator class that distributes the work across n parts, where n is a user-defined integer, typically the number of processes (cores) to use.


 class MultipleIntegrator:
   def __init__(self):
     "Init"

   def do(self, func, a, b, cores):
     self.cores = cores
     # width of the sub-interval handled by each worker process
     integrationrange = (float(b) - float(a)) / float(cores)
     self.integrators = [None] * cores
     mya = float(a)
     allresults = Queue()
     result = 0.0
     for i in range(0, cores):
       # each process integrates its own sub-interval [mya, mya + integrationrange]
       self.integrators[i] = Process(target=Integrator, args=(func, mya, mya + integrationrange, allresults,))
       self.integrators[i].start()
       mya += integrationrange

     # wait for all worker processes to finish
     for myIntegrator in self.integrators:
       myIntegrator.join()

     # sum up the partial integrals collected in the Queue
     while not allresults.empty():
       result += allresults.get()

     return result
   


When we instantiate this class and call its do method with a user-defined function, the integration of func is shared across several cores.
Let's integrate the normal function from -10 to 10 using 2 processes:


 m = MultipleIntegrator()  
 print(m.do(normal, -10, 10,2))  
   


The result and the computation time are:

1.0000039893377686

real    0m1.762s
user    0m3.384s
sys     0m0.024s



If we increase the number of processes to 4:


 m = MultipleIntegrator()  
 print(m.do(normal, -10, 10,4))  
   


The computation time is reported as:



1.0000039894435513

real    0m1.364s
user    0m5.056s
sys     0m0.028s




which shows the speed-up obtained from multiprocessing. The whole example is given below:


 from multiprocessing import Process,Queue  
 import math  
   
   
 def normal(x):  
     return ((1/math.sqrt(2*math.pi)) * math.exp(-0.5 * math.pow(x,2)))  
   
   
 def Integrator(func, a, b, lstresults):  
     epsilon = 0.00001  
     mysum = 0.0  
     i = float(a)  
     while(i < b):  
         mysum += func(i) * epsilon  
         i += epsilon  
     lstresults.put(mysum)  
   
   
 class MultipleIntegrator:  
     def __init__(self):  
         "Init"      
   
     def do(self, func, a, b, cores):  
         self.cores = cores  
         integrationrange = (float(b) - float(a)) /float(cores)  
         self.integrators = [None]*cores  
         mya = float(a)  
         allresults = Queue()  
         result = 0.0  
         for i in range(0,cores):  
             self.integrators[i] = Process(target=Integrator, args=(func, mya, mya + integrationrange,allresults,))  
             self.integrators[i].start()  
             mya += integrationrange  
   
         for myIntegrator in self.integrators:  
             myIntegrator.join()  
   
               
         while(not allresults.empty()):  
             result += allresults.get()  
   
         return(result)  
   
               
   
       
   
 m = MultipleIntegrator()  
 print(m.do(normal, -10, 10,4))  
   
   



Note that the standard normal function is a probability density function, so its integral over the whole real line cannot exceed 1; the values slightly above 1 come from the crude left Riemann sum and accumulated floating-point rounding error. Several corrections could be added to Integrator to reduce these issues.
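
As one possible correction (our own sketch, not code from the original post), the step size and the evaluation points can be computed from an integer step count instead of repeatedly adding epsilon to the loop variable, which avoids accumulating rounding error; evaluating at interval midpoints also reduces the bias of the left Riemann sum:

 def MidpointIntegrator(func, a, b, lstresults, steps=1000000):
   # fixed number of steps: each evaluation point x is computed directly
   # from the integer index k, so rounding error does not pile up in the
   # loop variable as it does with repeated "i += epsilon"
   h = (float(b) - float(a)) / steps
   mysum = 0.0
   for k in range(steps):
     x = a + (k + 0.5) * h  # midpoint of the k-th subinterval
     mysum += func(x) * h
   lstresults.put(mysum)

This function is a drop-in replacement for Integrator in the example above. Also note that on platforms where multiprocessing starts workers with the spawn method (for example Windows), the code that creates the processes should be guarded by if __name__ == "__main__": so that child processes can import the module safely.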

Hope you have fun!

Sunday, August 30, 2015

Overloading Constructor in Python Classes

In Python, the constructor __init__ cannot be overloaded. If you are a Java or C++ programmer, you have probably used this facility before, since even the standard libraries of these languages rely on function overloading.

Although the language does not support constructor overloading, we can follow the factory design pattern to get the effect of multiple constructors in Python. In the factory pattern, we define class-level methods that create new instances from the given parameters. In Python, a function decorated with @classmethod belongs to the class rather than to an instance and receives the class itself as its first argument; it plays a role similar to static factory methods in Java.

Now suppose we want to write a Python class with three constructors. The first constructor sets param1 to a specific value and the others to zero, the second sets param2 and leaves the others at zero, and so on.

 class Clazz:  
   def __init__(self):  
     "please use create1, create2 or create3"  
   
   @classmethod  
   def create1(cls, param1):  
     c = Clazz()  
     c.param1 = param1  
     c.param2 = 0  
     c.param3 = 0  
     return c  
   
   @classmethod  
   def create2(cls, param2):  
     c = Clazz()  
     c.param1 = 0  
     c.param2 = param2  
     c.param3 = 0  
     return c  
   
   @classmethod  
   def create3(cls,param3):  
     c = Clazz()  
     c.param1 = 0  
     c.param2 = 0  
     c.param3 = param3  
     return c  
   
   def dump(self):  
     print("Param1 = %d, Param2 = %d, Param3 = %d" %  
       (self.param1,self.param2,self.param3))  



Of course, we don't need to set the other parameters to zero in every factory method. Let's make the code more compact:


 class Clazz:  
   
   param1 = 0  
   param2 = 0  
   param3 = 0  
   
   def __init__(self):  
     "please use create1, create2 or create3"  
   
   @classmethod  
   def create1(cls, param1):  
     c = Clazz()  
     c.param1 = param1  
     return c  
   
   @classmethod  
   def create2(cls, param2):  
     c = Clazz()  
     c.param2 = param2  
     return c  
   
   @classmethod  
   def create3(cls,param3):  
     c = Clazz()  
     c.param3 = param3  
     return c  
   
   def dump(self):  
     print("Param1 = %d, Param2 = %d, Param3 = %d" %  
       (self.param1,self.param2,self.param3))  
   
   



And now, we can instantiate some objects from this class using different factory methods:


 myc1 = Clazz.create1(5)  
 myc1.dump()  
   
 myc2 = Clazz.create2(10)  
 myc2.dump()  
   
 myc3 = Clazz.create3(50)  
 myc3.dump()  
   
 myc4 = Clazz()  
 myc4.dump()  
   


myc1 is created with the first static factory method. As expected, only the value of the first parameter is changed. The following calls set only the other parameters. The output is

 Param1 = 5, Param2 = 0, Param3 = 0  
 Param1 = 0, Param2 = 10, Param3 = 0  
 Param1 = 0, Param2 = 0, Param3 = 50  
 Param1 = 0, Param2 = 0, Param3 = 0  


As seen in the last line of the output, the object myc4 is created using the __init__ constructor directly, and all the parameters keep their default value of zero.
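
As a side note (our own sketch, not part of the original post), many simple cases of constructor overloading can also be covered by a single __init__ that takes keyword arguments with default values:

 class KwClazz:
   def __init__(self, param1=0, param2=0, param3=0):
     # every parameter defaults to zero; callers set only what they need
     self.param1 = param1
     self.param2 = param2
     self.param3 = param3

   def dump(self):
     print("Param1 = %d, Param2 = %d, Param3 = %d" %
       (self.param1, self.param2, self.param3))

 KwClazz(param2=10).dump()  # prints: Param1 = 0, Param2 = 10, Param3 = 0

The factory-method approach is still useful when the different constructors need different construction logic rather than just different default values.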

Hope you have fun!

Tuesday, August 25, 2015

Lisp for the C++ programmer: Numerical Integration

In the Lisp for the C++ programmer series, we present some Common Lisp examples side by side with equivalent code written in C++ or other languages from the same family.

Here is an example of a Riemann sum in Common Lisp. For simplicity, we first define a function f that takes a single parameter x and returns y = f(x) = x; we will change this function later. The integral of this function from 0 to 1 is known to be 0.5.

The Common Lisp code for numerical integration (an approximate result) is:



 (defun f (x)  
     x  
 )  
 (defun integrate (f start stop)  
     (setq epsilon 0.0001)  
     (setq sum 0.0)  
     (loop for i from start to stop by epsilon do  
         (setq sum (+ sum (* (funcall f i) epsilon)))  
     )  
 sum  
 )  
 (print (integrate 'f 0 1))  

The result is 0.4999532. Now we can use a more complex function, for example the normal density function with zero mean and unit variance. In Common Lisp this function can be defined as


 (defun normal (x)  
     (setq a (* (/ 1 (sqrt (* 2 3.141592))) (exp (* -0.5 (* x x)))))  
 a  
 )  
 (defun integrate (f start stop)  
     (setq epsilon 0.0001)  
     (setq sum 0.0)  
     (loop for i from start to stop by epsilon do  
         (setq sum (+ sum (* (funcall f i) epsilon)))  
     )  
 sum  
 )  
 (print (integrate 'normal -1 1))  



In the code above we integrate the standard normal density from -1 to 1 and the result is 0.6826645, close to the theoretical value of about 0.6827.

Lisp for the C++ programmer: Changing an element of a list

Here is an example of changing an element of a Common Lisp list; the equivalent C++ code is included as Lisp comments.






 ; #include <cstdlib>  
 ; #include <cstdio>  
 ;  
 ; int main(){  
 ;    double *d = (double*) malloc (sizeof(double) * 3);  
 ;    d[0] = 1;  
 ;    d[1] = 2;  
 ;    d[2] = 3;
 ;  
 ;    puts("Current List:");  
 ;    printf("%f %f %f\n", d[0], d[1], d[2]);  
 ;  
 ;    puts("New List:");  
 ;    d[0] = 100;  
 ;    printf("%f %f %f\n", d[0], d[1], d[2]);  
 ;    return(0);  
 ;}  
   
   
 (setq aList (list 1 2 3))  
 (print "Current List:")  
 (print aList)  
   
 (setf (elt aList 0) 100)  
 (print "New List:")  
 (print aList)  
   
   

Lisp for the C++ programmer: for loop

Here is an example of a for loop. The equivalent C++ code is given as comments above the Common Lisp code.








 ; for (int i = 0; i <= 10; i++){
 ;    printf("%d\n", i);
 ;    printf("%d\n", i * 2);
 ; }
   
 (loop for i from 0 to 10  
     do  
     (progn  
         (print i)  
         (print (* i 2))  
     )  
 )  
   

Lisp for the C++ programmer: Function definitions and function calls

Here is a simple Common Lisp example in which a sum and an ArithmeticMean function are defined. Both the C++ and the Common Lisp code calculate the arithmetic mean of 1, 2, 3, 4, 5 and 6, which is 3.5. The C++ code is included as comments in the Common Lisp file.





 ; double sum (double *d, int len){
 ;    double mysum = 0.0;
 ;    for (int i = 0; i < len; i++){
 ;        mysum += d[i];
 ;    }
 ;    return(mysum);
 ; }
 ;
 ; double ArithmeticMean (double *d, int len){
 ;    return ( sum(d, len) / len );
 ; }
 
 ; int main(){
 ;    double *d = (double*) malloc(sizeof(double) * 6);
 ;    for (int i = 0; i < 6; i++){
 ;        d[i] = (double)(i + 1);
 ;    }
 ;    printf("%f\n", ArithmeticMean(d, 6));
 ;    return(0);
 ; }
   
   
   
 (defun sum (aList)  
     (setq mysum 0.0)  
     (dotimes (i (length aList))  
         (setq mysum (+ mysum (nth i aList )))  
     )  
     mysum  
 )  
   
 (defun ArithmeticMean (aList)  
     (/ (sum aList) (length aList))  
 )  
   
 (print (ArithmeticMean '(1 2 3 4 5 6)))  
   

Lisp for the C++ programmer: cond expression

Here is the example for the cond expression of Common Lisp and its C++ equivalent.







 ; int x = 10;  
 ; if (x < 10) {  
 ;    puts("x is smaller than 10");  
 ; } else if (x > 10){  
 ;    puts("x is bigger than 10");  
 ; } else if (x == 10){  
 ;    puts("x equals to 10"));  
 ; }  
   
 (setq x 10)  
   
 (cond  
     ((< x 10) (print "x is smaller than 10"))  
     ((> x 10) (print "x is bigger than 10"))  
     ((= x 10) (print "x equals to 10"))  
 )  
   

Wednesday, March 25, 2015

Singleton Design Pattern in PHP Example

Singleton Pattern
Hello! As we all know, design patterns are used to solve recurring problems in software engineering, and each pattern has its typical role. For example, in the singleton pattern a class exposes a single instance of itself to the rest of the application. In this article I'll show you how to design a PDO database class using the singleton design pattern in PHP.





//singleton design pattern
class DB {
 
 private static $status = NULL;
 private static $info = 'mysql:host=localhost;dbname=dbname';
 private static $username = 'username';
 private static $password = 'password';
 
 private function __construct() { }
 
 public static function connect() {
  if(!self::$status) {
   try {
    self::$status = new PDO(self::$info, self::$username, self::$password);
    self::$status->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    self::$status->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);
   }
   catch (PDOException $e) {
    throw new Exception($e->GetMessage()); 
   }
  }
  return self::$status;
 }
 
 public function __destruct() {
  self::$status = NULL; 
 }
}

First, we define $status as a private static variable, because we want to track whether a connection exists via a single variable. After that, we define the database information as private static variables. Finally, we create our static connection function, connect().

The logic is simple: if $status is still null, the code tries to create a PDO connection; otherwise a connection already exists and the existing instance is returned.
Finally, in the __destruct function, the connection is closed by setting it to null. When we want to connect to the database, we call the singleton like:
DB::connect();
If the class has any other method, it can be called like:
DB::connect()->functionName($param);

See you!

Sunday, March 22, 2015

Introduction to Fuzuli : JFuzuli REPL


JFuzuli is the JVM implementation of our programming language Fuzuli, which is based on LISP syntax and Algol-family programming logic. Fuzuli is a modern combination of these two separate language families.

Let's try JFuzuli:




1. Download the Jar 

The current compiled jar of the JFuzuli interpreter is the 1.0 release candidate. You can download it from https://github.com/jbytecode/fuzuli/releases/tag/v1.0_release_candidate. You can always find the newest releases on the JFuzuli Releases page.

2. Open the Command Prompt

After downloading the jar file, open your operating system's command prompt and navigate to the jar file's directory using the cd (change directory) command.

3. Start trying it!

In command prompt, type

java -jar JFuzuli.jar

to start. You will see the options:

Usage:
java -jar JFuzuli.jar fzlfile
java -jar JFuzuli.jar --repl
java -jar JFuzuli.jar --editor


You can specify a Fuzuli source file to run. The option --repl opens a command shell. The last option --editor opens the GUI. Let's try the command shell.

java -jar JFuzuli.jar --repl
F: 

The prompt F: waits for a valid Fuzuli expression. Now we can try some basic commands:

F: (+ 2 7)
9.0
F: (- 7 10)
-3.0
F: (require "lang.nfl")
0.0
F: (let mylist '(1 2 3))
[1.0, 2.0, 3.0]
F: (first mylist)
1.0
F: (last mylist)
3.0
F: (length mylist)
3
F: (nth mylist 0)
1.0
F: (nth mylist 1)
2.0


Well, we have introduced some basic operators, data types and commands here, but not all of them. An operator or command always follows an opening parenthesis, its arguments follow the operator, and a closing parenthesis ends the expression. This is the well-known syntax of LISP and Scheme. So what are the language's properties, what are the commands, and how can you try more Fuzuli code in JFuzuli? See the project home page below.

Fuzuli Language home page: http://fuzuliproject.org/
Have a nice read!


Thursday, March 19, 2015

Why is R awesome?

For some it is magic, some hate its notation (maybe you!), it has some weird rules, and maybe it is just a programming language like any other (that is also my opinion). Like other programming languages, R has its good and bad sides, but I can say it is the best candidate for the toolbox of a statistician or of researchers who work on data analysis.

In this blog post, I collect 8 (from 0 to 7) nice properties of R. As a lecturer and researcher, I have found that many students understand some statistical concepts better when I show them at work using Monte Carlo simulations. In R, we can write compact code to demonstrate these concepts, which would be more difficult in many other programming languages. R is not a simple toy either: we can always extend our knowledge and programming skills, and write faster code, by plugging in external code written in "real" programming languages (an old joke about real programmers using C).


So, why is R awesome?



0. Syntax of Algol Family

R has a weird assignment operator, but the rest of the syntax is similar to Algol-family languages such as C, C++, Java and C#. R also has a facility resembling operator overloading (it is not exactly operator overloading): user-defined operators built from symbols between percent signs can be bound to functions like this:


> '%_%' <- function(a,b){
+    return(exp(a+b))
+ }
> 5 %_% 2
[1] 1096.633


1. Vectors are primitive data types

Yes, vectors exist in the other Algol-family members too, accessed with opening and closing brackets: in C/C++ they are arrays of primitives, and in Java they are objects. In R, by contrast, vectors are themselves primitive data types and binary operators apply directly to vectors and matrices. For example, estimating least-squares coefficients is a single-line expression in R:


> assign("x",cbind(1,1:30))
> assign("y",3+3*x[,2]+rnorm(30))
> solve(t(x) %*% x) %*% t(x) %*% y
         [,1]
[1,] 2.858916
[2,] 3.003787

This example shows the differences between a scalar and a vector:


> assign("x", c(1,2,3))
> assign("a", 5)
> typeof(x)
[1] "double"
> typeof(a)
[1] "double"
> class(x)
[1] "numeric"
> class(a)
[1] "numeric"

No difference!


2. Theorems come alive in minutes

Suppose that X is a random variable that follows an exponential distribution with rate = 5.
The sum (or mean) of randomly selected samples of size N approximately follows a normal distribution. This is an informal statement of the Central Limit Theorem by example. Theorems are theorems, but you may want to see a quick demonstration (for educational purposes only, not a proof) and write a quick application. Writing code like this takes minutes if you use R.


> assign("nsamp", 5000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
> 
> for (i in 1:nsamp){
+     sums[i] <- sum(rexp(n = n, rate = theta)) 
+ }
> hist(sums)




3. There is always a second plan for faster code

Now suppose that we are drawing 50,000 samples randomly using the code above. What would be the computation time?


> assign("nsamp", 50000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
> 
> s <- system.time(
+     for (i in 1:nsamp){
+         sums[i] <- sum(rexp(n = n, rate = theta)) 
+     }
+ )
> 
> print(s)
   user  system elapsed 
  0.582   0.000   0.572 




Drawing 50,000 samples of size 100 takes 0.582 seconds. Is that fast enough? Let's try writing it in C++!


#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector CalculateRandomSums(int m, int n) {
   NumericVector result(m);
   int i;
   for (i = 0; i < m; i++){
     result[i] = sum(rexp(n, 5.0));
   }
   return(result);
}


After compiling the code within Rcpp, we can call the function CalculateRandomSums() from R.


> s <- system.time(
+     vect <- CalculateRandomSums(50000, 100)
+ )
> print(s)
   user  system elapsed 
  0.185   0.000   0.184 

Now our R code is 3.145946 times slower than the code written in C++.


4. Interaction with C/C++/Fortran is enjoyable

Since a large part of R is written in C, migrating old C libraries is easy by writing wrapper functions using SEXP data types, and Rcpp hides these routines in a clever way. Fortran code is also linkable. Interaction with other languages allows old libraries to be reused from R and makes it possible to write faster new libraries. It is also possible to embed R in C and C++ applications.
For an enjoyable example, have a look at section 3, "There is always a second plan for faster code".
The R package eive includes a small portion of C++ code and is a compact example of calling C++ functions from within R. Accessing C++ objects from R is also possible thanks to Rcpp; see the post "Accessing C++ objects from R using Rcpp" below for an explanation and an example.


5. Interaction with Java

Calling Java from R (rJava) and calling R from Java (JRI, RCaller) are all possible. Renjin takes a different approach: it is an R interpreter written in Java (another way of calling R from Java, huh?). A detailed comparison of these methods is given in the documentation of these projects.


6. Sophisticated variable scoping

In R, functions have their own variable scopes, and variables defined at the top level can still be read from inside a function. In addition, scoping is implemented with environment objects (which behave much like R lists), and user-defined environments can be created anywhere in the code. For detailed information, see the documentation on environments in R.


7. Optional Object Oriented Programming (O-OOP) 

R functions take the values of variables as parameters rather than their addresses. If a vector with, say, 100,000 elements is passed to a function, R conceptually copies the vector and passes the copy; after the function body has run, the copied parameter is marked free for later garbage collection. As C/C++ programmers know, passing objects by address rather than by value is a good way to use less memory and less computation time. Reference classes in R are passed to functions by reference, in a way similar to passing C++ references and Java object references to functions and methods:



Person <- setRefClass(
    Class = "Person",
    fields = c("name","surname","email"),
    methods = list(
        initialize = function(name, surname, email){
            .self$name <- name
            .self$surname <- surname
            .self$email <- email
        },
        
        setName = function(name){
            .self$name <- name
        },
        
        setSurname = function(surname){
            .self$surname <- surname
        },
        
        setEMail = function (email){
            .self$email <- email
        },
        
        toString = function (){
            return(paste(name, " ", surname, " ", email))
        }   
    ) # End of methods
) # End of class



p <- Person$new("John","Brown","brown@server.org")
print(p$toString())

The output is

[1] "John   Brown   brown@server.org"

Java and C++ programmers probably like this notation!


Have a nice read!


Monday, March 16, 2015

Compact Genetic Algorithms with R


The Compact Genetic Algorithm (CGA) is a member of the family of Genetic Algorithms (GAs) and also of Estimation of Distribution Algorithms (EDAs). Since it represents the population by a single probability vector rather than a set of chromosomes, it is compact.

For detailed information, research papers [1] and [2] present a complete and a brief treatment, respectively.

In this blog post, we give an example of using the compact genetic algorithm on the ONEMAX function. The ONEMAX function takes n bits as its argument and returns the number of ones as an integer. Since its single optimum is reached when all of the bits equal 1, it is called ONEMAX.

First of all, we load the R package eive which includes the wrapped C++ function cga.

> require("eive")

The next step is to define the ONEMAX function. Since cga minimizes the objective function, we return the negated sum of the bits, so that maximizing the number of ones corresponds to minimizing the returned value.

> ONEMAX <- function (x){
+     return(-sum(x))
+ }

Now we write the main part, optimization with cga:

> result <- cga(chsize = 10 , popsize = 100 , evalFunc = ONEMAX)
> result
 [1] 1 1 1 1 1 1 1 1 1 1

The result is a vector in which the bits are all equal to 1.

A key point in this example is speed: the algorithm is implemented in C++ and wrapped using Rcpp so that it can be called from within R.

Here is an example with 1000 bits and the time consumed by the cga function call:

> system.time(
+     result <- cga(chsize = 1000,popsize = 100,evalFunc = ONEMAX)
+ )
   user  system elapsed 
  0.443   0.000   0.433 
> ONEMAX(result)
[1] -994

This seems considerably fast: 994 of the 1000 bits are set to 1 by the function in 0.433 seconds. Let's increase the population size from 100 to 200:

> system.time(
+     result <- cga(chsize = 1000,popsize = 200,evalFunc = ONEMAX)
+ )
   user  system elapsed 
  0.891   0.000   0.866 
> print (ONEMAX(result))
[1] -1000

After increasing the population size from 100 to 200, the function roughly doubles the time consumed, to 0.866 seconds. But this time all 1000 bits are 1 and the optimal solution is reached.

Have a nice read !



[1] Harik, Georges R., Fernando G. Lobo, and David E. Goldberg. "The compact genetic algorithm." Evolutionary Computation, IEEE Transactions on 3.4 (1999): 287-297.

[2] Satman, M. Hakan, and Erkin Diyarbakirlioglu. "Reducing errors-in-variables bias in linear regression using compact genetic algorithms." Journal of Statistical Computation and Simulation ahead-of-print (2014): 1-20.


Accessing C++ objects from R using Rcpp

The Rcpp (Seamless R and C++ Integration) package for R provides an easy way of combining C++ and R code. Since R is interpreted, a block of R code will often run at least 2 times slower than its counterpart written in C++. Speed is frequently the main concern; another common reason for using C++ is to reuse an old native library from R.

In this blog post, we give an example of accessing a C++ class from within R using Rcpp. The C++ class is named MyClass and has two private double-typed member variables, together with getter and setter methods for these private fields.

MyClass is declared as shown below:


#include <Rcpp.h>

using namespace Rcpp;
using namespace std;


class MyClass {
  private:
    double a,b;
  
  public:
    MyClass(double a, double b);
    ~MyClass();
    void setA(double a);
    void setB(double b);
    double getA();
    double getB();
};



MyClass has private double-typed variables a and b, a constructor, a destructor, and getter and setter methods for a and b. The implementation of MyClass is given below:




MyClass::MyClass(double a, double b){
    this->a = a;
    this->b = b;
}

MyClass::~MyClass(){
    cout << "Destructor called" << std::endl;
}

void MyClass::setA (double a){
    this->a = a;
}

void MyClass::setB (double b){
    this->b = b;
}

double MyClass::getA(){
    return(this->a);
}

double MyClass::getB(){
    return(this->b);
}


MyClass is defined in a nearly minimal way. Since it is a C++ class, it is not directly accessible from R. In this example, we write wrapper functions that create instances of MyClass and return their memory addresses so that later calls can refer to them. In other words, on the R side we keep the addresses of the C++ objects in order to access them.


// class_print is defined below; declare it here so class_create can call it
void class_print(long addr);

// [[Rcpp::export]]
long class_create(double a, double b){
    MyClass *m = new MyClass(a,b);
    class_print((long) m);
    return((long) m);
}

class_create is a C++ function carrying a special comment that Rcpp processes before compiling. After compilation, an R wrapper function named class_create is generated to call its C++ counterpart. This function creates an instance of MyClass with the given double values and returns the address of the created object as a long integer. Here are the other wrapper functions:


// [[Rcpp::export]]
void class_print(long addr){
    MyClass *m = (MyClass*)addr;
    cout << "a = " << m->getA() << " b = " << m->getB() << "\n";   
}


// [[Rcpp::export]]
void class_destroy(long addr){
    MyClass *m = (MyClass*) addr;
    delete m;
}

// [[Rcpp::export]]
void class_set_a(long addr, double a){
    MyClass *m = (MyClass*) addr;
    m->setA(a);
}

// [[Rcpp::export]]
void class_set_b(long addr, double b){
    MyClass *m = (MyClass*) addr;
    m->setB(b);
}

// [[Rcpp::export]]
double class_get_a(long addr){
    MyClass *m = (MyClass*) addr;
    return(m->getA());
}

// [[Rcpp::export]]
double class_get_b(long addr){
    MyClass *m = (MyClass*) addr;
    return(m->getB());
}

Suppose the whole code is saved in a file named classcall.cpp. On the R side, it can be compiled and tested as shown below:

> require("Rcpp")
Loading required package: Rcpp
> Rcpp::sourceCpp('rprojects/classcall.cpp')
> myobj <- class_create(3.14, 7.8)
a = 3.14 b = 7.8
> myobj
[1] 104078752
> class_set_a(myobj,100)
> class_set_b(myobj,500)
> class_print(myobj)
a = 100 b = 500

> class_get_a(myobj)
[1] 100
> class_get_b(myobj)
[1] 500
> class_destroy(myobj)
Destructor called


Have a nice read!


Saturday, March 14, 2015

SQLite with R - The sqldf package


R's data sorting functions sort and order, the data filtering function which, the vector indexing operator [], the vector and matrix manipulation functions cbind and rbind, and other functions and keywords make data analysis easy in many situations. SQL (Structured Query Language) is used for storing, adding, removing, sorting and filtering data that is kept permanently on disk or in memory.

The R package sqldf builds a SQLite database from an R data.frame object. A data.frame is a matrix-like structure with richer properties in R. In this blog post, we present a basic introduction to the sqldf package and its use in R.

First of all, the package can be installed by typing:

> install.packages("sqldf")

After installation, the package can be loaded by typing:

> require("dftable")
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: DBI


Now let's create two vectors of length 100:

> assign("x", rnorm(100))
> assign("y", rnorm(100))
> assign("mydata", as.data.frame(cbind(x,y)))

We can see the first 6 rows:

> head(mydata)
           x         y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3  0.1913628 0.1578977
4  0.3049607 0.6055087
5  2.3773249 1.1800434
6  0.4641791 1.7143130

Let's perform some SQL statements on this data frame using sqldf:

Averages of x and y

> sqldf("select avg(x), avg(y) from mydata")
     avg(x)   avg(y)
1 0.0790934 0.220756


Number of cases

> sqldf("select count(x), count(y) from mydata")
  count(x) count(y)
1      100      100


First Three Cases

> sqldf("select x,y from mydata limit 3")
           x         y
1 -1.9357660 0.2784369
2 -0.6976428 1.4646022
3  0.1913628 0.1578977


Minimum and Maximum Values

> sqldf("select min(x),max(x),min(y),max(y) from mydata")
     min(x)   max(x)   min(y)   max(y)
1 -2.155768 2.377325 -1.75477 2.531869


First 3 Cases of Ordered Data

> sqldf("select x,y from mydata order by x limit 3")
          x         y
1 -2.155768 0.6614813
2 -1.935766 0.2784369
3 -1.837502 0.1073177
> sqldf("select x,y from mydata order by y limit 3")
          x         y
1 0.7665811 -1.754770
2 0.3373319 -1.736727
3 0.6199159 -1.335649


Insert into 

sqldf does not alter the original data frame. After inserting a new case, a new data.frame is created and returned. In the example below, sqldf takes a vector of two SQL statements as its argument, and the result is accessible under the name main.mydata rather than mydata:

> tail (sqldf( 
+ c(
+ "insert into mydata values (6,7)"
+ ,
+ "select * from main.mydata"
+ )
+ )
+ )
              x          y
96   1.58024523  1.3937920
97  -1.79352203  0.2105787
98   0.02632872 -1.0567890
99  -0.60934162 -0.1359667
100  1.43393159 -0.9396326
101  6.00000000  7.0000000


Delete

> sqldf( 
+ c(
+ "delete from mydata where x < 0 or y < 0"
+ ,
+ "select * from main.mydata"
+ )
+ )
            x          y
1  0.19136277 0.15789771
2  0.30496074 0.60550873
3  2.37732485 1.18004342
4  0.46417906 1.71431305
5  1.16290585 1.17154756
6  0.49335335 0.19904607
7  1.45769371 0.08291387
8  0.78473338 1.07769098
9  0.69043300 1.35040512
10 1.47893118 1.01057351
.....


Have a nice read!