Pull the (character) strings with stringi 0.5-2

A reliable string processing toolkit is a must-have for any data scientist.

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds). As of now, about 850 CRAN packages depend (either directly or recursively) on stringi, and quite recently the package was listed among the top downloaded R extensions.

# install.packages("stringi") or update.packages()
library("stringi")
stri_info(TRUE)
## [1] "stringi_0.5.2; en_US.UTF-8; ICU4C 55.1; Unicode 7.0"
apkg <- available.packages(contriburl="http://cran.rstudio.com/src/contrib")
length(tools::dependsOnPkgs('stringi', installed=apkg, recursive=TRUE))
## [1] 845

Refer to the INSTALL file for more details if you compile stringi from sources (this mostly concerns Linux users).

Here’s a list of changes in version 0.5-2. There are major new features (like date and time processing) as well as minor ones, enhancements, and bugfixes. In this release we also focused on giving users of the stringr package an even better string processing experience: as of its 1.0.0 release, stringr is powered by stringi.

  • [BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.
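For example (a quick sketch on a made-up input; padding is applied on the left by default):
stri_pad("stringi", width=10)
## [1] "   stringi"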

  • [GENERAL] #69: stringi is now bundled with ICU4C 55.1.

  • [NEW FUNCTIONS] #137: date-time formatting/parsing (note that this is a draft API and it may change in future stringi releases; any comments are welcome):
    • stri_timezone_list() – lists all known time zone identifiers
    sample(stri_timezone_list(), 10)
    ##  [1] "Etc/GMT+12"                  "Antarctica/Macquarie"       
    ##  [3] "Atlantic/Faroe"              "Antarctica/Troll"           
    ##  [5] "America/Fort_Wayne"          "PLT"                        
    ##  [7] "America/Goose_Bay"           "America/Argentina/Catamarca"
    ##  [9] "Africa/Juba"                 "Africa/Bissau"
    • stri_timezone_set(), stri_timezone_get() – manage current default time zone
    • stri_timezone_info() – basic information on a given time zone
    str(stri_timezone_info('Europe/Warsaw'))
    ## List of 6
    ##  $ ID              : chr "Europe/Warsaw"
    ##  $ Name            : chr "Central European Standard Time"
    ##  $ Name.Daylight   : chr "Central European Summer Time"
    ##  $ Name.Windows    : chr "Central European Standard Time"
    ##  $ RawOffset       : num 1
    ##  $ UsesDaylightTime: logi TRUE
    stri_timezone_info('Europe/Warsaw', locale='de_DE')$Name
    ## [1] "Mitteleuropäische Normalzeit"
    • stri_datetime_symbols() – localizable date-time formatting data
    stri_datetime_symbols()
    ## $Month
    ##  [1] "January"   "February"  "March"     "April"     "May"      
    ##  [6] "June"      "July"      "August"    "September" "October"  
    ## [11] "November"  "December" 
    ## 
    ## $Weekday
    ## [1] "Sunday"    "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"   
    ## [7] "Saturday" 
    ## 
    ## $Quarter
    ## [1] "1st quarter" "2nd quarter" "3rd quarter" "4th quarter"
    ## 
    ## $AmPm
    ## [1] "AM" "PM"
    ## 
    ## $Era
    ## [1] "Before Christ" "Anno Domini"
    stri_datetime_symbols("th_TH_TRADITIONAL")$Month
    ##  [1] "มกราคม"  "กุมภาพันธ์"    "มีนาคม"    "เมษายน"  "พฤษภาคม" "มิถุนายน"    "กรกฎาคม"
    ##  [8] "สิงหาคม"   "กันยายน"   "ตุลาคม"    "พฤศจิกายน" "ธันวาคม"
    stri_datetime_symbols("he_IL@calendar=hebrew")$Month
    ##  [1] "תשרי"   "חשון"   "כסלו"   "טבת"    "שבט"    "אדר א׳" "אדר"   
    ##  [8] "ניסן"   "אייר"   "סיון"   "תמוז"   "אב"     "אלול"   "אדר ב׳"
    • stri_datetime_now() – return current date-time
    • stri_datetime_fstr() – convert a strptime-like format string to an ICU date/time format string
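    stri_datetime_fstr("%Y-%m-%d") # a quick illustration: yields the corresponding ICU pattern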
    • stri_datetime_format() – convert date/time to string
    stri_datetime_format(stri_datetime_now(), "datetime_relative_medium")
    ## [1] "today, 6:21:45 PM"
    • stri_datetime_parse() – convert string to date/time object
    stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd")
    ## [1] "2015-02-28 18:21:45 CET" NA
    stri_datetime_parse(c("2015-02-28", "2015-02-29"), stri_datetime_fstr("%Y-%m-%d"))
    ## [1] "2015-02-28 18:21:45 CET" NA
    stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd", lenient=TRUE)
    ## [1] "2015-02-28 18:21:45 CET" "2015-03-01 18:21:45 CET"
    stri_datetime_parse("19 lipca 2015", "date_long", locale="pl_PL")
    ## [1] "2015-07-19 18:21:45 CEST"
    • stri_datetime_create() – construct date-time objects from numeric representations
    stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
    ## [1] "2015-12-31 23:59:59 CET"
    stri_datetime_create(5775, 8, 1, locale="@calendar=hebrew") # 1 Nisan 5775 -> 2015-03-21
    ## [1] "2015-03-21 12:00:00 CET"
    stri_datetime_create(2015, 02, 29)
    ## [1] NA
    stri_datetime_create(2015, 02, 29, lenient=TRUE)
    ## [1] "2015-03-01 12:00:00 CET"
    • stri_datetime_fields() – get values for date-time fields
    stri_datetime_fields(stri_datetime_now())
    ##   Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
    ## 1 2015     6  23   18     21     45          52         26           4
    ##   DayOfYear DayOfWeek Hour12 AmPm Era
    ## 1       174         3      6    2   2
    stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")
    ##   Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
    ## 1 5775    11   6   18     21     45          56         40           2
    ##   DayOfYear DayOfWeek Hour12 AmPm Era
    ## 1       272         3      6    2   1
    stri_datetime_symbols(locale="@calendar=hebrew")$Month[
       stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")$Month
    ]
    ## [1] "Tamuz"
    • stri_datetime_add() – add specific number of date-time units to a date-time object
    x <- stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
    stri_datetime_add(x, units="months") <- 2
    print(x)
    ## [1] "2016-02-29 23:59:59 CET"
    stri_datetime_add(x, -2, units="months")
    ## [1] "2015-12-29 23:59:59 CET"
  • [NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.
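For example (a quick sketch on a made-up string, with the expected result shown; sentence boundaries are determined by ICU's BreakIterator):
stri_extract_all_boundaries("Good morning! How are you?", opts_brkiter=stri_opts_brkiter(type="sentence"))
## [[1]]
## [1] "Good morning! " "How are you?"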

  • [NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.

stri_trans_char("id.123", ".", "_")
## [1] "id_123"
stri_trans_char("babaab", "ab", "01")
## [1] "101001"
  • [NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicodish fashion than nchar(..., "width") does.
stri_width(LETTERS[1:5])
## [1] 1 1 1 1 1
nchar(stri_trans_nfkd("\u0105"), "width") # provides incorrect information
## [1] 0
stri_width(stri_trans_nfkd("\u0105")) # A and ogonek (width = 1)
## [1] 1
stri_width( # Full-width equivalents of ASCII characters:
   stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
  • [NEW FEATURE] #149: stri_pad() and stri_wrap() are now based on code point widths instead of the number of code points by default. Moreover, stri_wrap() no longer strips non-breaking, zero-width, and other special spaces by default.
x <- stri_flatten(c(
   stri_dup(LETTERS, 2),
   stri_enc_fromutf32(as.list(0xFF21:0xFF3a))
), collapse=' ')
# Note that your web browser may have problems with properly aligning
# this (try it in RStudio)
cat(stri_wrap(x, 11), sep='\n')
## AA BB CC DD
## EE FF GG HH
## II JJ KK LL
## MM NN OO PP
## QQ RR SS TT
## UU VV WW XX
## YY ZZ A B
## C D E F
## G H I J
## K L M N
## O P Q R
## S T U V
## W X Y Z
  • [NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).

  • [NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.
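For example (a sketch on a made-up string; by default the ICU line-break iterator may also break at a hyphen, whereas whitespace_only=TRUE restricts breaks to whitespace):
stri_wrap("a well-known fact", width=8)
stri_wrap("a well-known fact", width=8, whitespace_only=TRUE)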

  • [GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations)

  • [GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libc's implementation of strchr() and strstr(). This is very fast, e.g., on glibc, which utilizes the SSE2/3/4 instruction sets.

x <- stri_rand_strings(100, 10000, "[actg]")
microbenchmark::microbenchmark(
   stri_detect_fixed(x, "acgtgaa"),
   grepl("actggact", x),
   grepl("actggact", x, perl=TRUE),
   grepl("actggact", x, fixed=TRUE)
)
## Unit: microseconds
##                                expr       min        lq       mean
##     stri_detect_fixed(x, "acgtgaa")   349.153   354.181   381.2391
##                grepl("actggact", x) 14017.923 14181.416 14457.3996
##   grepl("actggact", x, perl = TRUE)  8280.282  8367.426  8516.0124
##  grepl("actggact", x, fixed = TRUE)  3599.200  3637.373  3726.6020
##      median         uq       max neval  cld
##    362.7515   391.0655   681.267   100 a   
##  14292.2815 14594.4970 15736.535   100    d
##   8463.4490  8570.0080  9564.503   100   c 
##   3686.6690  3753.4060  4402.397   100  b
  • [GENERAL] #141: a local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.

  • [GENERAL] #165: the ./configure option --disable-icu-bundle forces the use of system ICU when building the package.

  • [BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g. @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.

  • [BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.

  • [BUGFIX] #132: incorrect behavior in stri_locate_regex() for matches of zero lengths.

  • [BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.

  • [BUGFIX] #164: libicu-dev usage used to fail on Ubuntu.

  • [BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.

  • [BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.

  • [BUGFIX] #168: Build now fails if icudt is not available.

  • [BUGFIX] Force ICU u_init() call on stringi dynlib load.

  • [BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.

Enjoy! Any comments and suggestions are welcome.


SimilaR

Introduction

Being a teacher can be a very gratifying job, and if you teach programming – which happens to be your favorite hobby too – nothing can be better than that. Only one thing can spoil this dream: cheating students. As we all know, one learns programming only by writing code oneself. Copying another student's source code makes no sense at all: the student does not learn anything and, what is more, gets points for work he or she did not do.

When there are only a few homework submissions to check, it is easy to do so manually. But what if there are many? Then we need an application to automate the process. There are well-known tools for “standard” programming languages, such as MOSS or JPLAG, covering e.g. C, C++, C#, Java, Scheme, and JavaScript.

But what if we want to automate checking the similarity of R source code? Until now, no such tool was available. But things have changed.

SimilaR

SimilaR is a service designed to detect similar source code patterns in R code snippets. To create an account, you need an e-mail address in an edu domain and you must somehow prove that you are a tutor (e.g., show us your web page). Once the account is activated, you just upload your students' submissions and wait a moment for the results.

Let us see a working example. Assume that one student submitted the following file:


almostPi <- function(n)
{
# this is a function which approximate a Pi constant
stopifnot(is.numeric(n),length(n)==1,n>0,(n-floor(n))<=1e-8)

x <- runif(n,-1,1);
y <- runif(n,-1,1);
4*sum((sqrt(x^2+y^2))<=1)/n
}

pythagoreanTriples<-function(m,n)
{
stopifnot(length(m)==length(n));
a<-m^2-n^2;
b<-2*m*n;
c<-m^2+n^2;
mat<-matrix(c(a,b,c),3,length(a),byrow=TRUE); 
# I arrange triples in a matrix
l<-split(mat,col(mat));
l[[length(a)+1]] <- (a^2+b^2==c^2 & a*b*c!=0 & a*c>0)
return(l)
}

and the other one sent:


almostPi<-function (n=10000) {
stopifnot(length(n)==1,n>0,(n-floor(n))==0)
# Checking if n is a numeric vetor of length 1,
# and if it is a natural number
4*sum(sqrt(runif(n,-1,1)^2+runif(n,-1,1)^2)<=1)/n
}

pythagoreanTriples<-function(m,n){
stopifnot(is.vector(m),is.vector(n),is.numeric(c(m,n)),
length(n)==length(m),length(n)>0,all(c(n-floor(n),m-floor(m))<=1e-10),
all(c(m,n)>=0))

a<-m^2-n^2
b<-2*m*n
cc<-m^2+n^2
l<-mapply(c,a,b,cc,SIMPLIFY=FALSE)
l[length(l)+1]<-list(a^2+b^2==cc^2)
l
}

So we log into SimilaR, choose Antiplagiarism system -> New submission, and we see a view like this:

In the area marked with a green rectangle we provide a name for the new submission; we can identify a group of files by this name. In the blue rectangle we choose the smallest group of functions that are not compared with one another: a group of files, a single file, or none at all (every function is compared with every other). Since every student in our example provided her homework in a separate file, we choose the second option.

After we click Submit, we obtain:

In this view we can make sure that the system understands the uploaded files as we expect. If something is wrong, e.g. the source code has syntax errors, we will be notified at this step. Note that the comments have been removed from the source code and the indentation style has been homogenized. If everything is OK, we click the Confirm button.

After that we see the list of our submissions, along with a dynamically updated progress indicator. When a submission is ready, it moves to the top of the list and we can view it.

Let us see the results. There are 4 pairs, as there were 2 functions in each file. The pairs are ordered from the most similar to the least. Initially we see only the first 10 pairs, and we can assess each pair according to whether we believe it is similar or not. After evaluating some pairs (see the green rectangle), we can see more of them. This design is intentional: the system is based on statistical learning algorithms, and the more training data we obtain, the more useful the system will become in the future.

Summary

We hope that SimilaR will be a useful tool – one that makes evaluating the similarity of students' homework faster and more accurate, and a teacher's job more convenient. With this tool, R tutors can focus on the most important part of the teaching process: teaching, not searching for plagiarism and dishonest students. Prior to using the system, make sure you agree with the Terms and Conditions.

References

  1. Bartoszuk M., Gagolewski M., A fuzzy R code similarity detection algorithm, In: Laurent A. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part III (CCIS 444), Springer-Verlag, Heidelberg, 2014, pp. 21-30.
  2. Bartoszuk M., Gagolewski M., Detecting similarity of R functions via a fusion of multiple heuristic methods, 2015. (submitted paper)

Using Hadoop Streaming API to perform a word count job in R and C++

by Marek Gagolewski, Maciej Bartoszuk, Anna Cena, and Jan Lasek (Rexamine).

Introduction

In a recent blog post we explained how we managed to set up a working Hadoop environment on a few CentOS7 machines. To test the installation, let’s play with a simple example.

The Hadoop Streaming API allows one to run Map/Reduce jobs with arbitrary programs acting as the mapper and/or the reducer.

Files are processed line-by-line; the mappers get appropriate chunks of the input file. Each line is assumed to store information on a key-value pair. By default, the following format is used:

key1 \t val1 \n
key2 \t val2 \n

If there is no TAB character, then the value is assumed to be NULL.

Consider the simplest possible streaming job, which uses /bin/cat as both the mapper and the reducer:

hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
   -input /input/test.txt \
   -output /output \
   -mapper /bin/cat \
   -reducer /bin/cat
hdfs dfs -cat /output/part-00000

In fact, this is a Hadoop version of a program that rearranges the lines of the input file so that duplicated lines appear one after another – the output is always sorted by key. This is because the above is roughly equivalent to:

cat input | mapper | sort | reducer > output

More specifically, in our case that was:

cat input | cat | sort | cat > output

A sample Map/Reduce job

Let’s run a simple Map/Reduce job written in R and C++ (just for fun – we assume that all the nodes run the same operating system and they use the same CPU architecture).

  1. As we are in the CentOS 7 environment, we will need a newer version of R on all the nodes.
$ su
# yum install readline-devel
# cd
# wget http://cran.rstudio.com/src/base/R-3.1.2.tar.gz
# tar -zxf R-3.1.2.tar.gz
# cd R-3.1.2
# ./configure --with-x=no --with-recommended-packages=no
# make
# make install
# R
R> install.packages('stringi')
R> q()
  2. Edit yarn-site.xml (on all nodes):
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

Without this, Hadoop may complain about excessive virtual memory consumption by R.

  3. Create the script wc_mapper.R:
#!/usr/bin/env Rscript

library('stringi')
stdin <- file('stdin', open='r')

# read the input in 1024-line chunks; for each chunk emit "word<TAB>count" pairs
while (length(x <- readLines(con=stdin, n=1024L)) > 0) {
   x <- unlist(stri_extract_all_words(x)) # split the lines into words
   xt <- table(x)                         # count the words in the current chunk
   words <- names(xt)
   counts <- as.integer(xt)
   cat(stri_paste(words, counts, sep='\t'), sep='\n')
}
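
We may sanity-check the mapper's logic locally in an R session before submitting the job (a quick sketch on made-up input; no Hadoop needed):

x <- c("to be or not to be", "be quick")
xt <- table(unlist(stri_extract_all_words(x)))
stri_paste(names(xt), as.integer(xt), sep='\t')
## [1] "be\t3"    "not\t1"   "or\t1"    "quick\t1" "to\t2"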
  4. Create a source file wc_reducer.cpp:
#include <iostream>
#include <string>
#include <cstdlib>

using namespace std;

int main()
{
   string line;
   string last_word = "";
   int last_count = 0;

   // the input lines arrive sorted by key, so it suffices to aggregate
   // the counts of consecutive lines that refer to the same word
   while (getline(cin, line))
   {
      size_t found = line.find_first_of("\t");
      if (found != string::npos)
      {
         string key = line.substr(0, found);
         string value = line.substr(found); // atoi skips the leading TAB
         int valuei = atoi(value.c_str());
         // cerr << "key=" << key << " value=" << value << endl;
         if (key != last_word)
         {
            if (last_word != "") cout << last_word << "\t" << last_count << endl;
            last_word = key;
            last_count = valuei;
         }
         else
            last_count += valuei;
      }
   }
   if (last_word != "") cout << last_word << "\t" << last_count << endl;

   return 0;
}

Now it’s time to compile the above C++ source file:

$ g++ -O3 wc_reducer.cpp -o wc_reducer
  5. Let's submit a Map/Reduce job via the Hadoop Streaming API:
$ chmod 755 wc_mapper.R
$ hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
   -input /input/test.txt \
   -output /output \
   -mapper wc_mapper.R \
   -reducer wc_reducer \
   -file wc_mapper.R \
   -file wc_reducer

By the way, the Fedora 20 RPM Hadoop distribution provides the Hadoop Streaming API jar file at /usr/share/hadoop/mapreduce/hadoop-streaming.jar.

Summary

In this tutorial we showed how to submit a simple Map/Reduce job via the Hadoop Streaming API. Interestingly, we used an R script as the mapper and a C++ program as the reducer. In an upcoming blog post we’ll explain how to run a job using the rmr2 package.


Installing Hadoop 2.6.0 on CentOS 7

by Marek Gagolewski, Maciej Bartoszuk, Anna Cena, and Jan Lasek (Rexamine).

Configuring a working Hadoop 2.6.0 environment on CentOS 7 is a bit of a struggle. Here are the steps we took to set everything up so that we end up with a working Hadoop cluster. Of course, there are many tutorials on this topic all over the internet; none of the solutions presented there worked in our case. Thus, there is a high probability that this step-by-step guide will make you very frustrated too. Anyway, resolving the errors generated by Hadoop should make you understand this environment much better. No pain, no gain.

Basic CentOS setup

Let’s assume that we have a fresh CentOS install. On each node:

  1. Edit /etc/hosts
# nano /etc/hosts

Add the following lines (change IP addresses accordingly):

10.0.0.1 hmaster
10.0.0.2 hslave1
10.0.0.3 hslave2
10.0.0.4 hslave3
  2. Create user hadoop
# useradd hadoop
# passwd hadoop
  3. Set up key-based (passwordless) login:
# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hmaster
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave3
$ chmod 0600 ~/.ssh/authorized_keys

This will be useful later, when we start all the necessary Hadoop services on the slave nodes.

Installing Oracle Java SDK

  1. Download the latest Oracle JDK and save it in the /opt directory.

  2. On hmaster, unpack Java:

# cd /opt
# tar -zxf jdk-8u31-linux-x64.tar.gz
# mv jdk1.8.0_31 jdk

Now propagate /opt/jdk to all the slaves:

# scp -r jdk hslave1:/opt
# scp -r jdk hslave2:/opt
# scp -r jdk hslave3:/opt
  3. On each node, let's use the alternatives tool to set up Oracle Java as the default Java framework.
# alternatives --install /usr/bin/java java /opt/jdk/bin/java 2
# alternatives --config java # select appropriate program (/opt/jdk/bin/java)
# alternatives --install /usr/bin/jar jar /opt/jdk/bin/jar 2
# alternatives --install /usr/bin/javac javac /opt/jdk/bin/javac 2
# alternatives --set jar /opt/jdk/bin/jar
# alternatives --set javac /opt/jdk/bin/javac 

Check if everything is OK by executing java -version.

  4. Set up environment variables:
# nano /etc/bashrc

Add the following:

export JAVA_HOME=/opt/jdk
export JRE_HOME=/opt/jdk/jre
export PATH=$PATH:/opt/jdk/bin:/opt/jdk/jre/bin

And also possibly:

alias ll='ls -l --color'
alias cp='cp -i'
alias mv='mv -i'
alias rm='rm -i'

Check if everything is OK:

# source /etc/bashrc
# echo $JAVA_HOME

Installing and configuring Hadoop 2.6.0

On master:

# cd /opt
# wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
# tar -zxf hadoop-2.6.0.tar.gz
# rm hadoop-2.6.0.tar.gz
# mv hadoop-2.6.0 hadoop

Propagate /opt/hadoop to slave nodes:

# scp -r hadoop hslave1:/opt
# scp -r hadoop hslave2:/opt
# scp -r hadoop hslave3:/opt

Add the following lines to /home/hadoop/.bashrc on all the nodes (you may play with scp for that too):

export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin

Edit /opt/hadoop/etc/hadoop/core-site.xml – set up NameNode URI on every node:

<configuration>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hmaster:9000/</value>
</property>
</configuration>

Create HDFS DataNode data dirs on every node and change ownership of /opt/hadoop:

# chown hadoop /opt/hadoop/ -R
# chgrp hadoop /opt/hadoop/ -R
# mkdir /home/hadoop/datanode
# chown hadoop /home/hadoop/datanode/
# chgrp hadoop /home/hadoop/datanode/    

Edit /opt/hadoop/etc/hadoop/hdfs-site.xml – set up DataNodes:

<configuration>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>/home/hadoop/datanode</value>
</property>
</configuration>

Create HDFS NameNode data dirs on master:

# mkdir /home/hadoop/namenode
# chown hadoop /home/hadoop/namenode/
# chgrp hadoop /home/hadoop/namenode/    

Edit /opt/hadoop/etc/hadoop/hdfs-site.xml on master. Add further properties:

<property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/namenode</value>
</property>

Edit /opt/hadoop/etc/hadoop/mapred-site.xml on master.

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value> <!-- and not local (!) -->
 </property>
</configuration>

Edit /opt/hadoop/etc/hadoop/yarn-site.xml – setup ResourceManager and NodeManagers:

<configuration>
<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hmaster</value>
</property>
<property>
        <name>yarn.nodemanager.hostname</name>
        <value>hmaster</value> <!-- or hslave1, hslave2, hslave3 -->
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
</configuration>

Edit /opt/hadoop/etc/hadoop/slaves on master (so that master may start all necessary services on slaves automagically):

hmaster
hslave1
hslave2
hslave3

Now the important step: disable the firewall and IPv6 (Hadoop does not support IPv6 and there are problems with listening on all the interfaces via 0.0.0.0):

# systemctl stop firewalld

Add the following lines to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

Format NameNode:

# su hadoop
$ hdfs namenode -format

Start HDFS (as user hadoop):

$ start-dfs.sh

Check with jps whether a DataNode is running on each slave, and whether a DataNode, a NameNode, and a SecondaryNameNode are running on the master. Also try accessing http://hmaster:50070/.

Start YARN on master:

$ start-yarn.sh 

Now a NodeManager should be alive (check with jps) on every node, and additionally a ResourceManager on the master.

In other words, the master node runs a ResourceManager and a NodeManager (YARN) as well as a NameNode and a DataNode (HDFS), while each slave node acts as a NodeManager and a DataNode.

Testing Hadoop 2.6.0

You may want to check whether you are able to copy a local file to HDFS and run the standalone Hadoop "Hello World" (i.e., word count) job.

$ hdfs dfsadmin -safemode leave # only if HDFS got stuck in safe mode
$ hdfs dfs -mkdir /input
$ hdfs dfs -copyFromLocal test.txt /input
$ hdfs dfs -cat /input/test.txt | head
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input/test.txt /output1

If anything went wrong, check out /opt/hadoop/logs/*.log. Good luck :)


stringi 0.4-1 released – fast, portable, consistent character string processing

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

# install.packages("stringi") or update.packages()
library("stringi")

Here’s a list of changes in version 0.4-1. In the current release, we particularly focused on making the package’s interface more consistent with that of the well-known stringr package. For a general overview of stringi’s facilities and base R string processing issues, see e.g. here.

  • (IMPORTANT CHANGE) n_max argument in stri_split_*() has been renamed n.

  • (IMPORTANT CHANGE) simplify=TRUE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.

  • (IMPORTANT CHANGE, NEW FUNCTIONS) #120: stri_extract_words has been renamed stri_extract_all_words, stri_locate_boundaries → stri_locate_all_boundaries, and stri_locate_words → stri_locate_all_words. New functions are now available: stri_locate_first_boundaries, stri_locate_last_boundaries, stri_locate_first_words, stri_locate_last_words, stri_extract_first_words, stri_extract_last_words.

# uses ICU's locale-dependent word break iterator
stri_extract_all_words("stringi: THE string processing package for R")
## [[1]]
## [1] "stringi"    "THE"        "string"     "processing" "package"   
## [6] "for"        "R"
  • (IMPORTANT CHANGE) #111: opts_regex, opts_collator, opts_fixed, and opts_brkiter can now be supplied individually via the ... argument. In other words, you may now simply call e.g.
stri_detect_regex(c("stringi", "STRINGI"), "stringi", case_insensitive=TRUE)
## [1] TRUE TRUE

instead of:

stri_detect_regex(c("stringi", "STRINGI"), "stringi", opts_regex=stri_opts_regex(case_insensitive=TRUE))
## [1] TRUE TRUE
  • (NEW FEATURE) #110: The fixed pattern search engine's settings can now be supplied via the opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). Simple (not suitable for natural language processing) yet very fast case_insensitive pattern matching can now be performed. stri_extract_*_fixed is available again.

  • (NEW FEATURE) #23: stri_extract_all_fixed, stri_count, and stri_locate_all_fixed may now also look for overlapping pattern matches, see ?stri_opts_fixed.

stri_extract_all_fixed("abaBAaba", "ABA", case_insensitive=TRUE, overlap=TRUE)
## [[1]]
## [1] "aba" "aBA" "aba"
  • (NEW FEATURE) #129: stri_match_*_regex gained a cg_missing argument.
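For example (a quick sketch on a made-up input; the second capture group does not match here):
stri_match_first_regex("a", "(a)(b)?")
##      [,1] [,2] [,3]
## [1,] "a"  "a"  NA
stri_match_first_regex("a", "(a)(b)?", cg_missing="")
##      [,1] [,2] [,3]
## [1,] "a"  "a"  ""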

  • (NEW FEATURE) #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.

  • (NEW FEATURE) #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover Knuth’s dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.

cat(stri_wrap(stri_rand_lipsum(1), 40, 2.0), sep="\n")
## Lorem ipsum dolor sit amet, et et diam
## vitae est ut. At tristique, tincidunt
## taciti, ac egestas vestibulum magna.
## Volutpat nisl non sed ultricies nisl
## nibh magna. Nullam rhoncus ut phasellus
## sed. Congue enim libero congue massa
## eget. Ligula, quis est amet velit.
## Accumsan amet nunc ad. Porttitor,
## sed vestibulum diam vestibulum quis
## sed gravida ultrices. Per urna enim.
## Scelerisque interdum sed vestibulum
## rhoncus quis imperdiet pharetra. Sapien
## iaculis, lacinia ac cras ante, sed
## vitae inceptos dis tristique dignissim.
## Venenatis volutpat lectus sodales,
## hac feugiat molestie mollis. A, urna
## pellentesque ante himenaeos ante at
## potenti in.
  • (NEW FEATURE) #122: stri_subset() gained an omit_na argument.
stri_subset_fixed(c("abc", NA, "def"), "a")
## [1] "abc" NA
stri_subset_fixed(c("abc", NA, "def"), "a", omit_na=TRUE)
## [1] "abc"
  • (NEW FEATURE) stri_list2matrix() gained an n_min argument.
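For example (a quick sketch; n_min guarantees a minimal number of rows in the result):
stri_list2matrix(list(c("a", "b"), "c"), n_min=3)
##      [,1] [,2]
## [1,] "a"  "c" 
## [2,] "b"  NA  
## [3,] NA   NA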

  • (NEW FEATURE) #126: stri_split() now is also able to act just like stringr::str_split_fixed().

stri_split_regex(c("bab", "babab"), "a", n = 3, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "b"  "b"  ""  
## [2,] "b"  "b"  "b"
  • (NEW FEATURE) #119: stri_split_boundaries() now has n, tokens_only, and simplify arguments. Additionally, stri_extract_all_words() is now equipped with a simplify argument.
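For example (a quick sketch on made-up strings):
stri_extract_all_words(c("stringi rocks", "R too"), simplify=TRUE)
##      [,1]      [,2]   
## [1,] "stringi" "rocks"
## [2,] "R"       "too"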

  • (NEW FEATURE) #116: stri_paste() gained a new argument: ignore_null. Setting it to TRUE makes this function more compatible with paste().

for (test in c(TRUE, FALSE))
   print(stri_paste("a", if (test) 1:9, ignore_null=TRUE))
## [1] "a1" "a2" "a3" "a4" "a5" "a6" "a7" "a8" "a9"
## [1] "a"

Enjoy! Any comments and suggestions are welcome.


Faster, easier, and more reliable character string processing with stringi 0.3-1

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

# install.packages("stringi") or update.packages()
library("stringi")

stringi is an R package providing (but definitely not limited to) equivalents of nearly all the character string processing functions known from base R. While developing the package, we had high performance and portability of its facilities in mind.

We implemented each string processing function from scratch. Internationalization and globalization support, as well as many string processing facilities (like regex searching), are guaranteed by the well-known IBM ICU4C library.

Here is a very general list of the most important features available in the current version of stringi:

  • string searching:
    • with ICU (Java-like) regular expressions,
    • ICU USearch-based, locale-aware string searching (quite slow, but working properly e.g. for strings that are not Unicode-normalized),
    • very fast, locale-independent byte-wise pattern matching;
  • joining and duplicating strings;
  • extracting and replacing substrings;
  • string trimming, padding, and text wrapping (e.g. with Knuth’s dynamic word wrap algorithm);
  • text transliteration;
  • text collation (comparing, sorting);
  • text boundary analysis (e.g. for extracting individual words);
  • random string generation;
  • Unicode normalization;
  • character encoding conversion and detection;

and many more.

Here’s a list of changes in version 0.3-1:

  • (IMPORTANT CHANGE) #87: %>% overlapped with the pipe operator from the magrittr package; each such operator has now been renamed, e.g. %>% to %s>%.

  • (IMPORTANT CHANGE) #108: Now the BreakIterator (for text boundary analysis) may be better controlled via stri_opts_brkiter() (see options type and locale which aim to replace now-removed boundary and locale parameters to stri_locate_boundaries, stri_split_boundaries, stri_trans_totitle, stri_extract_words, stri_locate_words).

    For example:

test <- "The\u00a0above-mentioned    features are very useful. Warm thanks to their developers. 123 456 789"
stri_split_boundaries(test, stri_opts_brkiter(type="word", skip_word_none=TRUE, skip_word_number=TRUE)) # cf. stri_extract_words
## [[1]]
##  [1] "The"        "above"      "mentioned"  "features"   "are"       
##  [6] "very"       "useful"     "Warm"       "thanks"     "to"        
## [11] "their"      "developers"
stri_split_boundaries(test, stri_opts_brkiter(type="sentence")) # extract sentences
## [[1]]
## [1] "The above-mentioned    features are very useful. "
## [2] "Warm thanks to their developers. "                
## [3] "123 456 789"
stri_split_boundaries(test, stri_opts_brkiter(type="character")) # extract characters
## [[1]]
##  [1] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t" "i" "o" "n"
## [18] "e" "d" " " " " " " " " "f" "e" "a" "t" "u" "r" "e" "s" " " "a" "r"
## [35] "e" " " "v" "e" "r" "y" " " "u" "s" "e" "f" "u" "l" "." " " "W" "a"
## [52] "r" "m" " " "t" "h" "a" "n" "k" "s" " " "t" "o" " " "t" "h" "e" "i"
## [69] "r" " " "d" "e" "v" "e" "l" "o" "p" "e" "r" "s" "." " " "1" "2" "3"
## [86] " " "4" "5" "6" " " "7" "8" "9"

By the way, the last call also works correctly for strings not in the Unicode Normalization Form C:

stri_split_boundaries(stri_trans_nfkd("zażółć gęślą jaźń"), stri_opts_brkiter(type="character"))
## [[1]]
##  [1] "z" "a" "ż"  "ó"  "ł" "ć"  " " "g" "ę"  "ś"  "l" "ą"  " " "j" "a" "ź"  "ń"
  • (NEW FUNCTIONS) #109: stri_count_boundaries and stri_count_words count the number of text boundaries in a string.
stri_count_words("Have a nice day!")
## [1] 4
  • (NEW FUNCTIONS) #41: stri_startswith_* and stri_endswith_* determine whether a string starts or ends with a given pattern.
stri_startswith_fixed(c("a1o", "a2g", "b3a", "a4e", "c5a"), "a")
## [1]  TRUE  TRUE FALSE  TRUE FALSE
  • (NEW FEATURE) #102: stri_replace_all_* gained a vectorize_all parameter, which defaults to TRUE for backward compatibility.
stri_replace_all_fixed("The quick brown fox jumped over the lazy dog.",
     c("quick", "brown", "fox"), c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The slow black bear jumped over the lazy dog."
# Compare the results:
stri_replace_all_fixed("The quicker brown fox jumped over the lazy dog.",
     c("quick", "brown", "fox"), c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The slower black bear jumped over the lazy dog."
stri_replace_all_regex("The quicker brown fox jumped over the lazy dog.",
     "\\b"%s+%c("quick", "brown", "fox")%s+%"\\b", c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The quicker black bear jumped over the lazy dog."
  • (NEW FUNCTIONS) #91: stri_subset_*, a convenient and more efficient substitute for str[stri_detect_*(str, ...)], added.
stri_subset_regex(c("john@office.company.com", "steve1932@g00gl3.eu", "no email here"),
   "^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\\.)+[A-Za-z]{2,4}$")
## [1] "john@office.company.com" "steve1932@g00gl3.eu"
  • (NEW FEATURE) #100: stri_split_fixed, stri_split_charclass, stri_split_regex, stri_split_coll gained a tokens_only parameter, which defaults to FALSE for backward compatibility.
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=1, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab"
## 
## [[2]]
## [1] "d"
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=2, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab" "c" 
## 
## [[2]]
## [1] "d"  "ef"
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab" "c" 
## 
## [[2]]
## [1] "d"  "ef" "g" 
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
  • (NEW FUNCTION) #105: stri_list2matrix converts lists of atomic vectors to character matrices, useful in connection with stri_split and stri_extract.
stri_list2matrix(stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE))
##      [,1] [,2] [,3] [,4]
## [1,] "ab" "d"  "h"  NA  
## [2,] "c"  "ef" NA   NA  
## [3,] NA   "g"  NA   NA
  • (NEW FEATURE) #107: stri_split_* now allow setting an omit_empty=NA argument.
stri_split_fixed("a_b_c__d", "_", omit_empty=FALSE)
## [[1]]
## [1] "a" "b" "c" ""  "d"
stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)
## [[1]]
## [1] "a" "b" "c" "d"
stri_split_fixed("a_b_c__d", "_", omit_empty=NA)
## [[1]]
## [1] "a" "b" "c" NA  "d"
  • (NEW FEATURE) #106: stri_split and stri_extract_all gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] "h"  NA   NA  
## [4,] NA   NA   NA
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=FALSE, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] ""   "h"  NA  
## [4,] ""   NA   NA
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=NA, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] NA   "h"  NA  
## [4,] NA   NA   NA
  • (NEW FUNCTION) #77: stri_rand_lipsum generates (pseudo)random dummy lorem ipsum text.
cat(sapply(
   stri_wrap(stri_rand_lipsum(3), 80, simplify=FALSE),
   stri_flatten, collapse="\n"), sep="\n\n")
## Lorem ipsum dolor sit amet, eu turpis pellentesque est, lectus, vestibulum.
## Iaculis et nam ad eu morbi, ultrices enim pellentesque est fusce. Etiam
## ipsum varius, maecenas dapibus. Netus molestie non adipiscing netus,
## aptent sed malesuada, placerat suscipit. A, sed eu luctus imperdiet odio
## tempor. In velit ut vel feugiat felis eros risus. Sed sapien, facilisis
## ullamcorper, senectus efficitur sit id sociis sed purus. Ipsum, a, blandit
## faucibus. In vivamus, duis et sed sollicitudin maximus. Sodales magnis
## ac senectus facilisis, dolor faucibus a. Cursus in cum, cubilia egestas
## ut platea turpis. Maximus sit vel cursus nec in vel, eu, lacinia in ut.
## 
## Libero maximus potenti penatibus amet nisl non ut. Commodo nullam rhoncus,
## bibendum quisque sem aliquam sed, quam enim et, sed. Lacinia netus inceptos
## sapien nostra tincidunt facilisis montes nascetur non pharetra convallis
## id. Netus diam nulla montes nec tincidunt facilisis eros porttitor nisl urna
## cubilia. Aliquet egestas mus nisl, nisi vehicula, ac mauris rutrum, felis
## aenean tristique magna. Ante maecenas phasellus id class. Finibus iaculis purus
## volutpat posuere phasellus magna class blandit augue morbi torquent. Taciti
## ullamcorper venenatis at nulla eget auctor ante neque metus sed metus. Dolor,
## platea sit sed pellentesque ipsum. Dapibus sed nisi vestibulum ex integer.
## 
## Duis iaculis sapien habitasse, facilisi habitasse leo nam. Egestas,
## libero tempor purus in. Aliquam himenaeos conubia egestas cum vestibulum
## nec. Sociosqu mauris cum mus non lobortis eu et dapibus vel integer.
## Blandit quis inceptos cursus vel pellentesque lectus amet egestas.
## Pharetra ac eros nisi. Finibus nec, ac congue in molestie sed.
## Tincidunt faucibus a interdum facilisis, sed nulla, tortor, felis,
## sociis. Sem porttitor himenaeos pharetra nec eu torquent elementum.
  • (NEW FEATURE) #98: stri_trans_totitle gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when performing the case mapping.
stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
    stri_opts_brkiter(type="word")) # default boundary
## [1] "Good-Old Cookie Monster Is Watching You. Here He Comes!"
stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
    stri_opts_brkiter(type="sentence"))
## [1] "Good-old cookie monster is watching you. Here he comes!"
  • (NEW FEATURE) stri_wrap gained a new parameter: normalize.

  • (BUGFIX) #86: stri_*_fixed, stri_*_coll, and stri_*_regex could give incorrect results if one of the search strings was of length 0.

  • (BUGFIX) #99: stri_replace_all did not use the replacement arg.

  • (BUGFIX) #94: R CMD check should no longer fail if icudt download failed.

  • (BUGFIX) #112: Some of the objects were not PROTECTed from being garbage collected, which might have caused spontaneous SEGFAULTS.

  • (BUGFIX) Some collator’s options were not passed correctly to ICU services.

  • (BUGFIX) Memory leaks detected with valgrind --tool=memcheck --leak-check=full have been removed.

  • (DOCUMENTATION) Significant extensions/clean ups in the stringi manual.

    Check it out yourself. In particular, take a glimpse at stringi-search-regex, stringi-search-charclass and, more generally, at stringi-search.

Enjoy! Any comments and suggestions are welcome.


R now will keep children away from drugs

Do you find this plot fancy? If so, you can find the code at the end of this article. BUT if you spend a little time reading the article thoroughly, you will learn how to create even better ones.

We would like to encourage you and your children (or children you teach) to use our new R package – TurtleGraphics!

The TurtleGraphics package offers R users the functionality of the “turtle graphics” known from the Logo educational programming language. The main idea behind it is to inspire children to learn programming and to show them that working with a computer can be entertaining and creative.

It is very elementary and clear, and it requires only basic algorithmic thinking skills – ones that even children are able to develop. You can learn it in just five short steps.

  • turtle_init() – To start the program call the turtle_init() function. It creates a plot region (called “Terrarium”) and places the Turtle in the middle pointing north.
library(TurtleGraphics)
turtle_init()

  • turtle_forward() and turtle_backward() – The argument to these functions is the distance you want the Turtle to move. For example, to move the Turtle forward by 10 units, use the turtle_forward() function. To move the Turtle backwards, use the turtle_backward() function.
turtle_forward(dist=15)

  • turtle_turn() – turtle_right() and turtle_left() change the Turtle's direction by a given angle.
turtle_right(angle=30)

  • turtle_up() and turtle_down() – To prevent the path from being drawn, simply use the turtle_up() function. Let's consider a simple example: after calling turtle_up(), moving forward no longer leaves a visible path. If you want the path to be drawn again, call the turtle_down() function.
turtle_up()
turtle_forward(dist=10)
turtle_down()
turtle_forward(dist=10)

  • turtle_show() and turtle_hide() – Similarly, you may show or hide the Turtle image using the turtle_show() and turtle_hide() functions, respectively. If you call a lot of functions, it is strongly recommended to hide the Turtle first, as this speeds up plotting.
turtle_hide()
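
Putting these basics together, here is a minimal sketch that draws a square:

turtle_init()
for (i in 1:4) {
   turtle_forward(dist=10)
   turtle_right(angle=90)
}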

These were just the basics of the package. Below we show you the true potential of it:

turtle_star <- function(intensity=1) {
   y <- sample(1:657, 360*intensity, replace=TRUE)
   for (i in 1:(360*intensity)) {
      turtle_right(90)
      turtle_col(colors()[y[i]])
      x <- sample(1:100, 1)
      turtle_forward(x)
      turtle_up()
      turtle_backward(x)
      turtle_down()
      turtle_left(90)
      turtle_forward(1/intensity)
      turtle_left(1/intensity)
   }
}
turtle_init(500, 500)
turtle_left(90)
turtle_do({
   turtle_star(7)
})

One may wonder what the turtle_do() function is doing here. It is a more advanced way of using the package: turtle_do() is designed for evaluating more complicated plot expressions, because it automatically hides the Turtle before starting the operations, which results in faster plotting.

drawTriangle <- function(points) {
   turtle_setpos(points[1,1], points[1,2])
   turtle_goto(points[2,1], points[2,2])
   turtle_goto(points[3,1], points[3,2])
   turtle_goto(points[1,1], points[1,2])
}

getMid <- function(p1, p2) c((p1[1]+p2[1])/2, (p1[2]+p2[2])/2)

sierpinski <- function(points, degree) {
   drawTriangle(points)
   if (degree > 0) {
      p1 <- matrix(c(points[1,], getMid(points[1,], points[2,]),
                     getMid(points[1,], points[3,])), nrow=3, byrow=TRUE)
      sierpinski(p1, degree-1)
      p2 <- matrix(c(points[2,], getMid(points[1,], points[2,]),
                     getMid(points[2,], points[3,])), nrow=3, byrow=TRUE)
      sierpinski(p2, degree-1)
      p3 <- matrix(c(points[3,], getMid(points[3,], points[2,]),
                     getMid(points[1,], points[3,])), nrow=3, byrow=TRUE)
      sierpinski(p3, degree-1)
   }
   invisible(NULL)
}

turtle_init(520, 500, "clip")
p <- matrix(c(10, 10, 510, 10, 250, 448), nrow=3, byrow=TRUE)
turtle_col("red")
turtle_do(sierpinski(p, 6))
turtle_setpos(250, 448)

We kindly invite you to use TurtleGraphics! Enjoy!
A full tutorial of this package is available here.

Marcin Kosinski, m.p.kosinski@gmail.com
Natalia Potocka, natalia-po@hotmail.com


Playing with GUIs in R with RGtk2

Sometimes, when we create some nice functions that we would like to show to people who do not know R, we can do one of two things: we can teach them R, which is not an easy task and also takes time, or we can make a GUI that allows them to use these functions without any knowledge of R. This post is my first attempt at creating a GUI in R. Although it can be done in many ways, we will use the RGtk2 package, so before we start you will need:

require("RGtk2")

I will try to show you how to make a GUI by example. I want to make an application which works like a calculator. It should have two text fields: the first with an expression to calculate and the second with the result. I want to include a button which triggers the calculation; an error message should be displayed when there is a mistake in the expression. I also want two buttons for inserting sin() and cos() into the text field. The last thing is a combobox allowing us to choose between an integer and a double result.

Firstly, we need to make a window and a frame.

window <- gtkWindow()
window["title"] <- "Calculator"


frame <- gtkFrameNew("Calculate")
window$add(frame)

It should look like this:

Now, let's make two boxes: into the first box we will put components vertically, and into the second box horizontally.

box1 <- gtkVBoxNew()
box1$setBorderWidth(30)
frame$add(box1)   #add box1 to the frame

box2 <- gtkHBoxNew(spacing= 10) #distance between elements
box2$setBorderWidth(24)

This should look exactly as before, because we don't have any components in the boxes yet; box2 isn't even added to our window. So let's put some elements in.

TextToCalculate <- gtkEntryNew() #text field with expression to calculate
TextToCalculate$setWidthChars(25)
box1$packStart(TextToCalculate)

label = gtkLabelNewWithMnemonic("Result") #text label
box1$packStart(label)

result<- gtkEntryNew() #text field with result of our calculation
result$setWidthChars(25)
box1$packStart(result)

box2 <- gtkHBoxNew(spacing= 10) # distance between elements
box2$setBorderWidth(24)
box1$packStart(box2)

Calculate <- gtkButton("Calculate")
box2$packStart(Calculate,fill=F) #button which will start calculating

Sin <- gtkButton("Sin") #button to paste sin() to TextToCalculate
box2$packStart(Sin,fill=F)

Cos <- gtkButton("Cos") #button to paste cos() to TextToCalculate
box2$packStart(Cos,fill=F)

model<-rGtkDataFrame(c("double","integer"))
combobox <- gtkComboBox(model)
#combobox allowing to decide whether we want result as integer or double

crt <- gtkCellRendererText()
combobox$packStart(crt)
combobox$addAttribute(crt, "text", 0)

gtkComboBoxSetActive(combobox,0)
box2$packStart(combobox)

Now we should have something like this:

Note that our window gets bigger when we put bigger components into it. However, nothing works as intended yet: we need to tell the buttons what to do when we click them:

DoCalculation <- function(button)
{
   # if there is no text, do nothing
   if ((TextToCalculate$getText()) == "") return(invisible(NULL))

   # display an error dialog if R fails at calculating
   tryCatch(
      if (gtkComboBoxGetActive(combobox) == 0)
         result$setText((eval(parse(text=TextToCalculate$getText()))))
      else
         result$setText(as.integer(eval(parse(text=TextToCalculate$getText())))),
      error=function(e)
      {
         ErrorBox <- gtkDialogNewWithButtons("Error", window, "modal",
            "gtk-ok", GtkResponseType["ok"])
         box1 <- gtkVBoxNew()
         box1$setBorderWidth(24)
         ErrorBox$getContentArea()$packStart(box1)

         box2 <- gtkHBoxNew()
         box1$packStart(box2)

         ErrorLabel <- gtkLabelNewWithMnemonic("There is something wrong with your text!")
         box2$packStart(ErrorLabel)

         response <- ErrorBox$run()
         if (response == GtkResponseType["ok"])
            ErrorBox$destroy()
      }
   )
}


PasteSin <- function(button)
{
   TextToCalculate$setText(paste(TextToCalculate$getText(), "sin()", sep=""))
}

PasteCos <- function(button)
{
   TextToCalculate$setText(paste(TextToCalculate$getText(), "cos()", sep=""))
}

# the button argument is never used inside these functions,
# but without it gSignalConnect would not work
gSignalConnect(Calculate, "clicked", DoCalculation)
gSignalConnect(Sin, "clicked", PasteSin)
gSignalConnect(Cos, "clicked", PasteCos)

Now it works as planned.

We also get a nice error message when the expression is invalid.

Wiktor Ryciuk
wryciuk@poczta.onet.pl


Text mining in R – Automatic categorization of Wikipedia articles

Text mining is currently a hot topic in data analysis. The enormous text data resources available on the Internet have made it an important component of the Big Data world. The potential of the information hidden in words is the reason why I find it worth knowing what is going on in this field.

I wanted to learn about R's text analysis capabilities, and this post is the result of my small research. More precisely, this is an example of (hierarchical) categorization of Wikipedia articles. I share the source code here and explain it, so that everyone can try it out on various articles.

I use the tm package, which provides a set of tools for text mining. The stringi package is also useful here, for string processing.

First of all, we have to load the data. In the variable titles I list the titles of some Wikipedia articles: 5 mathematical terms (3 of them about integrals), 3 painters, and 3 writers. After loading the articles (as texts – HTML page sources), we make a container for them, called a “corpus”. It is a structure for storing text documents – just a kind of list containing the text documents and the metadata that concerns them.

library(tm)
library(stringi)
library(proxy)
wiki <- "http://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
    "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
    "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))

for (i in 1:length(titles)) {
    articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), collapse = " ")
}

docs <- Corpus(VectorSource(articles))

As we have already loaded the data, we can start to process the text documents. This is the first step of text analysis, and an important one, because the way we prepare the data strongly affects the results. Now we apply the function tm_map to the corpus, which works like lapply for a list. What we do here is:

  1. Replace all "<...>" (HTML tag) elements with a space. We do this because they are not part of the text but merely HTML code.
  2. Replace all "\t" characters with a space.
  3. Convert the previous result (of type character) to "PlainTextDocument", so that we can apply the other functions from the tm package, which require this type of argument.
  4. Remove extra whitespace from the documents.
  5. Remove punctuation marks.
  6. Remove words which we find redundant for text mining (e.g. pronouns, conjunctions). We specify these words as stopwords("english"), which is a built-in list for the English language (this argument is passed to the function removeWords).
  7. Transform characters to lower case.
docs[[1]]

docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))

docs4 <- tm_map(docs3, PlainTextDocument)
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)
docs8 <- tm_map(docs7, tolower)

docs8[[1]]

We can look at the results of the “cleaned” text. Instead of this:


“The volume of irregular objects can be measured with precision by the fluid < a href=”/wiki/Displacement_(fluid)” title=”Displacement (fluid)”>displaced</a> as the object is submerged; see < a href=”/wiki/Archimedes” title=”Archimedes”>Archimedes</a>’s <a href=”/wiki/Eureka_(word)” title=”Eureka (word)”>Eureka</a>.”

now we have this:


“the volume irregular objects can measured precision fluid displaced object submerged see archimedes s eureka”

Now we are ready to proceed to the heart of the analysis. The starting point is creating a “term-document matrix”, which describes the frequency of each term in each document of the corpus. This is a fundamental object in text analysis. Based on it, we create a matrix of dissimilarities – it measures the dissimilarity between documents (the function dissimilarity returns an object of class dist, which is convenient because clustering functions require this type of argument). At last we apply the function hclust (but it could be any clustering function) and see the result on the plot.

docsTDM <- TermDocumentMatrix(docs8)

docsdissim <- dissimilarity(docsTDM, method = "cosine")

docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
docsdissim2
h <- hclust(docsdissim, method = "ward")
plot(h, labels = titles, sub = "")

As we can see, the result is perfect here – of course it is, because the chosen articles are easy to categorize. On the left side, the writers form one small cluster and the painters a second one; both clusters then merge into a bigger cluster of people. On the right side, the integrals form one cluster, which the two remaining terms then join, together making a bigger cluster of mathematical terms.

This example shows only a small piece of R's text mining capabilities. I think that you can just as easily carry out other text analyses, such as concept extraction, sentiment analysis, and information extraction in general.

I give some sources for more information about text mining in R: cran.r-project, r-bloggers, onepager.togaware.com, jstatsoft.org.

Norbert Ryciak
norbertryciak@gmail.com


ICU Unicode text transforms in the R package stringi

The ICU (International Components for Unicode) library provides very powerful and flexible ways to apply various Unicode text transforms. These include:

  • Full (language-specific) case mappings,
  • Unicode normalization,
  • Text transliteration (e.g. script-to-script conversion).

All of these are available to R programmers/users via our still maturing stringi package.

Case Mappings

Mapping of upper-, lower-, and title-case characters may seem to be a straightforward task, but just a quick glimpse at the latest Unicode standard (Secs. 3.13, 4.2, and 5.18) will suffice to convince us that case mapping rules are very complex. In one of my previous posts I've already mentioned that “base R” performs (at least on my machine) only a single-character case conversion:

toupper("groß") # German: -> GROSS
## [1] "GROß"

Notably, the case conversion in R is language-dependent:

toupper("ıi") # Polish locale is default here
## [1] "II"
oldloc <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "tr_TR.UTF-8")  # Turkish

toupper("ıi") # dotless i and latin i -> latin I and I with dot above (OK)
## [1] "Iİ"
Sys.setlocale("LC_CTYPE", oldloc)

This language-sensitivity is of course desirable when it comes to natural language processing. Unfortunately, a few more examples might be found for which toupper() and tolower() do not meet the Unicode guidelines. Generally, a proper case map can change the number of code points/units of a string, is language-sensitive, and is context-sensitive (a character's mapping may depend on its surrounding characters). Luckily, we have the case mapping facilities implemented in the ICU library, which provide us with all we need:

library(stringi)
stri_trans_toupper("groß", locale = "de_DE") # German
## [1] "GROSS"
stri_trans_totitle("ijsvrij yoghurt", locale = "nl_NL")  # Dutch
## [1] "IJsvrij Yoghurt"
stri_trans_toupper("ıi", locale = "tr_TR")
## [1] "Iİ"
stri_trans_tolower("İI", locale = "tr_TR")
## [1] "iı"

By the way, ICU doesn't have any list of non-capitalized words for language-dependent title casing (e.g. pining for the fjords in English is most often mapped to Pining for the Fjords), so such tasks must be performed manually.

Unicode Normalization

The following string:

'\u01f1DZ'
## [1] "DZDZ"

consists of 3 Unicode code points: the Unicode character LATIN CAPITAL LETTER DZ (U+01F1) and then the Latin letters D and Z. Even though the two DZs may look different in your Web browser, they appear (almost) identical in RStudio (at least on my computer). Try it yourself – it's really interesting.

A tricky question: how many DZs are there in the above string? 2 or 1? Considering raw code points (in a byte-wise manner) we'd answer 1, but for natural language processing a better answer is probably 2. This is one of the cases in which Unicode normalization (see here and here for more information) is of interest.

Without going into much detail, let's just say that there are a few normalization forms (NFC, NFD, NFKC, NFKD, NFKC_Casefold); each one serves different purposes. Unless you're an author of a string processing package, these won't interest you too much (it's the developer's responsibility to provide on-the-fly normalization). Anyway, Unicode normalization may be performed with ICU:

stri_trans_nfkc('\u01f1DZ')
## [1] "DZDZ"
stri_trans_nfc('a\u0328') # a and ogonek => a with ogonek
## [1] "ą"
stri_trans_nfkc("\ufdfa") # 1 codepoint -> 18 codepoints
## [1] "صلى الله عليه وسلم"

Fortunately, an ordinary user may keep calm: many string processing tasks in stringi just take care of a proper transformation automatically. This includes string searching, sorting, and comparing functions:

stri_count_coll('\u01f1DZ', 'DZ', stri_opts_collator(strength=2)) # how many DZs are there?
## [1] 2
'ą' %==% 'a\u0328' # are they canonically equivalent?
## [1] TRUE

General Text Transforms

If you were patient and persistent enough with reading this post and arrived at this very section, here's the frosting on the cake: ICU general text transforms.

First of all, general transforms allow us to perform all the above-mentioned operations (however, they are not locale-dependent). For example:

stri_trans_general("DZDZ", "nfkc")
## [1] "DZDZ"
stri_trans_general("groß", "upper")
## [1] "GROSS"

Here, the 2nd argument of stri_trans_general denotes the transformation to apply. The list of available transforms is returned by a call to:

head(stri_trans_list())
## [1] "ASCII-Latin"       "Accents-Any"       "Amharic-Latin/BGN"
## [4] "Any-Accents"       "Any-Publishing"    "Arabic-Latin"

General text transforms can perform:

  • Hex and Character Name conversions (e.g. for escaping Unicode code points),
  • Script to Script conversion (a.k.a. text transliteration),
  • etc.

For more information on text transforms, refer to the ICU documentation. I admit that the user's guide is not easy to follow, but it may allow you to do some magic tricks with your text, so it's worth reading.

Notably, text transformations may be composed (so that many operations may be performed one by one in a single call) and we are able to tell ICU to restrict processing only to a fixed set of Unicode code points.

A bunch of examples: firstly, some script-to-script conversions (not to be confused with text translation):

stri_trans_general("stringi", "latin-greek") # script transliteration
## [1] "στριγγι"
stri_trans_general("Пётр Ильич Чайковский", "cyrillic-latin") # script transliteration
## [1] "Pëtr Ilʹič Čajkovskij"
stri_trans_general("Пётр Ильич Чайковский", "cyrillic-latin; nfd; [:nonspacing mark:] remove; nfc")  # and remove accents
## [1] "Petr Ilʹic Cajkovskij"
stri_trans_general("zażółć gęślą jaźń", "latin-ascii")   # remove diacritic marks
## [1] "zazolc gesla jazn"

What I really love in the first example above is that from ng we obtain γγ (gamma,gamma) and not νγ (nu, gamma). Cute, isn't it?

It's getting hotter with these:

stri_trans_general("w szczebrzeszynie chrząszcz brzmi w trzcinie", "pl-pl_fonipa")
## [1] "v ʂt͡ʂɛbʐɛʂɨɲɛ xʂɔ̃ʂt͡ʂ bʐmi v tʂt͡ɕiɲɛ"
# and now the same in the XSampa ASCII-range representation:
stri_trans_general("w szczebrzeszynie chrząszcz brzmi w trzcinie", "pl-pl_fonipa; ipa-xsampa")
## [1] "v s`t_s`Ebz`Es`1JE xs`O~s`t_s` bz`mi v ts`t_s\\iJE"

We've obtained the phonetic representation of a Polish text (in IPA) – try reading that tongue twister aloud (in case of any problems consult this Wikipedia article).

We may also escape a selected set of code points (to hex representation as well as e.g. to XML entities) or even completely remove them:

stri_trans_general("zażółć gęślą jaźń", "[^\\u0000-\\u007f] any-hex") # filtered
## [1] "za\\u017C\\u00F3\\u0142\\u0107 g\\u0119\\u015Bl\\u0105 ja\\u017A\\u0144"
stri_trans_general("zażółć gęślą jaźń", "[^\\u0000-\\u007f] any-hex/xml")
## [1] "za&#x17C;&#xF3;&#x142;&#x107; g&#x119;&#x15B;l&#x105; ja&#x17A;&#x144;"
stri_trans_general("zażółć gęślą jaźń", "[\\p{Z}] remove")
## [1] "zażółćgęśląjaźń"

…and play with code point names:

stri_trans_general("ą1©,", "any-name")
## [1] "\\N{LATIN SMALL LETTER A WITH OGONEK}\\N{DIGIT ONE}\\N{COPYRIGHT SIGN}\\N{COMMA}"
stri_trans_general("\\N{LATIN SMALL LETTER SHARP S}", "name-any")
## [1] "ß"

Last but not least:

stri_trans_general("Let's go -- she said", "any-publishing")
## [1] "Let’s go — she said"

Did you note the differences?

A Note on BiDi Text (Help Needed)

ICU also provides support for processing Bidirectional text (e.g. a text that consists of a mixture of Arabic/Hebrew and English). We would be very glad to implement such facilities, but, as we (developers of stringi) come from a “Latin” environment, we don't have good guidelines on how the BiDi/RTL (right-to-left) text processing functions should behave. We don't even know whether such a text displays properly in RStudio or R GUI on Windows. Therefore, we'll warmly welcome any volunteers that would like to help us with the mentioned issues (developers or just testers).

For bug reports and feature requests concerning the stringi package visit our GitHub profile or contact me via email.

So…

stri_trans_general("¡muy bueno mi amigo, hasta la vista! :-)", "es-es_FONIPA")
## [1] "¡muiβwenomiamiɣo,.astalaβista!:)"

Marek Gagolewski

Tagged with: , , , ,
Posted in Blog/R, Blog/R-bloggers