Playing with GUIs in R with RGtk2

Sometimes we create nice functions that we want to show to people who don’t know R. We can then do one of two things: teach them R, which is not an easy task and takes time, or build a GUI that lets them use these functions without any knowledge of R. This post is my first attempt at creating a GUI in R. Although it can be done in many ways, we will use the RGtk2 package, so before we start you will need:


I will show how to build a GUI by example. I want to make an application that works like a calculator. It should have two text fields: the first with an expression to calculate and the second with the result. I want a button that triggers the calculation and displays an error message when there is a mistake in the expression. I also want two buttons that insert sin() and cos() into the text field, and a combobox that lets us choose between an integer and a double result.
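Before any GTK code, it may help to see the calculator's core logic in plain R. This is just a sketch of the approach; `calculate()` is a hypothetical helper of mine and not part of the GUI code below:

```r
# Core of the calculator, GTK aside: parse the expression text and evaluate it.
# The as_int switch mirrors what the combobox will control later.
calculate <- function(expr_text, as_int = FALSE) {
  if (expr_text == "") return(invisible(NULL))  # empty field: do nothing
  value <- eval(parse(text = expr_text))        # raises an error on bad input
  if (as_int) as.integer(value) else value
}

calculate("sin(0) + 2")            # 2
calculate("7 / 2", as_int = TRUE)  # 3
```

Everything else in this post is wiring this one-liner idea to text fields and buttons.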

First, we need to make a window and a frame.

window <- gtkWindow()
window["title"] <- "Calculator"

frame <- gtkFrameNew("Calculate")
window$add(frame)   # put the frame into the window

It should look like this:

Now, let’s make two boxes: components will be stacked vertically in the first box and horizontally in the second.

box1 <- gtkVBoxNew()
frame$add(box1)   #add box1 to the frame

box2 <- gtkHBoxNew(spacing= 10) #distance between elements

This should look exactly as before, because there are no components in the boxes yet; box2 is not even added to our window. So let’s put some elements in.

TextToCalculate <- gtkEntryNew() # text field with the expression to calculate

label <- gtkLabelNewWithMnemonic("Result") # text label

result<- gtkEntryNew() #text field with result of our calculation


Calculate <- gtkButton("Calculate")
box2$packStart(Calculate,fill=F) #button which will start calculating

Sin <- gtkButton("Sin") #button to paste sin() to TextToCalculate

Cos <- gtkButton("Cos") #button to paste cos() to TextToCalculate

model <- rGtkDataFrame(c("double", "integer")) # data model backing the combobox
combobox <- gtkComboBox(model)
# combobox allowing us to decide whether we want the result as integer or double

crt <- gtkCellRendererText()
combobox$packStart(crt)  # the renderer must be packed before binding attributes
combobox$addAttribute(crt, "text", 0)


Now we should have something like this:

Note that our window grows as we put bigger components into it. However, nothing works as intended yet. We need to tell the buttons what to do when we click them:


DoCalculation <- function(button) {
  # if no text, do nothing
  if (TextToCalculate$getText() == "") return(invisible(NULL))
  # display an error dialog if R fails at calculating
  tryCatch({
    if (gtkComboBoxGetActive(combobox) == 0)
      result$setText(eval(parse(text = TextToCalculate$getText())))
    else
      result$setText(as.integer(eval(parse(text = TextToCalculate$getText()))))
  }, error = function(e) {
    ErrorBox <- gtkDialogNewWithButtons("Error", window, "modal",
                                        "gtk-ok", GtkResponseType["ok"])
    box1 <- gtkVBoxNew()
    ErrorBox$getContentArea()$packStart(box1)
    box2 <- gtkHBoxNew()
    box1$packStart(box2)
    ErrorLabel <- gtkLabelNewWithMnemonic("There is something wrong with your text!")
    box2$packStart(ErrorLabel)
    response <- ErrorBox$run()
    if (response == GtkResponseType["ok"]) ErrorBox$destroy()
  })
}

PasteSin <- function(button) {
  TextToCalculate$setText(paste(TextToCalculate$getText(), "sin()", sep = ""))
}

PasteCos <- function(button) {
  TextToCalculate$setText(paste(TextToCalculate$getText(), "cos()", sep = ""))
}






# the button argument is never used inside the handler functions,
# but gSignalConnect passes it to them, so they must accept it
gSignalConnect(Calculate, "clicked", DoCalculation)
gSignalConnect(Sin, "clicked", PasteSin)
gSignalConnect(Cos, "clicked", PasteCos)

Now it works as planned.

We also get a nice error message.

Wiktor Ryciuk

Posted in Blog/R, Blog/R-bloggers

Text mining in R – Automatic categorization of Wikipedia articles

Text mining is currently a hot topic in data analysis. The enormous text resources available on the Internet have made it an important component of the Big Data world. The potential of the information hidden in words is the reason why I find it worth knowing what’s going on.

I wanted to learn about R’s text analysis capabilities, and this post is the result of my small research. More precisely, it is an example of (hierarchical) categorization of Wikipedia articles. I share the source code here and explain it, so that everyone can try it with various articles.

I use the tm package, which provides a set of tools for text mining. The stringi package is also useful here for string processing.

First of all, we have to load the data. In the variable titles I list some titles of Wikipedia articles: 5 mathematical terms (3 of them about integrals), 3 painters and 3 writers. After loading the articles (as texts – HTML page sources), we make a container for them called a “Corpus”. It is a structure for storing text documents – essentially a list containing the text documents and the metadata that concerns them.

wiki <- ""
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
    "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
    "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))

for (i in 1:length(titles)) {
    articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), collapse = " ")
}

docs <- Corpus(VectorSource(articles))

As we have already loaded the data, we can start to process the text documents. This is the first step of text analysis, and it is important because preparing the data strongly affects the results. We now apply the function tm_map to the corpus, which works like lapply for lists. What we do here is:

  1. Replace all “<…>” elements (i.e. HTML tags) with a space, because they are not part of the text but markup.
  2. Replace all “\t” (tab) characters with a space.
  3. Convert the previous result (of type character) to a “PlainTextDocument”, so that we can apply the other functions from the tm package, which require this type of argument.
  4. Remove extra whitespace from the documents.
  5. Remove words that are redundant for text mining (e.g. pronouns, conjunctions). We use stopwords("english"), a built-in stop word list for English, which is passed to the function removeWords.
  6. Remove punctuation marks.
  7. Transform characters to lower case.

docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))

docs4 <- tm_map(docs3, PlainTextDocument)
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)
docs8 <- tm_map(docs7, tolower)


We can look at the “cleaned” text. Instead of this:

“The volume of irregular objects can be measured with precision by the fluid <a href="/wiki/Displacement_(fluid)" title="Displacement (fluid)">displaced</a> as the object is submerged; see <a href="/wiki/Archimedes" title="Archimedes">Archimedes</a>'s <a href="/wiki/Eureka_(word)" title="Eureka (word)">Eureka</a>.”

now we have this:

“the volume irregular objects can measured precision fluid displaced object submerged see archimedes s eureka”

Now we are ready to proceed to the heart of the analysis. The starting point is creating a “term-document matrix”, which describes the frequency of each term in each document of the corpus. This is a fundamental object in text analysis. Based on it we create a matrix of dissimilarities between documents (the function dissimilarity returns an object of class dist – a convenience, because clustering functions require this type of argument). Finally we apply the function hclust (though it could be any clustering function) and see the result on the plot.

docsTDM <- TermDocumentMatrix(docs8)

docsdissim <- dissimilarity(docsTDM, method = "cosine")

docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
h <- hclust(docsdissim, method = "ward")
plot(h, labels = titles, sub = "")
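To see what the cosine dissimilarity used above actually measures, here is a hand-rolled base-R version applied to a toy term-document matrix. This is only a sketch of the idea; the real function also handles sparse matrices and other methods:

```r
# 1 - cosine similarity between the document columns of a term-document matrix
cosine_dissim <- function(m) {
  norms <- sqrt(colSums(m^2))             # Euclidean length of each document vector
  1 - crossprod(m) / outer(norms, norms)  # t(m) %*% m gives all dot products
}

# toy matrix: 3 terms (rows) x 3 documents (columns)
tdm <- cbind(doc1 = c(2, 0, 1), doc2 = c(2, 0, 1), doc3 = c(0, 3, 0))
round(cosine_dissim(tdm), 2)
# doc1/doc2 are identical -> dissimilarity 0; doc1/doc3 share no terms -> 1
```

Documents using the same words in the same proportions end up close to each other, which is exactly what the clustering step exploits.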

As we can see, the result is perfect here – though of course only because the chosen articles are easy to categorize. On the left side, the writers form one small cluster and the painters a second; both clusters then merge into a bigger cluster of people. On the right side, the integrals form one cluster, which the two remaining terms then join to form a bigger cluster of mathematical terms.

This example shows only a small piece of R’s text mining capabilities. You can proceed similarly with other text analyses, such as concept extraction, sentiment analysis, and information extraction in general.

Here are some sources with more information about text mining in R: cran.r-project, r-bloggers.

Norbert Ryciak

Posted in Blog/R, Blog/R-bloggers

ICU Unicode text transforms in the R package stringi

The ICU (International Components for Unicode) library provides very powerful and flexible ways to apply various Unicode text transforms. These include:

  • Full (language-specific) case mappings,
  • Unicode normalization,
  • Text transliteration (e.g. script-to-script conversion).

All of these are available to R programmers/users via our still maturing stringi package.

Case Mappings

Mapping of upper-, lower-, and title-case characters may seem to be a straightforward task, but a quick glimpse at the latest Unicode standard (Secs. 3.13, 4.2, and 5.18) suffices to convince us that case mapping rules are very complex. In one of my previous posts I mentioned that “base R” performs (at least on my machine) only a simple character-by-character case conversion:

toupper("groß") # German: -> GROSS
## [1] "GROß"

Notably, the case conversion in R is language-dependent:

toupper("ıi") # Polish locale is default here
## [1] "II"
oldloc <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "tr_TR.UTF-8")  # Turkish

toupper("ıi") # dotless i and latin i -> latin I and I with dot above (OK)
## [1] "Iİ"
Sys.setlocale("LC_CTYPE", oldloc)

This language-sensitivity is of course desirable when it comes to natural language processing. Unfortunately, more examples can be found for which toupper() and tolower() do not meet the Unicode guidelines. Generally, a proper case map can change the number of code points/units of a string, and is language- and context-sensitive (a character’s mapping may depend on the surrounding characters). Luckily, the case mapping facilities implemented in the ICU library provide us with all we need:

stri_trans_toupper("groß", locale = "de_DE") # German
## [1] "GROSS"
stri_trans_totitle("ijsvrij yoghurt", locale = "nl_NL")  # Dutch
## [1] "IJsvrij Yoghurt"
stri_trans_toupper("ıi", locale = "tr_TR")
## [1] "Iİ"
stri_trans_tolower("İI", locale = "tr_TR")
## [1] "iı"

By the way, ICU doesn't have any list of non-capitalized words for language-dependent title casing (e.g. pining for the fjords in English is most often mapped to Pining for the Fjords), so such tasks must be performed manually.
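A minimal way to handle this manually might look as follows. This is base R only; the function and its word list are mine, not part of stringi or ICU:

```r
# Naive English headline casing: title-case every word, then lower-case a
# hand-made list of short words, keeping the first word capitalized.
titlecase_en <- function(x, small = c("a", "an", "the", "for", "of", "in", "on")) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  words <- paste0(toupper(substring(words, 1, 1)), substring(words, 2))
  idx <- which(tolower(words) %in% small)
  idx <- idx[idx > 1]                 # never lower-case the first word
  words[idx] <- tolower(words[idx])
  paste(words, collapse = " ")
}

titlecase_en("pining for the fjords")  # "Pining for the Fjords"
```

Real headline-style rules are more subtle (hyphenated words, last word always capitalized, and so on), which is precisely why ICU leaves this to the user.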

Unicode Normalization

The following string:

## [1] "DZDZ"

consists of 3 Unicode code points: the Unicode character LATIN CAPITAL LETTER DZ (U+01F1) followed by the Latin letters D and Z. Even though the two DZs may look different in your Web browser, they appear (almost) identical in RStudio, at least on my computer. Try it yourself – it’s really interesting.

A tricky question: how many “DZ”s are in the above string – 2 or 1? Considering raw code points (in a byte-wise manner) we’d answer 1. But for natural language processing a better answer is probably 2. This is one of the cases in which Unicode normalization (see here and here for more information) is of interest.
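The base-R view of that string makes the ambiguity concrete (here U+01F1 is written with an escape so the source stays ASCII):

```r
x <- "\u01f1DZ"   # LATIN CAPITAL LETTER DZ, then plain D and Z
nchar(x)                                      # 3 code points
length(gregexpr("DZ", x, fixed = TRUE)[[1]])  # naive matching sees only 1 "DZ"
```

Naive, code-point-wise matching cannot see the "DZ" hidden inside the single-code-point digraph; normalization (or a collation-aware search, shown below) can.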

Without going into much detail, let’s just say that there are a few normalization forms (NFC, NFD, NFKC, NFKD, NFKC_Casefold), each serving a different purpose. Unless you’re an author of a string processing package, these won’t interest you too much (it’s the developer’s responsibility to provide on-the-fly normalization). Anyway, Unicode normalization may be performed with ICU:

stri_trans_nfkc("\u01f1DZ") # compatibility decomposition of the DZ digraph
## [1] "DZDZ"
stri_trans_nfc('a\u0328') # a and ogonek => a with ogonek
## [1] "ą"
stri_trans_nfkc("\ufdfa") # 1 codepoint -> 18 codepoints
## [1] "صلى الله عليه وسلم"

Fortunately, an ordinary user may keep calm: many string processing tasks in stringi just take care of a proper transformation automatically. This includes string searching, sorting, and comparing functions:

stri_count_coll('\u01f1DZ', 'DZ', stri_opts_collator(strength=2)) # how many DZs are there?
## [1] 2
'ą' %==% 'a\u0328' # are they canonically equivalent?
## [1] TRUE

General Text Transforms

If you were patient and persistent enough with reading this post and arrived at this very section, here's the frosting on the cake: ICU general text transforms.

First of all, general transforms allow us to perform all the above-mentioned operations (however, they are not locale-dependent). For example:

stri_trans_general("DZDZ", "nfkc")
## [1] "DZDZ"
stri_trans_general("groß", "upper")
## [1] "GROSS"

Here, the 2nd argument of stri_trans_general denotes the transformation to apply. The list of available transforms is returned by a call to:

head(stri_trans_list())  # the first few of several hundred transform IDs
## [1] "ASCII-Latin"       "Accents-Any"       "Amharic-Latin/BGN"
## [4] "Any-Accents"       "Any-Publishing"    "Arabic-Latin"

General text transforms can perform:

  • Hex and Character Name conversions (e.g. for escaping Unicode code points),
  • Script to Script conversion (a.k.a. text transliteration),
  • etc.

For more information on text transforms, refer to the ICU documentation. I admit that the user's guide is not easy to follow, but it may allow you to do some magic tricks with your text, so it's worth reading.

Notably, text transformations may be composed (so that many operations may be performed one by one in a single call) and we are able to tell ICU to restrict processing only to a fixed set of Unicode code points.

A bunch of examples: firstly, some script-to-script conversions (not to be confused with text translation):

stri_trans_general("stringi", "latin-greek") # script transliteration
## [1] "στριγγι"
stri_trans_general("Пётр Ильич Чайковский", "cyrillic-latin") # script transliteration
## [1] "Pëtr Ilʹič Čajkovskij"
stri_trans_general("Пётр Ильич Чайковский", "cyrillic-latin; nfd; [:nonspacing mark:] remove; nfc")  # and remove accents
## [1] "Petr Ilʹic Cajkovskij"
stri_trans_general("zażółć gęślą jaźń", "latin-ascii")   # remove diacritic marks
## [1] "zazolc gesla jazn"

What I really love about the first example above is that from ng we obtain γγ (gamma, gamma) and not νγ (nu, gamma). Cute, isn’t it?

It's getting hotter with these:

stri_trans_general("w szczebrzeszynie chrząszcz brzmi w trzcinie", "pl-pl_fonipa")
## [1] "v ʂt͡ʂɛbʐɛʂɨɲɛ xʂɔ̃ʂt͡ʂ bʐmi v tʂt͡ɕiɲɛ"
# and now the same in the XSampa ASCII-range representation:
stri_trans_general("w szczebrzeszynie chrząszcz brzmi w trzcinie", "pl-pl_fonipa; ipa-xsampa")
## [1] "v s`t_s`Ebz`Es`1JE xs`O~s`t_s` bz`mi v ts`t_s\\iJE"

We've obtained the phonetic representation of a Polish text (in IPA) – try reading that tongue twister aloud (in case of any problems consult this Wikipedia article).

We may also escape a selected set of code points (to hex representation as well as e.g. to XML entities) or even completely remove them:

stri_trans_general("zażółć gęślą jaźń", "[^\\u0000-\\u007f] any-hex") # filtered
## [1] "za\\u017C\\u00F3\\u0142\\u0107 g\\u0119\\u015Bl\\u0105 ja\\u017A\\u0144"
stri_trans_general("zażółć gęślą jaźń", "[^\\u0000-\\u007f] any-hex/xml")
## [1] "za&#x17C;&#xF3;&#x142;&#x107; g&#x119;&#x15B;l&#x105; ja&#x17A;&#x144;"
stri_trans_general("zażółć gęślą jaźń", "[\\p{Z}] remove")
## [1] "zażółćgęśląjaźń"

…and play with code point names:

stri_trans_general("ą1©,", "any-name")
stri_trans_general("\\N{LATIN SMALL LETTER SHARP S}", "name-any")
## [1] "ß"

Last but not least:

stri_trans_general("Let's go -- she said", "any-publishing")
## [1] "Let’s go — she said"

Did you note the differences?

A Note on BiDi Text (Help Needed)

ICU also provides support for processing Bidirectional text (e.g. a text that consists of a mixture of Arabic/Hebrew and English). We would be very glad to implement such facilities, but, as we (developers of stringi) come from a “Latin” environment, we don't have good guidelines on how the BiDi/RTL (right-to-left) text processing functions should behave. We don't even know whether such a text displays properly in RStudio or R GUI on Windows. Therefore, we'll warmly welcome any volunteers that would like to help us with the mentioned issues (developers or just testers).

For bug reports and feature requests concerning the stringi package visit our GitHub profile or contact me via email.


stri_trans_general("¡muy bueno mi amigo, hasta la vista! :-)", "es-es_FONIPA")
## [1] "¡muiβwenomiamiɣo,.astalaβista!:)"

Marek Gagolewski

Posted in Blog/R, Blog/R-bloggers

Counting the number of words in a LaTeX file with stringi

In my recent post I promised to present the most interesting features of the stringi package in more detail.

Here's one of such jolly features. Many LaTeX users may find it very useful.

Loading a text file with encoding auto-detection

Here's a LaTeX document consisting of a Polish poem. Probably, most of you wouldn't have been able to guess the file's character encoding if I hadn't left some hints. But it's OK, we have a little challenge.

Let's use some (currently experimental) stringi functions to guess the file's encoding.

First of all, we should read the file as a raw vector (anyway, each text file is a sequence of bytes).

# experimental functions (as per stringi_0.2-5):
download.file("",   # the source URL was elided in the original post
    dest = "powrot_taty_latin2.tex")
file <- stri_read_raw("powrot_taty_latin2.tex")
head(file, 15)
##  [1] 25 25 20 45 4e 43 4f 44 49 4e 47 20 3d 20 49

Let's try to detect the file's character encoding automatically.

stri_enc_detect(file)[[1]]  # experimental function
## $Encoding
## [1] "ISO-8859-2" "ISO-8859-1" "ISO-8859-9"
## $Language
## [1] "pl" "pt" "tr"
## $Confidence
## [1] 0.46 0.19 0.07

Encoding detection is, at best, an imprecise operation using statistics and heuristics. ICU indicates that most probably we deal with Polish text in ISO-8859-2 (a.k.a. latin2) here. What a coincidence: it's true.
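Why heuristics? Because the very same bytes are often valid in several 8-bit encodings at once, so only letter-frequency statistics can rank the candidates. A small base-R illustration (the byte values are my own example, chosen to spell Polish letters in ISO-8859-2):

```r
# four bytes that decode "successfully" under more than one 8-bit encoding
bytes <- as.raw(c(0x67, 0xea, 0xb6, 0x6c))
iconv(rawToChar(bytes), "ISO-8859-2", "UTF-8")  # "gęśl" - plausible Polish
iconv(rawToChar(bytes), "ISO-8859-1", "UTF-8")  # valid too, but gibberish
```

Both conversions succeed; only a language model can tell that the first reading is far more likely, which is what stri_enc_detect's confidence scores express.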

Let’s re-encode the file. Our target encoding will be UTF-8, as it is a “superset” of all 8-bit encodings. We really love portable code:

file <- stri_conv(file, stri_enc_detect(file)[[1]]$Encoding[1], "UTF-8")
file <- stri_split_lines1(file)  # split a string into text lines
print(file[22:28])  # text sample
## [1] ",,Pójdźcie, o dziatki, pójdźcie wszystkie razem"
## [2] ""                                               
## [3] "Za miasto, pod słup na wzgórek,"                
## [4] ""                                               
## [5] "Tam przed cudownym klęknijcie obrazem,"         
## [6] ""                                               
## [7] "Pobożnie zmówcie paciórek."

Of course, if we knew a priori that the file is in ISO-8859-2, we'd just call:

file <- stri_conv(readLines(""), 
    "ISO-8859-2", "UTF-8")

So far so good.

Word count

LaTeX word counting is quite a complicated task and there are many possible approaches to performing it. Most often they rely on running external tools (which may be a bit inconvenient for some users). Personally, I’ve always been most satisfied with the output produced by the Kile LaTeX IDE for the KDE desktop environment.
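For contrast, here is what a deliberately naive counter looks like in base R (my own sketch, nothing to do with Kile's algorithm). It only strips comments and splits on whitespace, so a command like \textbf{bold} still counts as one "word"; a proper LaTeX counter must additionally classify command, environment, and math tokens:

```r
# Naive LaTeX word count: drop comments (a % not preceded by a backslash),
# split the rest on whitespace, count non-empty tokens.
naive_words <- function(lines) {
  lines <- sub("(^|[^\\\\])%.*$", "\\1", lines)  # strip LaTeX comments (naively)
  sum(vapply(strsplit(lines, "[[:space:]]+"),
             function(w) sum(nzchar(w)), integer(1)))
}

naive_words(c("Hello world % a comment", "\\textbf{bold} text"))  # 4
```

The gap between this and a proper counter is exactly the CharsCmdEnvir/Cmds/Envirs bookkeeping in the statistics below.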

LaTeX document statistics in Kile

As not everyone has Kile installed, I decided to grab Kile’s algorithm (the power of open source!), made some not-too-invasive stringi-specific tweaks, and here we are:

stri_stats_latex(file)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##          2283           335           576           461            32 
##        Envirs 
##             2

Some other aggregates are also available (these are meaningful for any text file):

stri_stats_general(file)
##       Lines LinesNEmpty       Chars CharsNWhite 
##         232         122        3308        2930

Finally, here’s the word count for my R programming book (in Polish). Importantly, each chapter is stored in a separate .tex file (there are 30 of them), so “clicking out” the answer in Kile would be a bit problematic:

apply(sapply(
    dir(pattern = glob2rx("*.tex"), recursive = TRUE, full.names = TRUE),
    function(f) stri_stats_latex(readLines(f))  # per-file statistics
), 1, sum)
## CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs
##    718755        458403        281989        120202         37055          6119

Notably, my publisher was satisfied with the above estimate. :-)

Next time we'll take a look at ICU's very powerful transliteration services.

PS. There’s also a nice LuaTeX package called chickenize. Check that out.

More information

For more information check out the stringi package website and its on-line documentation.

For bug reports and feature requests visit our GitHub profile.

Any comments and suggestions are warmly welcome.

Marek Gagolewski

Posted in Blog/LaTeX, Blog/R, Blog/R-bloggers

(String/text processing)++: stringi 0.2-3 released

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

stringi is a package providing (but definitely not limited to) replacements for nearly all character string processing functions known from base R. While developing the package we have kept high performance and portability of its facilities in mind.

Read more ›

Posted in Blog/R, Blog/R-bloggers

ShareLaTeX now supports knitr

ShareLaTeX (click here to register a free account) is a wonderful and reliable on-line editor for writing and compiling LaTeX documents “in the cloud” as well as working together in real-time (imagine Google Docs supporting LaTeX => you get ShareLaTeX).

What’s more, the ShareLaTeX team recently announced that its users can now prepare documents using knitr! R version 3.0.2 is currently available.

Here is an example chunk in an .Rtex file that lists the installed packages:

%% begin.rcode
% cat(strwrap(paste(sort(rownames(installed.packages())), collapse=", "), 
%    width=80), sep='\n')
%% end.rcode

which results in:

## KernSmooth, MASS, Matrix, base, bitops, boot, class, cluster, codetools,
## compiler, datasets, digest, evaluate, foreign, formatR, grDevices, graphics,
## grid, highr, knitr, lattice, markdown, methods, mgcv, nlme, nnet, parallel,
## rpart, spatial, splines, stats, stats4, stringr, survival, tcltk, testit,
## tools, utils

Installation of new packages has been disabled (as well as any access to on-line resources). However, you may upload your data sets (e.g. in CSV format) to your projects and read them with R commands. It seems that the compilation takes place in a chrooted-like environment, so it is secure.


Posted in Blog/LaTeX, Blog/R, Blog/R-bloggers, Blog/R-knitr

FuzzyNumbers_0.3-3 released

A new release of the FuzzyNumbers package for R is now available on CRAN. The package provides S4 classes and methods to deal with Fuzzy Numbers that allow for computations of arithmetic operations, approximation by trapezoidal and piecewise linear FNs, visualization, etc.
Read more ›

Posted in Blog/News, Blog/R, Blog/R-bloggers

stringi 0.1-11 available for testing

We have prepared binary builds of the latest version of our string processing package stringi (Windows i386/x86_64 for R 2.15.X and 3.0.X, and OS X x86_64 for R 3.0.X). They are available – together with the source package – via the on-line installer:

Posted in Blog/News, Blog/R

stringi_0.1-9 now available for testing

stringi is THE R package for correct, fast, and simple string processing in each locale and native charset.

Another alpha release (for testing purposes) can be automatically downloaded by calling in R:


The auto-installer gives access to a Windows i386/x64 build for R 3.0 or allows building the package from sources on Linux or MacOS.

Posted in Blog/News, Blog/R

stringi-0.1 **alpha release**

The stringi package has been successfully built on Windows/i386 and Windows/x64. Today’s patches make it very likely that the package will smoothly go through the CRAN checks and the Windows compile process.

Version 0.1-6 includes an automatic installer for our ICU4C lib build for Windows.

Current source tarball and Windows binary package are available at

Have fun testing!

Posted in Blog/News, Blog/R