Saturday, 10 January 2009

Programmers Drawn to the Power of R

One thing I always do when I'm in Israel is buy Haaretz on Friday to read over the weekend. More precisely, I end up buying the International Herald Tribune, because it comes with the English edition of Haaretz; if it sometimes takes me almost the whole week to get through the English version, I don't even want to think how long the Hebrew one would take... But what I really like reading is Haaretz, not so much for the news from Israel (for that I look online; who needs a newspaper when there's ynet?!) but mainly for the Haaretz weekend supplements, which are fabulous. Before tackling the supplements, which I read a little at a time, like someone eating squares of chocolate, I usually skim the rest without much enthusiasm... but today I was surprised by an unexpected story in the Herald Tribune... about R! With a photograph and everything... I couldn't find it online, but here is the text...

Programmers Drawn to the Power of R
(Source: International Herald Tribune, by Ashlee Vance)

R, a free, open-source programming language, is fast becoming the lingua franca for a growing number of data analysts in corporations and academia.

As data mining enters a golden age - it is used to set ad prices, find new drugs more quickly and fine-tune financial models, among other tasks - R has been adopted by companies ranging from Google to Shell, and from Pfizer and Merck to Bank of America and the InterContinental Hotels Group.

But R has also quickly found a following because statisticians, engineers and scientists without computer programming skills find it easy to use.

While it is difficult to calculate exactly how many people use R, those most familiar with the software estimate that nearly 250,000 people work with it regularly. The popularity of R at universities could threaten SAS Institute, the privately held business software company that specializes in data analysis software. SAS, with more than $2 billion in annual revenue, has been the preferred tool of scholars and corporate managers.

"R has really become the second language for people coming out of grad school now, and there's an amazing amount of code being written for it," said Max Kuhn, associate director of nonclinical statistics at Pfizer. "You can look on the SAS message boards and see there is a proportional downturn in traffic."

SAS says it has noticed R's rising popularity at universities, despite educational discounts on its own software, but it dismisses the technology as being of interest to a limited set of people working on very hard tasks.

"I think it addresses a niche market for high-end data analysts that want free, readily available code," said Anne Milley, director of technology product marketing at SAS.

She added, "We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet."

But while SAS plays down R's corporate appeal, companies like Google and Pfizer say they use the software for just about anything they can.

Google, for example, taps R for help in understanding trends in ad pricing and for illuminating patterns in the search data it collects.

Pfizer has created customized packages for R to let its scientists manipulate their own data during nonclinical drug studies rather than send the information off to a statistician.

"R is really important - to the point that it's hard to overvalue it," said Daryl Pregibon, a research scientist at Google. "It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems."

It is also free. R is an open-source program, and its popularity reflects a shift in the type of software used inside companies. Open-source software is free for anyone to use and modify.

International Business Machines, Hewlett-Packard and Dell make billions of dollars a year selling servers that run the open-source Linux operating system, which competes with Windows from Microsoft.

Most Web sites are displayed using an open-source application called Apache, and companies increasingly rely on the open-source MySQL database to store critical information. Many people view the end results of all this technology when using the Firefox Web browser, an open-source browser.

R is similar to other programming languages, like C, Java and Perl, in that it helps people perform a wide variety of computing tasks by giving them access to various commands. For statisticians, however, R is particularly useful because it contains a number of built-in mechanisms for organizing data, running calculations on the information and creating graphical representations of data sets.

Some people familiar with R describe it as a supercharged version of Microsoft's Excel spreadsheet software that can help illuminate data trends more clearly than is possible by entering information into rows and columns.

What makes R so useful - and helps explain its quick acceptance - is that statisticians, engineers and scientists can improve the software's code or write variations for specific tasks. Packages written for R add advanced algorithms, colored and textured graphs and mining techniques to dig more deeply into databases.

Nearly 1,600 different packages reside on just one of the many Web sites devoted to R, and the number of packages has grown exponentially. One package, called BiodiversityR, offers a graphical interface to make calculations of environmental trends easier.

"The great beauty of R is that you can modify it to do all sorts of things," said Hal Varian, chief economist at Google.

R first appeared in 1996, when two statistics professors, Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand, released the code as a free software package.

According to them, the notion of devising something like R sprang up during a hallway conversation. They both wanted technology better suited for their statistics students, who needed to analyze data and produce graphical models of the information. Most comparable software had been designed by computer scientists and proved hard to use.

Lacking deep computer science training, the professors considered their coding efforts more of an academic game than anything else. Nonetheless, starting in about 1991, they worked on R full time.

"We were pretty much inseparable for five or six years," Gentleman said. "One person would do the typing and one person would do the thinking."


Originally published by The New York Times Media Group. (c) 2009 International Herald Tribune