Big Data Programming Languages

Tuesday, November 29, 2016

Some thoughts on the advantages and disadvantages of three Big Data programming languages: R, Python, and Julia.

R

R, which has been around since 1993, has long been considered the go-to programming language for data science and statistical computing. Its core strength is vector and matrix calculation: standard arithmetic functions applied to numerical data arranged in rows and columns.

R can be used to automate huge numbers of these calculations, even when the row and column data is constantly changing or growing. It also makes it very easy to produce visualizations based on these calculations. The combination of these features has made it an extremely popular choice for crafting data science tools.
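As a rough illustration of the row-and-column arithmetic described above, here is a minimal sketch. It is written in plain Python rather than R, purely to keep the example self-contained; the data and the function names (scale, column_means) are made up for this post.

```python
# Hypothetical data: each inner list is a row, each position is a column.
table = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

def scale(rows, factor):
    """Multiply every cell by the same factor (element-wise arithmetic)."""
    return [[cell * factor for cell in row] for row in rows]

def column_means(rows):
    """Average each column across all rows."""
    return [sum(col) / len(col) for col in zip(*rows)]

doubled = scale(table, 2)
means = column_means(table)
print(doubled)  # [[2.0, 4.0, 6.0], [8.0, 10.0, 12.0]]
print(means)    # [2.5, 3.5, 4.5]

# When the data grows, the same calculation is simply run again.
table.append([7.0, 8.0, 9.0])
means_after_growth = column_means(table)
print(means_after_growth)  # [4.0, 5.0, 6.0]
```

In R the same operations are built in and work on whole vectors and matrices at once, which is much of the language's appeal for this kind of work.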

Because R has been around for more than two decades, it has a large and active community of users and enthusiasts.

Python

Python is a far more general-purpose language than R and will be more immediately familiar to anyone who has used object-oriented programming languages before.

Python’s sheer popularity has helped cement its place as the second most common tool for data science. Although it may not be quite as widely used as R in this field, its user base has been growing faster. It is also easier to pick up than R if you don’t already have a solid background in statistical computing.

Python’s user base has devoted itself to producing extensions and libraries that help it match R’s usefulness for data wrangling. This has attracted coders interested in analytics and statistics to the language, and over the years it has led to increasingly sophisticated functions and methodologies.

Because of this, Python has become a popular choice for applications using the most cutting-edge techniques, such as machine learning and natural language processing. However, if you’re only interested in more traditional analytical and statistical computing, you may find that R offers a more complete and integrated environment than Python.


Julia

Julia targets the same large-data workloads as Python and R, but it was built from the start for scalability and speed of operation. It was designed with a “best of all worlds” ethos: the idea was that it would combine the strengths of other popular analytics-oriented programming languages. One key influence was the widely used numerical computing language MATLAB, with which it shares much of its syntax.

Julia has features built into the core language, such as parallelization, that make it particularly suitable for working with the real-time streams of Big Data that industry wants to leverage these days, including for in-database analytics. The fact that code written in Julia executes very quickly adds to its suitability here.
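To make the idea of parallelization concrete, here is a minimal sketch of the general pattern: split the data into chunks, hand each chunk to a worker, then combine the partial results. It uses Python's standard concurrent.futures module, not Julia's own facilities, and the names chunk_sum, parallel_sum and readings are hypothetical. Note that Python threads do not run CPU-bound code simultaneously, so this shows the pattern rather than a real speedup; swapping in ProcessPoolExecutor would be the usual next step.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """Sum one slice of the data; each worker handles one slice."""
    return sum(chunk)

def parallel_sum(values, workers=4):
    """Split values into roughly equal chunks and sum them with a worker pool."""
    size = max(1, -(-len(values) // workers))  # ceiling division, at least 1
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_sum, chunks))
    return sum(partials)  # combine the partial results

readings = list(range(1, 101))  # hypothetical data: the integers 1 to 100
total = parallel_sum(readings)
print(total)  # 5050
```

The appeal of a language with parallelism built in is that this split, compute and combine bookkeeping needs far less hand-written code.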

Its ecosystem of extensions and libraries is not as mature or developed as those of the more established languages. It is getting there, however: most popular functions are available, with more emerging at a steady rate.


The Right Tool for the Job

From a general perspective, it may seem that R would be the natural choice for running large numbers of calculations against big-volume datasets, Python would be the go-to for advanced analytics involving AI or ML, and Julia a natural fit for projects involving in-database analytics on real-time streams.


All three languages are living projects that are constantly evolving and gaining new capabilities. Each has its strengths and weaknesses, but all are robust choices for enterprise initiatives involving Big Data and analytics.

/GW