Sclera is now Open-Source!

Sclera is now open-source, with the code available on GitHub under the Apache License, Version 2.0. It is also easier to install and maintain, thanks to a brand new command-line administration tool, scleradmin.

Making the source open gave us an opportunity to carefully revisit the code, and in the process make it leaner and better structured. We also removed some plugins based on what is now legacy software, such as Apache Mahout and Apache Pig over Apache HBase. The latter, at some point, will be replaced by a far more general plugin based on Apache Drill.

Now that the code is open, do have a peek inside and see if anything in there could be useful to what you are working on. Among other things, you will find:

The code is a mix of object-oriented and functional programming. There is a significant bias towards idiomatic functional programming (map, fold, etc.), but a fair number of stateful constructs (prominently, iterators) are used as well for the sake of efficiency.
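To illustrate the mix described above, here is a minimal sketch in Python. It is not taken from the Sclera codebase; the function names and the sample data are hypothetical. A lazy iterator, the stateful part, feeds rows one at a time into a fold, the functional part, so the full list of parsed rows is never held in memory.

```python
from functools import reduce
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, float]


def parse_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Stateful, lazy part: yield one parsed row at a time."""
    for line in lines:
        name, value = line.split(",")
        yield name.strip(), float(value)


def total_by_name(rows: Iterable[Row]) -> Dict[str, float]:
    """Functional part: fold the rows into per-name totals."""
    def step(acc: Dict[str, float], row: Row) -> Dict[str, float]:
        name, value = row
        # Return a new dict rather than mutating the accumulator.
        return {**acc, name: acc.get(name, 0.0) + value}

    return reduce(step, rows, {})


# Hypothetical sample input, for illustration only.
lines = ["a, 1.5", "b, 2.0", "a, 0.5"]
totals = total_by_name(parse_rows(lines))
print(totals)
```

The fold keeps the aggregation logic free of side effects, while the generator avoids building an intermediate list. This is the kind of efficiency trade-off the paragraph above refers to.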

Bug reports and suggestions are appreciated. Contributions are welcome, especially useful and innovative plugins.

We hope you have as much fun using Sclera as we had building it!

The Rise of the Declarative

There is a growing trend to make analytics application development more declarative. Prominent examples, focused on accessing and transforming data, are dplyr for R, and Ibis for Python.

This is a move forward from the imperative way of developing analytics applications. The idea is to make the code more composable, reusable, and easier to write.

However, these libraries do not fit well with the imperative nature of the underlying language: developers need to switch to declarative thinking when using these libraries, and switch back to imperative thinking for other parts of the code. This is not ideal.

Rather than graft declarative constructs onto an imperative language, Sclera extends SQL.

SQL has been used for declarative data wrangling and transformation for decades, and is familiar to almost everybody who has ever worked in data management or business intelligence. A recent survey shows that SQL continues to be tremendously popular among developers.

Sclera’s scripting language, ScleraSQL, provides SQL extensions that help express complex analytics tasks in tens of lines instead of hundreds.

ScleraSQL includes extensions for streaming data access, data transformation, data cleaning, machine learning, and pattern matching, as well as “Grammar of Graphics” constructs for declarative visualization, similar to R’s ggplot2.

Data analytics is an extension of business intelligence. It makes sense, therefore, that your analytics language of choice is an extension of SQL.

Hiring Data Scientists – Why Compromise?

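A popular definition describes a data scientist as someone who is better at statistics than any software engineer, and better at software engineering than any statistician.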

According to this definition, a data scientist is not necessarily the best statistician, and not necessarily the best software engineer.

Hiring such data scientists is clearly a compromise. The reason behind the compromise is that a data scientist needs to juggle both software engineering and statistics – so being great at one of these but not at the other might not work out for the best.

It does not have to be that way.

If I were building a data science team, I would rather hire the best statisticians and the best software engineers, and have them work together.

How can they work together?

One way is to separate the analytics logic (the “what”) and the engineering aspects (the “how”). The statisticians can then work on the analytics logic, while the engineers work on the engineering.

In Sclera, this is facilitated by high-level building blocks for data access, data transformation, data cleaning, machine learning, pattern matching, and visualization.

Statisticians specify the analytics logic by building a pipeline of these building blocks, while the engineers provide implementations of these building blocks.
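The separation of the "what" from the "how" can be sketched in a few lines of Python. This is an illustration of the idea only, not Sclera's SDK or ScleraSQL; the registry, the block names, and the data are all hypothetical. Engineers register implementations of building blocks, and the statistician's pipeline is a list of block names with parameters.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, Optional[float]]

# Hypothetical registry mapping block names to implementations (the "how").
REGISTRY: Dict[str, Callable[..., Iterator[Row]]] = {}


def block(name: str) -> Callable:
    """Decorator that engineers use to register a building block."""
    def register(fn: Callable[..., Iterator[Row]]) -> Callable[..., Iterator[Row]]:
        REGISTRY[name] = fn
        return fn
    return register


@block("drop_missing")
def drop_missing(rows: Iterable[Row], column: str) -> Iterator[Row]:
    # Data cleaning: skip rows where the column has no value.
    return (r for r in rows if r.get(column) is not None)


@block("scale")
def scale(rows: Iterable[Row], column: str, factor: float) -> Iterator[Row]:
    # Data transformation: multiply the column by a constant factor.
    return ({**r, column: r[column] * factor} for r in rows)


def run_pipeline(spec: List[Tuple[str, Dict[str, Any]]], rows: Iterable[Row]) -> List[Row]:
    """Apply each named block in order; the spec never mentions implementations."""
    for name, params in spec:
        rows = REGISTRY[name](rows, **params)
    return list(rows)


# The "what": a declarative pipeline written by the statistician.
spec = [
    ("drop_missing", {"column": "x"}),
    ("scale", {"column": "x", "factor": 10}),
]
data = [{"x": 1.0}, {"x": None}, {"x": 2.5}]
result = run_pipeline(spec, data)
print(result)
```

Because the pipeline refers to blocks only by name, an engineer can replace the body of `scale` with a faster implementation without the statistician changing a line of the specification.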

From the statistician’s point of view, this results in greater productivity. The high-level analytics specification is only a few lines of ScleraSQL code – easy to write, and easy to modify for iterative experimentation. Sclera optimizes the code automatically, ensuring the best performance on the available resources.

From the engineer’s point of view, the problem is well-defined – build the most efficient implementation of a building block. The semantics are clear, so there are no distractions from ever-changing specifications, and the code is reused in a structured manner across multiple applications.

Sclera comes with a number of pre-packaged building blocks, and an SDK which can be used to write additional building blocks.

Sclera helps you get the best out of your statisticians, and the best out of your engineers. So why compromise on the hiring?