A recent survey identifies lack of skills and lack of understanding of technology as the primary barriers to analytics: over 52% of the respondents cited lack of skills, while 33% cited technology as the challenge.
This is not a new finding — and highlights the usability gap that the available analytics platforms have not been able to bridge, even years after advanced analytics came into the limelight.
In this post, we take a step back, and analyze what needs to be done. Our approach will be targeted towards users who know the rudiments of business analytics, and would rather focus on the analytics tasks than care for what lies under the hood.
Simplification by Separation of Concerns
What exactly are we looking to simplify? Using advanced analytics platforms today needs technical skills in the following areas:
- Machine Learning: Expertise in the statistical, mathematical and algorithmic aspects of analytics. This is a deep science, and requires years of mathematical training to build the insights.
- Software and Systems: Expertise in the systems and operational aspects of analytics; when dealing with large amounts of data, this includes expertise in the “big-data stack”. This is deep engineering, and requires mastery over building systems that work correctly, reliably, and as efficiently as possible.
As we can expect, individuals with high competence in either of these skills are not easy to find — and those with high competence in both are far rarer.
Moreover, it is not as if expertise in just one of the skills, say machine learning, makes things much easier. The following comment shows that software engineering is more than simple coding skills, just like machine learning is more than simple arithmetic.
I am a data scientist/analyst, and my day to day is entirely in python/scikit-learn/pandas, data munging and running models. Right now my code is several hundred lines of data processing steps, filtering, lots and lots of joins and sql queries, pickle dumps and loads, print array.shape. […] Long story short, I have a physics background and was never taught how to properly structure my workflow for this type of coding. elliott34, Hacker News, 3 Dec 2014
Our approach in this post is to enable separation of concerns — that is, divide the role of a “data scientist” into an analyst and a systems engineer, and provide a framework that enables them to work together. In doing so, we reduce the technical requirements for the analyst to the extent possible.
Sounds impractical? Actually, this has been done in the past, and with great success, in the context of data management. To understand how, let us quickly review the evolution of database management systems.
Where have we seen that before?
At its inception, in the 1960s, data management was the domain of systems engineers, out of reach of the intended users: the business analysts.
Several database products did indeed exist at that time; however, they were without exception ad hoc, cumbersome, and difficult to use—they could really only be used by people having highly specialized technical skills—and they rested on no solid theoretical foundation. E. F. Codd’s biography, by C. J. Date
Clearly, data management was in the same state of affairs as advanced analytics is in now.
The need to simplify these systems motivated several efforts, culminating in E. F. Codd’s proposal of the relational model.
The relational model enabled queries over the data to be structured as a dataflow, using the relational algebra. The brilliance of relational algebra was in identifying a small number of primitives (relational operators — SELECT, PROJECT, JOIN, etc.) such that the majority of queries could be expressed as a composition of these primitives applied on the input data.
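To make the idea of composition from a small set of primitives concrete, here is a minimal sketch of three relational operators over lists of dicts. The tables and column names are made up for illustration; they are not from the post.

```python
# A minimal sketch of relational-algebra primitives over lists of dicts.

def select(rows, predicate):
    """SELECT: keep the rows satisfying the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """PROJECT: keep only the given columns."""
    return [{c: r[c] for c in columns} for r in rows]

def join(left, right, on):
    """JOIN: combine row pairs that agree on the join column."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]

# A query is then just a composition of these primitives on input tables.
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "total": 250}, {"id": 2, "total": 40}]

big_spenders = project(
    select(join(customers, orders, on="id"), lambda r: r["total"] > 100),
    ["name"],
)
```

The point is exactly the one Codd made: a handful of composable operators covers the majority of ad-hoc queries.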
This simplified ad-hoc data access dramatically, as it separated the “logical” specification from the “physical” evaluation.
- The focus of the “logical” specification was to capture the user requirement in terms of well-formed queries. As mentioned earlier, these queries were expressed in terms of the “high-level” relational operators, and were separated from the concerns of evaluation. Soon enough, accessible languages (prominently, SQL) and friendly graphical interfaces were developed to further simplify the creation of these queries. These interfaces were readily picked up by business analysts.
- The focus of the “physical” evaluation was to evaluate the logical specification as efficiently as possible. The systems engineers contributed to this layer. They provided efficient implementation of the relational operators in the database system. Also, in their role as “database administrators”, they ensured that the system worked efficiently and reliably, and also tuned the data layout for efficient evaluation.
The value of this logical-physical separation is apparent in the success relational databases have enjoyed over the years.
Making logical-physical separation work for analytics
So, how can we achieve a similar separation of concerns in advanced analytics? By building an algebra for advanced analytics, and using it to separate the “logical” analytics specification from the “physical” analytics evaluation, including computation on the big-data stack.
Initially, let us assume that the data to be analyzed is stored in a relational database (we will relax this constraint later). Then, it makes sense to develop the analytics algebra as an extension of the relational algebra. This implies that the analytics operators should take tables as input, and emit tables as output — just like relational operators.
We start by incorporating analytics constructs, such as classifiers, as first class objects — at par with tables and views — and provide an interface to create such objects.
Recall that when creating a table in a relational database, the user does not think much about whether the table will be stored as a “B+ Tree” or a “Heap File”. The user simply states the table’s properties — the set of attributes, primary/foreign keys and other constraints, and optionally provides the query which will be used to populate the table.
Likewise, for the purpose of the specification, the analytics objects are abstract; the details of the underlying structures and statistical model are implementation details, and should not be the user’s concern. The user should only need to provide the configuration parameters and the query that emits the training data, and the system should train the object’s model on that data.
For instance, when creating a classifier, the user provides the query for the training data, and parameters identifying the target and feature columns in the output of the query.
Next, we define analytics operators that “apply” these objects on new data points. For instance, after a classifier is built, it is used to assign class labels to new data points (input rows) — this can be captured as a relational operator that augments the input rows with a new column containing the assigned class label.
Similarly, we can incorporate “clusterer” objects, and define operators that assign clusters to input rows. For text analytics, we can have “entity extractors” as objects, and define operators that extract entities from text-valued columns in the input rows. And so on.
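As a sketch of one such operator, here is an "entity extractor" applied to a text-valued column, emitting one output row per extracted entity. The regex-based extractor is a toy stand-in, not a real entity extractor; the function and column names are assumptions for illustration.

```python
import re

def extract_entities(rows, text_column, extractor, entity_column):
    """Operator: for each input row, emit one row per entity found in the text."""
    out = []
    for row in rows:
        for entity in extractor(row[text_column]):
            # Augment the input row with the extracted entity.
            out.append({**row, entity_column: entity})
    return out

# Toy extractor: treat capitalized words as "entities" (a stand-in only).
def capitalized_words(text):
    return re.findall(r"\b[A-Z][a-z]+\b", text)

docs = [{"id": 1, "body": "Ada met Bob in Paris"}]
result = extract_entities(docs, "body", capitalized_words, "entity")
```

Note that the operator keeps the relational contract of the post: tables (lists of rows) in, tables out.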
As with database systems, the users need not use the proposed analytics interface directly. We can extend standard SQL with statements that parse to the analytics tasks (e.g. creating a classifier), and clauses that parse to operators (e.g. assigning class labels using the classifiers).
For example, in Sclera, the following statement trains a classifier myclassifier for identifying prospects using a survey on customers:
create classifier myclassifier(isinterested) using
select survey.isinterested, customers.location, customers.salary
from survey join customers on (survey.custid = customers.id);
isinterested is specified as the classifier’s target column, and the remaining columns, location and salary, become the features. The following query then uses myclassifier to identify prospects among target customers, putting the prediction in the column isprospect:
select email, name, isprospect
from (targets classified with myclassifier(isprospect));
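For contrast, the same workflow written by hand in pandas and scikit-learn (the tools named in the quote earlier) shows what the declarative statements abstract away. The sample data is made up, and the one-hot encoding of location is a detail Sclera’s physical layer would handle internally.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-ins for the survey and customers tables.
survey = pd.DataFrame({"custid": [1, 2, 3, 4],
                       "isinterested": [1, 0, 1, 0]})
customers = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                          "location": ["NY", "SF", "NY", "SF", "NY"],
                          "salary": [90000, 30000, 85000, 35000, 88000]})

# "create classifier ... using <training query>": join, encode, fit.
training = survey.merge(customers, left_on="custid", right_on="id")
features = pd.get_dummies(training[["location", "salary"]])
myclassifier = DecisionTreeClassifier().fit(features, training["isinterested"])

# "targets classified with myclassifier(isprospect)": augment target rows
# with the predicted column.
targets = customers[customers["id"] == 5].copy()
target_features = pd.get_dummies(targets[["location", "salary"]])
target_features = target_features.reindex(columns=features.columns, fill_value=0)
targets["isprospect"] = myclassifier.predict(target_features)
```

Every line here is a "physical" concern: the join plumbing, the feature encoding, the column alignment. The declarative statements above leave all of it to the system.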
Alternatively, as with database systems, we can build graphical user interfaces to create the tasks and queries interactively.
The physical evaluation provides concrete implementations for the specification abstractions — that is, the analytics objects and operators.
Let us consider the analytics objects first. How do we choose an implementation for, say, a classifier? There are a number of alternatives — decision trees, naive Bayes, and so on.
Ideally, given the configuration parameters and available data descriptions, the system should automatically identify which analytics implementation to use — but since this is a tough call, a more pragmatic approach is to have a default implementation, and provide interface parameters that enable the user to override this default.
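This default-with-override pattern can be sketched as a small factory. The registry names and the "implementation" parameter are assumptions for illustration; the decision-tree default follows the Sclera example below.

```python
# Registry of available physical implementations (names are illustrative).
CLASSIFIER_IMPLS = {
    "decision_tree": lambda: "DecisionTreeModel",
    "naive_bayes": lambda: "NaiveBayesModel",
}

def create_classifier(config):
    """Pick the implementation named in the config, falling back to a default."""
    impl = config.get("implementation", "decision_tree")
    if impl not in CLASSIFIER_IMPLS:
        raise ValueError(f"unknown classifier implementation: {impl}")
    return CLASSIFIER_IMPLS[impl]()
```

The user who says nothing gets the default; the user who knows better overrides it with one parameter.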
Continuing our classifier example, the following figure shows the creation and training of the classifier using Sclera. Since the specification does not include an override, the classifier implementation is taken to be a decision tree.
The analytics object implementations can be built from scratch, or taken from an off-the-shelf analytics library such as Weka or Apache Spark / MLlib, or even wrap over a cloud service such as Google Prediction API.
The analytics operations are evaluated using the methods provided by the object implementations. The evaluation may involve transforming the input data to the structure required by these methods, and transforming the result so that it can be included in the operator’s output.
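A sketch of this transformation step: the operator converts input rows to the matrix format an off-the-shelf library expects, invokes its prediction method, and folds the result back into relational output. The library interface assumed here (a predict method over a list of feature vectors) is modeled on scikit-learn-style APIs; the stand-in model is a toy.

```python
def apply_classifier(rows, model, feature_cols, label_col):
    # Transform input rows into the structure the library method expects.
    matrix = [[row[c] for c in feature_cols] for row in rows]
    labels = model.predict(matrix)
    # Transform the result back into the operator's relational output.
    return [{**row, label_col: label} for row, label in zip(rows, labels)]

class ThresholdModel:
    """Toy stand-in for a library model: label 1 when the first feature > 50."""
    def predict(self, matrix):
        return [1 if vec[0] > 50 else 0 for vec in matrix]

rows = [{"salary": 80, "loc": "NY"}, {"salary": 30, "loc": "SF"}]
out = apply_classifier(rows, ThresholdModel(), ["salary"], "isprospect")
```

Because the operator only touches the model through predict, the model behind it can be swapped without changing the operator.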
The figure below illustrates the application of the trained classifier myclassifier in our example. The “classify” operator gets translated to myclassifier’s decision tree-based classification function. For each row of the input table targets containing target customers, this function uses the values in the feature columns location and salary to compute a value denoting the customer’s potential interest, which is put in the new column isprospect.
Separating analytics specification from evaluation gives us the choice to arbitrarily use one or more analytics engines — developed in-house, off-the-shelf, a cloud-based web-service, or a combination thereof — as implementations of the analytics operators. As long as the specification is not changed, the evaluation can also switch across the backends without affecting the application. We call this analytics virtualization.
The same holds for relational queries and statements. As long as the relational interface is maintained, the evaluation can be pushed across to one or more data platforms, relational or non-relational, without affecting the application. This is data virtualization, and is implemented by building drivers that enable the underlying data to be accessed as tables, and that evaluate the relational operators and statements on these data platforms.
With data virtualization support, we can now remove our assumption that the data being analyzed is stored in a relational database. Since the data can be accessed using a virtual relational interface, it can actually reside across relational databases (e.g. MySQL, PostgreSQL), non-SQL databases (e.g. MongoDB, Apache HBase), HDFS, the local file system, or even a web service.
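The driver idea can be sketched as a single scan interface that every source implements, so the engine evaluates relational operators without knowing where the rows came from. The driver classes and data below are illustrative assumptions, not Sclera’s actual driver API.

```python
from abc import ABC, abstractmethod

class TableDriver(ABC):
    @abstractmethod
    def scan(self):
        """Yield the underlying data as rows (dicts), whatever the source."""

class CsvDriver(TableDriver):
    """Exposes CSV text lines as a table."""
    def __init__(self, lines):
        self.lines = lines
    def scan(self):
        header = self.lines[0].split(",")
        for line in self.lines[1:]:
            yield dict(zip(header, line.split(",")))

class InMemoryDriver(TableDriver):
    """Exposes an in-memory collection as a table."""
    def __init__(self, rows):
        self.rows = rows
    def scan(self):
        yield from self.rows

# The engine sees only scan(); a query can join across heterogeneous sources.
csv = CsvDriver(["id,name", "1,Ada", "2,Bob"])
mem = InMemoryDriver([{"id": "1", "total": 250}])
joined = [{**l, **r} for l in csv.scan() for r in mem.scan() if l["id"] == r["id"]]
```

Adding a new backend means writing one more driver; the queries above it do not change.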
In this post, we outlined a principled, declarative approach towards building easier-to-use analytics platforms. The idea is to enable logical-physical separation, which has worked very well in the past in the context of database systems.
Several vendors in the past have attempted extending SQL with analytics capabilities, including Oracle and, more recently, Metanautix. In Oracle’s solution, model creation is not integrated with SQL, and needs to be done in a PL/SQL routine. Metanautix’s approach is primarily imperative, in contrast to the declarative approach presented above.
The advantages of the declarative approach over these imperative alternatives will be the topic of another post.
Meanwhile, comments are welcome!