Eskapade: Modular Analytics
Date: Feb 28, 2018
Web page: http://eskapade.kave.io
Code reference: API Documentation
Issues & Ideas: https://github.com/kaveio/eskapade/issues
Q&A Support: contact us at: kave [at] kpmg [dot] com
Eskapade is a lightweight, Python-based data analysis framework, meant for modularizing all sorts of data analysis problems.
In particular, Eskapade can be used as a self-learning framework for typical machine learning problems. Trained algorithms can make predictions on real-time or batch data, the resulting models can be evaluated over time, and Eskapade can bookkeep and retrain them.
Eskapade uses a modular approach to analytics, meaning that you can use pre-made operations (called ‘links’) to build an analysis. This is implemented in a chain-link framework, where you define a ‘Chain’, consisting of a number of Links. These links are the fundamental building block of your analysis. For example, a data loading link and a data transformation link will frequently be found together in a pre-processing Chain.
Each chain has a specific purpose, for example: data quality checks of incoming data, pre-processing of data, booking and/or training of predictive algorithms, validation of these algorithms, and their evaluation. By using this work methodology, analysis links can be more easily reused in future data analysis projects.
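The chain-link idea described above can be illustrated with a minimal, self-contained sketch. The `Link`, `Chain`, `ReadData`, and `DropMissing` classes below are hypothetical and are not Eskapade's API; they only show how small links that share a common data store can be composed into a pre-processing chain.

```python
class Link:
    """Hypothetical base class: one reusable analysis step."""

    def __init__(self, name):
        self.name = name

    def execute(self, datastore):
        raise NotImplementedError


class ReadData(Link):
    """Loads records into the shared data store (here from memory)."""

    def __init__(self, records, store_key):
        super().__init__("read_data")
        self.records = records
        self.store_key = store_key

    def execute(self, datastore):
        datastore[self.store_key] = list(self.records)


class DropMissing(Link):
    """Removes records that contain a None value."""

    def __init__(self, read_key, store_key):
        super().__init__("drop_missing")
        self.read_key = read_key
        self.store_key = store_key

    def execute(self, datastore):
        rows = datastore[self.read_key]
        datastore[self.store_key] = [
            row for row in rows if None not in row.values()
        ]


class Chain:
    """Runs its links in order on one shared data store."""

    def __init__(self, name):
        self.name = name
        self.links = []

    def add(self, link):
        self.links.append(link)
        return self

    def run(self, datastore):
        for link in self.links:
            link.execute(datastore)
        return datastore


# A pre-processing chain: a data loading link followed by a transformation link.
records = [{"x": 1, "y": 2.0}, {"x": 2, "y": None}, {"x": 3, "y": 4.5}]
preprocessing = Chain("preprocessing")
preprocessing.add(ReadData(records, store_key="raw"))
preprocessing.add(DropMissing(read_key="raw", store_key="clean"))
datastore = preprocessing.run({})
print(len(datastore["raw"]), len(datastore["clean"]))
```

Because each link only reads from and writes to the data store, the same `DropMissing` link could be reused unchanged in a different chain or project.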
Eskapade is analysis-library agnostic. It can be used to set up typical data analysis problems with multiple packages, e.g.: scikit-learn, Spark MLlib, and ROOT. Likewise, Eskapade can use a number of different data structures to handle data, such as: pandas DataFrames, numpy arrays, Spark DataFrames/RDDs, and more.
Version 0.7 of Eskapade (February 2018) contains several major updates:
The Eskapade code has been made pip friendly. One can now simply do:
$ pip install Eskapade
or check out the code from our GitHub repository:
$ git clone git@github.com:KaveIO/Eskapade.git
$ pip install -e Eskapade/
where in this example the code is installed in editable mode (option -e).
You can now use Eskapade in Python with:
import eskapade
This change has resulted in some restructuring of the Python directories, making the overall structure more transparent: all Python code, including the tutorials, now falls under the (single) python/ directory. Additionally, thanks to the pip convention, our prior dependence on environment variables ($ESKAPADE) has now been fully stripped out of the code.
There has been a cleanup of the core code, removing obsolete code and making it easier to maintain. This has resulted in a (small) change in the API for the process manager, for adding chains, and for using the logger. All tutorials and example macro files have been updated accordingly. See the migration section for detailed tips on migrating existing Eskapade code to version 0.7.
All Eskapade commands now start with the prefix eskapade_. All tutorials have been updated accordingly. We have the commands:
- eskapade_bootstrap, for creating a new Eskapade analysis project. See this new tutorial for all the details.
- eskapade_run, for running the Eskapade macros.
- eskapade_trial, for running the Eskapade unit and integration tests.
- eskapade_generate_link, eskapade_generate_macro, and eskapade_generate_notebook, for generating a new link, macro, or Jupyter notebook, respectively.
The primary feature of version 0.6 (August 2017) is the inclusion of Spark, but this version also includes several other new features and analyses.
We include multiple Spark links and 10 Spark examples on:
- The configuration of Spark, reading, writing and converting Spark dataframes, applying functions and queries to dataframes, filling histograms and (very useful!) applying arbitrary functions (e.g. pandas) to groupby calls.
In addition, we have added:
- A ROOT analysis for studying and quantifying correlations between sets of (non-)categorical observables. This is useful for finding outliers in arbitrary datasets (e.g. surveys), and we include a tutorial on how to do this.
- A ROOT analysis on predictive maintenance that decomposes a distribution of time differences between malfunctions by fitting it with multiple Weibull distributions.
- New flexible features to create and chain analysis reports with several analysis and visualization links.
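To make the predictive-maintenance item above more concrete, the sketch below evaluates a mixture of Weibull densities, which is the kind of model such a decomposition fits. It is not Eskapade's ROOT-based implementation and performs no fit; the function names and the component weights, shapes, and scales are hypothetical values chosen only for illustration.

```python
import math


def weibull_pdf(t, shape, scale):
    """Weibull probability density; zero for non-positive t."""
    if t <= 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def mixture_pdf(t, components):
    """Weighted sum of Weibull densities; the weights should sum to 1."""
    return sum(w * weibull_pdf(t, k, lam) for w, k, lam in components)


# Hypothetical components as (weight, shape, scale):
# a short-lived failure mode and a longer-lived wear-out mode.
components = [(0.3, 1.0, 5.0), (0.7, 3.0, 50.0)]

# Midpoint-rule integration over [0, 200] as a normalisation check.
step = 0.01
total = sum(
    mixture_pdf((i + 0.5) * step, components) * step for i in range(20000)
)
print(round(total, 3))
```

In a real analysis the weights, shapes, and scales would be free parameters determined by fitting the observed time differences between malfunctions.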
Our 0.5 release (May 2017) contains multiple new features, in particular:
- Support for ROOT, including multiple data analysis, fitting and simulation examples using RooFit.
- Histogram conversion and filling support, using ROOT, numpy, Histogrammar and Eskapade-internal histograms.
- Automated data-quality fixes for buggy columns in datasets, including data type fixing and NaN conversion.
- New visualization utilities, e.g. plotting multiple types of (non-linear) correlation matrices and dendrograms.
- And most importantly, many new and interesting example macros illustrating the new features above!
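As an illustration of what data type fixing and NaN conversion can mean for a buggy column, here is a small pure-Python sketch. The `fix_numeric_column` function and the `MISSING_MARKERS` set are hypothetical and do not reflect how Eskapade's own links are implemented.

```python
import math

# Hypothetical set of strings treated as "missing".
MISSING_MARKERS = {"", "nan", "none", "null", "n/a", "-"}


def fix_numeric_column(values):
    """Convert a mixed-type column to floats, using NaN for bad entries."""
    fixed = []
    for value in values:
        text = str(value).strip()
        if value is None or text.lower() in MISSING_MARKERS:
            fixed.append(math.nan)
            continue
        try:
            fixed.append(float(text))
        except ValueError:
            # Anything that cannot be parsed as a number becomes NaN.
            fixed.append(math.nan)
    return fixed


raw = ["1", 2, " 3.5 ", "N/A", None, "abc"]
clean = fix_numeric_column(raw)
print(clean)
```

After such a pass every entry in the column has the same type, so downstream links can rely on a consistent representation of missing values.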
In our 0.4 release (Feb 2017) we are releasing the core code to run the framework. It is written in Python 3. Anyone can use this to learn Eskapade, build data analyses with the link-chain methodology, and start experiencing its advantages.
The focus of the provided documentation is on constructing a data analysis setup in Eskapade. Machine learning interfaces will be included in an upcoming release.
- Command Line Arguments
- Package structure
- Developing and Contributing