Tutorials

This section contains materials on how to use Eskapade. There are additional side notes on how certain aspects work and where to find parts of the code. For more in-depth explanations of the functionality of the code base, try the API-docs.

Running your first macro

After successfully installing Eskapade, it is now time to run your very first macro, the classic code example: Hello World!

Hello World!

If you just want to run it plain and simple, go to the root of the repository and run the following:

$ source setup.sh
$ run_eskapade.py ./tutorials/esk101_helloworld.py

This will run the macro that prints out Hello World. There is a lot of output, but try to find these lines (or similar):

2017-02-27 20:32:19,826 INFO [hello_world/execute]: Hello World
2017-02-27 20:32:19,828 INFO [hello_world/execute]: Hello World

Congratulations, you have just successfully run Eskapade!

Internal workings

To see what is actually happening under the hood, go ahead and open up /tutorials/esk101_helloworld.py. The macro is like a recipe and it contains all of your analysis. It has all the 'high level' operations that are to be executed by Eskapade.

When we go into this macro we find the following piece of code:

link = core_ops.HelloWorld(name='HelloWorld')
link.set_log_level(logging.DEBUG)
link.repeat = settings['n_repeat']
ch.add_link(link)

This is the code that does the actual analysis (in this case, printing the statement). Here link is an instance of the class HelloWorld, which itself is a Link. The Link class is the fundamental building block in Eskapade that contains our analysis steps. The code for HelloWorld can be found at:

$ less $ESKAPADE/python/eskapade/core_ops/links/hello_world.py

Looking into this class, we find the following line in the execute() function:

self.log().info('Hello {0}'.format(self.hello))

where self.hello is a parameter set in the __init__ of the class. This setting can be overwritten as can be seen below. For example, we can make another link, link2 and change the default self.hello into something else.

link2 = core_ops.HelloWorld(name='Hello2')
link2.hello = 'Lionel Richie'
ch.add_link(link2)

Rerunning results in us greeting the famous singer/songwriter.

There are many ways to run your macro and control the flow of your analysis. You can read more on this in the Short introduction to the Framework subsection below.

Tutorial 1: transforming data

Now that we know the basics of Eskapade we can go on to more advanced macros, containing an actual analysis.

Before we get started, we have to fetch some data, on your command line, type:

$ wget -P $ESKAPADE/data/ https://s3-eu-west-1.amazonaws.com/kpmg-eskapade-share/data/LAozone.data

To run the macro type on your CLI:

$ run_eskapade.py tutorials/tutorial_1.py

If you want to add command line arguments, for example to change the output logging level, read the page on command line arguments.

When looking at the output in the terminal we read something like the following:

* * * Welcome to Eskapade * * *
...
2017-02-10 15:24:35,968 INFO [processManager/Print]: Number of chains:    2
...
* * * Leaving Eskapade. Bye! * * *

There is a lot more output than these lines (tens or hundreds of lines, depending on the log level). Eskapade has run the code from each link, and at the top of the output in your terminal you can see a summary.

When you look at the output in the terminal you can see that the macro contains two chains, and that a few Links are contained in these chains. Note that chain 2 is empty at this moment. In the code of the macro we see that in the first chain the data is loaded first, and then a transformation is applied to this data.

Before we are going to change the code in the macro, there will be a short introduction to the framework.

Short introduction to the Framework

At this point we will not go into the structure of the code underneath the macro; we will do that later in this tutorial. For now we will take a look at the macro itself. Open tutorials/tutorial_1.py in your favorite editor. Notice the structure: first the imports, then the definition of all the settings, and finally the actual analysis: Chains and Links. Two chains are added to the macro. A chain is added with the following line:

proc_mgr.add_chain('Data')

This chain called Data is added to the ProcessManager, which is the object that runs the entire macro. Then the chain is fetched by:

proc_mgr.get_chain('Data')

and a Link is added. First the link is initialized (links are classes) and its properties are set, and finally it is inserted into the chain:

reader = analysis.ReadToDf(name='Read_LA_ozone', path=DATA_FILE_PATH, reader=pd.read_csv, key='data')
proc_mgr.get_chain('Data').add_link(reader)

This means the Link is added to the chain and when Eskapade runs, it will execute the code in the Link.
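
To make the roles of the process manager, chains and links concrete, here is a minimal pure-Python sketch of the pattern. The classes ToyLink, ToyChain and ToyProcessManager are hypothetical stand-ins written for this illustration. They are not Eskapade's actual classes, and they leave out initialization, finalization, logging and the DataStore service.

```python
class ToyLink:
    """Hypothetical stand-in for a link: one named analysis step."""

    def __init__(self, name):
        self.name = name

    def execute(self, store):
        # A real link would read from and write to the data store here.
        store.setdefault('executed', []).append(self.name)


class ToyChain:
    """Hypothetical stand-in for a chain: an ordered list of links."""

    def __init__(self, name):
        self.name = name
        self.links = []

    def add_link(self, link):
        self.links.append(link)


class ToyProcessManager:
    """Hypothetical stand-in for the object that runs all chains in order."""

    def __init__(self):
        self.chains = []

    def add_chain(self, name):
        chain = ToyChain(name)
        self.chains.append(chain)
        return chain

    def get_chain(self, name):
        for chain in self.chains:
            if chain.name == name:
                return chain
        raise KeyError(name)

    def run(self):
        # Execute every link of every chain, in the order they were added.
        store = {}
        for chain in self.chains:
            for link in chain.links:
                link.execute(store)
        return store


toy_mgr = ToyProcessManager()
toy_mgr.add_chain('Data')
toy_mgr.get_chain('Data').add_link(ToyLink('Read_LA_ozone'))
toy_mgr.get_chain('Data').add_link(ToyLink('Transform'))
toy_result = toy_mgr.run()
print(toy_result['executed'])
```

The point of the design is that the macro only declares what should run and in which order; the manager decides when each step is executed.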

Now that we know how the framework runs the code on a higher level, we will continue with the macro.

In the macro notice that under the second chain some code has been commented out. Uncomment the code and run the macro again with:

$ run_eskapade.py tutorials/tutorial_1.py

Notice that it takes a bit longer to run and that the output is longer, since it now executes the Link in chain 2. This Link takes the data from chain 1, makes plots of the variables in the data set and saves them to disk. Go to this path and open one of the pdfs found there:

$ results/Tutorial_1/data/v0/report/

The pdfs give an overview of all numerical variables in the data in histogram form. The binning, plotting and saving of this data is all done by the chain we just uncommented. If you want to take a look at how the Link works, it can be found in:

$ python/eskapade/visualization/links/df_summary.py

But for now, we will skip the underlying functionality of the links.

Let's do an exercise. Going back to the first link, we notice that the transformations that are executed are defined in conv_funcs passed to the link. We want to include in the plot the wind speed in km/h. There is already a part of the code available in the conv_funcs and the functions comp_date and mi_to_km. Use these functions as examples to write a function that converts the wind speed.

Add this to the transformation by adding your own code. Once this works you can also try to add the temperature in degrees Celsius.
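
As a starting point for the exercise, the two conversions could look like the sketch below. It assumes that the wind speed in the data set is given in miles per hour and the temperature in degrees Fahrenheit; check the description of the data before relying on this. The function names mph_to_kmh and fahrenheit_to_celsius are made up for this example, and the way they are wired into conv_funcs should follow the existing entries in the macro.

```python
KM_PER_MILE = 1.609344  # length of the international mile in kilometres


def mph_to_kmh(speed_mph):
    """Convert a speed from miles per hour to kilometres per hour."""
    return speed_mph * KM_PER_MILE


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


print(mph_to_kmh(10.0))
print(fahrenheit_to_celsius(212.0))
```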

Tutorial 3: Jupyter notebook

This section contains materials on how to use Eskapade in Jupyter Notebooks. There are additional side notes on how certain aspects work and where to find parts of the code. For more in depth explanations, try the API-docs.

Next we will demonstrate how Eskapade can be run and debugged interactively from within a jupyter notebook. Do not forget to set up the environment before starting the notebook (and in case you use a virtual environment activate it):

$ cd $ESKAPADE
$ source setup.sh
$ cd some-working-dir
$ jupyter notebook

An Eskapade notebook

To run Eskapade use the make_notebook.sh script in scripts/ to create a template notebook. For example:

$ make_notebook.sh ./ TestRun

The minimal code you need to run a notebook is the following:

import imp
import logging
import os
imp.reload(logging)
log = logging.getLogger()
log.setLevel(logging.DEBUG) # Set the LogLevel here

from eskapade.core import execution
from eskapade import ConfigObject, DataStore, ProcessManager

# --- basic config
settings = ProcessManager().service(ConfigObject)
settings['macro'] = os.environ['ESKAPADE'] + '/tutorials/tutorial_1.py'
settings['analysisName'] = 'Tutorial_1'
settings['version'] = 0
settings['logLevel'] = logging.DEBUG # and set the LogLevel here

# --- optional running parameters
#settings['beginWithChain'] = 'startChain'
#settings['endWithChain'] = 'endChain'
#settings['resultsDir'] = 'resultsdir'
settings['storeResultsEachChain'] = True

# --- other global flags (just some examples)
settings['set_mongo'] = False
settings['set_training'] = False

# --- run eskapade!
execution.run_eskapade(settings)

# --- To rerun eskapade, clear the memory state first!
#execution.reset_eskapade()

Make sure to fill out all the necessary parameters for it to run. The macro has to be set, but not all settings in this example need a value. The function execution.run_eskapade(settings) runs Eskapade with the settings you specified.

To inspect the state of the Eskapade objects (DataStore and Configurations) after the various chains, see the examples below.

Note

Inspecting intermediate states requires Eskapade to be run with the option storeResultsEachChain (command line: -w) on.

import imp
import logging
import os
imp.reload(logging)
log = logging.getLogger()
log.setLevel(logging.DEBUG)

from eskapade import DataStore, ConfigObject, ProcessManager

# --- example inspecting the data store after the preprocessing chain
ds = DataStore.import_from_file(os.environ['ESKAPADE']+'/results/Tutorial_1/proc_service_data/v0/_Summary/eskapade.core.process_services.DataStore.pkl')
ds.keys()
ds.Print()
ds['data'].head()

# --- example showing Eskapade settings
co = ConfigObject.import_from_file(os.environ['ESKAPADE']+'/results/Tutorial_1/proc_service_data/v0/_Summary/eskapade.core.process_services.ConfigObject.pkl')
co.Print()

The import_from_file function imports a pickle file that was written out by Eskapade, containing the DataStore. This can be used to start from an intermediate state of your Eskapade run. For example, you can do some operations on your DataStore and then save it. At a later time you can load this saved DataStore and continue from there.
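
The idea of saving an intermediate state and resuming from it can be illustrated with the standard pickle module alone. The sketch below uses a plain dictionary as a stand-in for the DataStore and a temporary directory as a stand-in for the results directory. It does not reproduce what import_from_file does internally.

```python
import os
import pickle
import tempfile

# A plain dict plays the role of the data store in this sketch.
toy_store = {'data': [1, 2, 3], 'transformed_data': [2, 4, 6]}

# Write the state to disk, as would happen at the end of a chain.
tmp_dir = tempfile.mkdtemp()
pkl_path = os.path.join(tmp_dir, 'toy_store.pkl')
with open(pkl_path, 'wb') as f:
    pickle.dump(toy_store, f)

# Later (for example in a new session): load the state and continue from there.
with open(pkl_path, 'rb') as f:
    restored = pickle.load(f)

restored['more_data'] = [x + 1 for x in restored['transformed_data']]
print(sorted(restored.keys()))
```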

Running in a notebook

In this tutorial we will make a notebook and run the macro from tutorial 1. This macro shows the basics of Eskapade. Once we have Eskapade running in a terminal, we can run it also in jupyter. Make sure you have properly installed jupyter.

We start by making a notebook:

$ make_notebook.sh tutorials/ tutorial_3_notebook

This will create a notebook in tutorials/ with the name tutorial_3_notebook running macro tutorial_1.py. Now open jupyter and take a look at the notebook.

$ jupyter notebook

Try to run the notebook. You might get an error if the notebook cannot find the data for the data reader, unless you happen to be in the right folder. To find out which path you are working in, run the following in Jupyter:

!pwd

Then change the load path in the macro to the proper one. This can be for example:

os.environ['ESKAPADE'] + '/data/LAozone.data'

but in the end it depends on your setup.
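
One way to make the path independent of the notebook's working directory is to build it from the ESKAPADE environment variable and to check that the file exists before running the macro. The sketch below is an illustration only. The fallback to the current directory is an assumption made so that the snippet also runs when ESKAPADE is not set.

```python
import os

# Fall back to the current directory if ESKAPADE is not set (assumption for this sketch).
eskapade_root = os.environ.get('ESKAPADE', os.getcwd())
data_file_path = os.path.join(eskapade_root, 'data', 'LAozone.data')

print('working directory:', os.getcwd())
print('data file path:   ', data_file_path)
print('file exists:      ', os.path.isfile(data_file_path))
```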

Intermezzo: you can run bash commands in jupyter by prepending the command with a !

Now run the cells in the notebook and check if the macro runs properly. The output should be something like:

2017-02-14 14:04:55,506 DEBUG [link/execute_link]: Now executing link 'LA ozone data'
2017-02-14 14:04:55,506 DEBUG [readtodf/execute]: reading datasets from files ["../data/LAozone.data"]
2017-02-14 14:04:55,507 DEBUG [readtodf/pandasReader]: using Pandas reader "<function _make_parser_function.<locals>.parser_f at 0x7faaac7f4d08>"
2017-02-14 14:04:55,509 DEBUG [link/execute_link]: Done executing link 'LA ozone data'
2017-02-14 14:04:55,510 DEBUG [link/execute_link]: Now executing link 'Transform'
2017-02-14 14:04:55,511 DEBUG [applyfunctodataframe/execute]: Applying function <function <lambda> at 0x7faa8ba2e158>
2017-02-14 14:04:55,512 DEBUG [applyfunctodataframe/execute]: Applying function <function <lambda> at 0x7faa8ba95f28>
2017-02-14 14:04:55,515 DEBUG [link/execute_link]: Done executing link 'Transform'
2017-02-14 14:04:55,516 DEBUG [chain/execute]: Done executing chain 'Data'
2017-02-14 14:04:55,516 DEBUG [chain/finalize]: Now finalizing chain 'Data'
2017-02-14 14:04:55,517 DEBUG [link/finalize_link]: Now finalizing link 'LA ozone data'
2017-02-14 14:04:55,518 DEBUG [link/finalize_link]: Done finalizing link 'LA ozone data'
2017-02-14 14:04:55,518 DEBUG [link/finalize_link]: Now finalizing link 'Transform'
2017-02-14 14:04:55,519 DEBUG [link/finalize_link]: Done finalizing link 'Transform'
2017-02-14 14:04:55,519 DEBUG [chain/finalize]: Done finalizing chain 'Data'

with a lot more text surrounding this output. Now try to run the macro again. The run should fail, and you get the following error:

RuntimeError: tried to add chain with existing name to process manager

This is because the ProcessManager is a singleton: only one instance of it is allowed in memory. Since the Jupyter Python kernel was still running, the object from the first run still existed, and the macro failed when it tried to add a chain that was already there. Therefore the final line in the notebook template has to be run every time you want to rerun Eskapade. So run this line:

execution.reset_eskapade()

And try to rerun the notebook without restarting the kernel.

execution.run_eskapade(settings)

If you want to access the objects used in the run, you can retrieve them as services of the ProcessManager. For example, calling

ds = ProcessManager().service(DataStore)

returns the DataStore, and similarly the other 'master' objects can be retrieved. Resetting clears the process manager singleton from memory, after which the macro can be rerun without errors.

Note: restarting the jupyter kernel also works, but might take more time because you have to re-execute all of the necessary code.
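
The behaviour can be reproduced with a small singleton sketch. ToyManager below is a hypothetical class written for this illustration, not Eskapade's ProcessManager. It only shows why a second run in the same kernel fails and why a reset fixes it.

```python
class ToyManager:
    """Hypothetical singleton that refuses duplicate chain names."""

    _instance = None

    def __new__(cls):
        # Always hand back the same object while the kernel is alive.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.chains = []
        return cls._instance

    def add_chain(self, name):
        if name in self.chains:
            raise RuntimeError('tried to add chain with existing name')
        self.chains.append(name)

    @classmethod
    def reset(cls):
        # Forget the instance so the next call starts from scratch.
        cls._instance = None


def run_toy_macro():
    ToyManager().add_chain('Data')


run_toy_macro()            # first run works
try:
    run_toy_macro()        # second run hits the surviving singleton
    second_run_failed = False
except RuntimeError:
    second_run_failed = True

ToyManager.reset()
run_toy_macro()            # works again after the reset
print(second_run_failed, ToyManager().chains)
```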

Reading data from a pickle

Continuing with the notebook we are going to load a pickle file that is automatically written away when the engine runs. First we must locate the folder where it is saved. By default this is in:

$ESKAPADE/results/$MACRO/proc_service_data/v$VERSION/latest/eskapade.core.process_services.DataStore.pkl

Where $MACRO is the macro name you specified in the settings, $VERSION is the version you specified and latest refers to the last chain you wrote to disk. By default, the version is 0 and the name is v0 and the chain is the last chain of your macro.

You can write a specific chain with the command line arguments, otherwise use the default, the last chain of the macro.
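
The path pattern can also be assembled programmatically. The helper datastore_pickle_path below is a hypothetical convenience function for this illustration, not part of Eskapade, and the base directory /opt/eskapade is only a placeholder. The function simply fills in the pattern described above.

```python
import os


def datastore_pickle_path(base_dir, macro_name, version=0, chain='latest'):
    """Build the DataStore pickle path following the pattern described above."""
    return os.path.join(
        base_dir, 'results', macro_name, 'proc_service_data',
        'v{}'.format(version), chain,
        'eskapade.core.process_services.DataStore.pkl')


example_path = datastore_pickle_path('/opt/eskapade', 'Tutorial_1')
print(example_path)
```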

Now we are going to load the pickle from tutorial_1.

So make a new cell in jupyter and add:

from eskapade import DataStore

to import the DataStore class. Now to import the actual pickle and convert it back to the DataStore do:

ds = DataStore.import_from_file(os.environ['ESKAPADE']+'/results/Tutorial_1/proc_service_data/v0/latest/eskapade.core.process_services.DataStore.pkl')

to open the saved DataStore into variable ds. Now we can call the keys of the DataStore with

ds.Print()

We see there are two keys: data and transformed_data. Call one of them and see what is in there. You will find of course the pandas DataFrames that we used in the tutorial. Now you can use them in the notebook environment and directly interact with the objects without running the entirety of Eskapade.

Similarly you can open old ConfigObject and DataStore objects if they are available. By importing and calling:

from eskapade import ConfigObject
settings = ConfigObject.import_from_file(os.environ['ESKAPADE']+'/results/Tutorial_1/proc_service_data/v0/latest/eskapade.core.process_services.ConfigObject.pkl')

one can import the saved singleton at the path. The singleton can be any of the above-mentioned stores/objects. Finally, by default there are soft links in the results directory at results/$MACRO/proc_service_data/v$VERSION/latest/ that point to the pickles of the associated objects from the last chain in the macro.

Tutorial 4: using RooFit

This section provides a tutorial on how to use RooFit in Eskapade. RooFit is an advanced fitting library in ROOT, which is great for modelling all sorts of data sets. See this tutorial for a 20 min introduction into RooFit. ROOT (and RooFit) works 'out of the box' in the Eskapade docker/vagrant image.

In this tutorial we will illustrate how to define a new probability density function (pdf) in RooFit, how to compile it, and how to use it in Eskapade to simulate a dataset, fit it, and plot the results.

Note

There are many good RooFit tutorials. See the macros in the directory $ROOTSYS/tutorials/roofit/ of your local ROOT installation. This tutorial is partially based on the RooFit tutorial $ROOTSYS/tutorials/roofit/rf104_classfactory.C.

Building a new probability density function

Before using a new model in Eskapade, we need to create, compile and load a probability density function model in RooFit.

Move to the directory:

$ cd $ESKAPADE/cxx/roofit/src/

Start an interactive python session and type:

import ROOT
ROOT.RooClassFactory.makePdf("MyPdfV2","x,A,B","","A*fabs(x)+pow(x-B,2)")

This command creates a RooFit skeleton probability density function class named MyPdfV2, with the variables x, A and B and the given formula expression.

Also type:

ROOT.RooClassFactory.makePdf("MyPdfV3","x,A,B","","A*fabs(x)+pow(x-B,2)",True,False,"x:(A/2)*(pow(x.max(rangeName),2)+pow(x.min(rangeName),2))+(1./3)*(pow(x.max(rangeName)-B,3)-pow(x.min(rangeName)-B,3))")

This creates the RooFit p.d.f. class MyPdfV3, with the variables x, A and B, the given formula expression, and the given expression for the analytical integral over x.

Exit python (Ctrl-D) and type:

$ ls -l MyPdf*

You will see two cxx files and two header files. Open the file MyPdfV2.cxx. You should see an evaluate() method in terms of x, A and B with the formula expression we provided.

Now open the file MyPdfV3.cxx. This also contains the method analyticalIntegral() with the expression for the analytical integral over x that we provided.

If no analytical integral has been provided, as in MyPdfV2, RooFit will try to compute the integral itself. (Of course this is a costly operation.) If you wish, since we know the analytical integral for MyPdfV2, go ahead and edit MyPdfV2.cxx to add the expression of the analytical integral to the class.
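
The analytical integral passed to MyPdfV3 can be cross-checked numerically without ROOT. The sketch below compares the closed-form expression with a simple midpoint-rule integration. Note that the A*|x| part of the closed form is only valid when the integration range contains zero, as it does for the range (-10, 10) used later in this tutorial. The function names are made up for this check.

```python
def shape(x, a, b):
    """Unnormalized pdf shape: A*|x| + (x - B)^2."""
    return a * abs(x) + (x - b) ** 2


def analytical_integral(x_min, x_max, a, b):
    """Closed form as given to makePdf; assumes x_min <= 0 <= x_max."""
    return (a / 2.0) * (x_max ** 2 + x_min ** 2) \
        + (1.0 / 3.0) * ((x_max - b) ** 3 - (x_min - b) ** 3)


def midpoint_integral(x_min, x_max, a, b, n_steps=100000):
    """Numerical cross-check using the midpoint rule."""
    width = (x_max - x_min) / n_steps
    return sum(shape(x_min + (i + 0.5) * width, a, b)
               for i in range(n_steps)) * width


exact = analytical_integral(-10.0, 10.0, 10.0, 2.0)
approx = midpoint_integral(-10.0, 10.0, 10.0, 2.0)
print(exact, approx)
```

Both numbers should agree to several decimal places, which gives some confidence in the expression before it is compiled into the class.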

As another example of a simple pdf class, take a look at the expressions in the file: $ESKAPADE/cxx/roofit/src/RooWeibull.cxx.

Now move the header files to their correct location:

$ mv MyPdfV*.h $ESKAPADE/cxx/roofit/include/

To make sure that these classes get picked up in the Eskapade roofit library, open the file $ESKAPADE/cxx/roofit/dict/LinkDef.h and add the lines:

#pragma link C++ class MyPdfV2+;
#pragma link C++ class MyPdfV3+;

Finally, let's compile the c++ code of these classes:

$ cd $ESKAPADE
$ make install

You should see the compiler churning away, processing several existing classes but also MyPdfV2 and MyPdfV3.

We are now able to open the Eskapade roofit library, so we can use these classes in python:

from eskapade.root_analysis import roofit_utils
roofit_utils.load_libesroofit()

In fact, this last snippet of code is used in the tutorial macro right below.

Running the tutorial macro

Let's take a look at the steps in the tutorial macro $ESKAPADE/tutorials/tutorial_4.py. The macro illustrates how to do basic statistical data analysis with roofit, by making use of the RooWorkspace functionality. A RooWorkspace is a persistable container for RooFit projects. A workspace can contain and own variables, p.d.f.s, functions and datasets. The example shows how to define a pdf, simulate data, fit this data, and then plot the fit result. There are 5 sections; they are detailed below.

The next step is to run the tutorial macro.

$ cd $ESKAPADE
$ source setup.sh
$ run_eskapade.py tutorials/tutorial_4.py

Let's discuss what we are seeing on the screen.

Loading the Eskapade ROOT library

The macro first checks the existence of the class MyPdfV3 that we just created in the previous section.

# --- 0. make sure Eskapade RooFit library is loaded

# --- load and compile the Eskapade roofit library
from eskapade.root_analysis import roofit_utils
roofit_utils.load_libesroofit()

# --- check existence of class MyPdfV3 in ROOT
pdf_name = 'MyPdfV3'
log.info('Now checking existence of ROOT class %s' % pdf_name)
cl = ROOT.TClass.GetClass(pdf_name)
if not cl:
    log.critical('Could not find ROOT class %s. Did you build and compile it correctly?' % pdf_name)
    sys.exit(1)
else:
    log.info('Successfully found ROOT class %s' % pdf_name)

In the output on the screen, look for Now checking existence of ROOT class MyPdfV3. If this was successful, it should then say Successfully found ROOT class MyPdfV3.

Instantiating a pdf

The link WsUtils, which stands for RooWorkspace utils, allows us to instantiate a pdf. Technically, one defines a model by passing strings to the rooworkspace factory. For examples on using the rooworkspace factory see here, here and here for more details. The entire rooworkspace factory syntax can be found here.

ch = proc_mgr.add_chain('WsOps')

# --- instantiate a pdf
wsu = root_analysis.WsUtils(name = 'modeller')
wsu.factory = ["MyPdfV3::testpdf(y[-10,10],A[10,0,100],B[2,-10,10])"]
ch.add_link(wsu)

Here we use the pdf class we just created (MyPdfV3) to create a pdf called testpdf, with observable y and parameters A and B, having ranges (-10,10), (0,100) and (-10,10) respectively, and with initial values for A and B of 10 and 2 respectively.

Simulating data

The link WsUtils is then used to simulate records according to the shape of testpdf.

wsu = root_analysis.WsUtils(name = 'simulater')
wsu.add_simulate(pdf='testpdf', obs='y', num=400, key='simdata')
ch.add_link(wsu)

Here we simulate 400 records of observable y with pdf testpdf (which is of type MyPdfV3). The simulated data is stored in the datastore under key simdata.
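
To get a feeling for what simulating records according to the shape of a pdf means, the sketch below draws 400 values of y with accept-reject sampling in pure Python. How RooFit generates events internally is not shown here; accept-reject is swapped in as a simple, well-known technique. The values A=10 and B=2 and the range (-10,10) are taken from the factory string above, while the function names and the random seed are choices made for this illustration.

```python
import random


def shape(y, a=10.0, b=2.0):
    """Unnormalized pdf shape A*|y| + (y - B)^2 with A=10 and B=2."""
    return a * abs(y) + (y - b) ** 2


def sample_accept_reject(n_records, y_min=-10.0, y_max=10.0, seed=42):
    """Draw n_records values of y distributed according to shape()."""
    rng = random.Random(seed)
    # The shape is convex, so its maximum on the range lies at an endpoint.
    ceiling = max(shape(y_min), shape(y_max))
    records = []
    while len(records) < n_records:
        y = rng.uniform(y_min, y_max)
        # Keep the candidate with probability shape(y) / ceiling.
        if rng.uniform(0.0, ceiling) < shape(y):
            records.append(y)
    return records


simdata = sample_accept_reject(400)
print(len(simdata), min(simdata), max(simdata))
```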

Fitting the data

Another version of the link WsUtils is then used to fit the simulated records with the pdf testpdf.

wsu = root_analysis.WsUtils(name = 'fitter')
wsu.pages_key='report_pages'
wsu.add_fit(pdf='testpdf', data='simdata', key='fit_result')
ch.add_link(wsu)

The link performs a fit of pdf testpdf to dataset simdata. We store the fit result object in the datastore under key fit_result. The fit knows from the input dataset that the observable is y, so that the fit parameters are A and B.

Plotting the fit result

Finally, the last version of the link WsUtils is used to plot the result of the fit on top of simulated data.

wsu = root_analysis.WsUtils(name = 'plotter')
wsu.pages_key='report_pages'
wsu.add_plot(obs='y', data='simdata', pdf='testpdf', pdf_kwargs={'VisualizeError': 'fit_result', 'MoveToBack': ()}, key='simdata_plot')
wsu.add_plot(obs='y', pdf='testpdf', file='fit_of_simdata.pdf', key='simdata_plot')
ch.add_link(wsu)

This link is configured to do two things. First it plots the observable y of the dataset simdata and then plots the fitted uncertainty band of the pdf testpdf on top of this. The plot is stored in the datastore under the key simdata_plot. Then it plots the fitted pdf testpdf without uncertainty band on top of the same frame simdata_plot. The resulting plot is stored in the file fit_of_simdata.pdf.

Fit report

The link WsUtils produces a summary report of the fit it has just performed. The pages of this report are stored in the datastore under the key report_pages. At the end of the Eskapade session, the plots and latex files produced by this tutorial are written out to disk.

The fit report can be found at:

$ cd $ESKAPADE/results/tutorial_4/data/v0/report/
$ pdflatex report.tex

Take a look at the resulting fit report: report.pdf. It contains pages summarizing: the status and quality of the fit (including the correlation matrix), summary tables of the floating and fixed parameters in the fit, as well as the plot we have produced.

Other ROOT Examples in Eskapade

Other example Eskapade macros using ROOT and RooFit can be found in the $ESKAPADE/tutorials directory, e.g. see esk401_roothist_fill_plot_convert.py and all other 400 numbered macros.

Tutorial 5: going Spark

This section provides a tutorial on how to use Apache Spark in Eskapade. Spark works 'out of the box' in the Eskapade docker/vagrant image. For details on how to setup a custom Spark setup, see the Spark section in the Appendix.

In this tutorial we will basically redo Tutorial 1 but use Spark instead of Pandas for data processing. The following paragraphs describe step-by-step how to run a Spark job, use existing links and write your own links for Spark queries.

Note

To get familiar with Spark in Eskapade you can follow the exercises in tutorials/tutorial_5.py.

Running the tutorial macro

The very first step to run the tutorial Spark job is:

$ source setup.sh
$ run_eskapade.py tutorials/tutorial_5.py

Eskapade will start a Spark session, do nothing, and quit - there are no chains/links defined yet. The Spark session is created via the SparkManager which, like the DataStore, is a singleton that configures and controls Spark sessions centrally. It is activated through the magic line:

proc_mgr.service(SparkManager).spark_session

Note that when the Spark session is created, the following line appears in logs:

Adding Python modules to ZIP archive /Users/gossie/git/gitlab-nl/decision-engine/eskapade/es_python_modules.zip

This is the SparkManager that ensures all Eskapade source code is uploaded and available to the Spark cluster when running in a distributed environment.

If there was an ImportError: No module named pyspark then, most likely, SPARK_HOME and PYTHONPATH are not set up correctly. For details, see the Spark section in the Appendix.

Reading data

Spark can read data from various sources, e.g. local disk, HDFS, HIVE tables. Eskapade provides the SparkDfReader link that uses the pyspark.sql.DataFrameReader to read flat CSV files into Spark DataFrames, RDD's, and Pandas DataFrames. To read in the Tutorial data, the following link should be added to the Data chain:

reader = spark_analysis.SparkDfReader(name='Read_LA_ozone', store_key='data', read_methods=['csv'])
reader.read_meth_args['csv'] = (DATA_FILE_PATH,)
reader.read_meth_kwargs['csv'] = dict(sep=',', header=True, inferSchema=True)
proc_mgr.get_chain('Data').add_link(reader)

The DataStore holds a pointer to the Spark dataframe in (distributed) memory. This is different from a Pandas dataframe, where the entire dataframe is stored in the DataStore, because a Spark dataframe residing on the cluster may not fit entirely in the memory of the machine running Eskapade. This means that Spark dataframes are never written to disk in DataStore pickles!

Spark examples

Example Eskapade macros using Spark can be found in the tutorials directory, see esk601_spark_configuration.py and further.

Spark Streaming

Eskapade supports the use of Spark Streaming as demonstrated in the word count example tutorials/esk610_spark_streaming_wordcount.py. The data is processed in (near) real-time as micro batches of RDD's, so-called discretized streaming, where the stream originates from either new incoming files or a network connection. As with regular Spark queries, various transformations can be defined and applied in subsequent Eskapade links.

For details on Spark Streaming, see also https://spark.apache.org/docs/latest/streaming-programming-guide.html.
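
The micro-batch idea behind the word count example can be sketched in plain Python, without Spark. In the sketch below each inner list stands for the lines that arrived during one batch interval, and every batch is counted separately. The function name and the sample batches are made up for this illustration; this is not how the Eskapade macro or Spark implement it.

```python
from collections import Counter


def count_words(batch):
    """Split each line into words and reduce to per-word counts for one batch."""
    counts = Counter()
    for line in batch:
        counts.update(line.split())
    return counts


# Each inner list stands for the lines that arrived in one batch interval.
micro_batches = [
    ['Hello world', 'Hello world'],
    ['Hello world'],
    [],
]

batch_counts = [count_words(batch) for batch in micro_batches]
for i, counts in enumerate(batch_counts):
    print(i, dict(counts))
```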

File stream

The word count example using the file stream method can be run by executing in two different terminals:

terminal 1 $ for ((i=0; i<=100; i++)); do echo "Hello world" > /tmp/dummy_$(printf %05d ${i}); sleep 0.1; done
terminal 2 $ run_eskapade.py -c stream_type='file' $ESKAPADE/tutorials/esk610_spark_streaming.py

Here the bash for-loop creates a new file containing Hello world in the /tmp directory every 0.1 seconds. Spark Streaming will pick up and process these files, and in terminal 2 a word count of the processed data will be displayed. Output is stored in $ESKAPADE/results/esk610_spark_streaming/data/v0/dstream/wordcount.

TCP stream

The word count example using the TCP stream method can be run by executing in two different terminals:

terminal 1 $ nc -lk 9999
terminal 2 $ run_eskapade.py -c stream_type='tcp' $ESKAPADE/tutorials/esk610_spark_streaming.py

Here nc (netcat) will stream data to port 9999, and Spark Streaming will listen to this port and process incoming data. In terminal 1 random words can be typed (followed by enter), and in terminal 2 a word count of the processed data will be displayed. Output is stored in $ESKAPADE/results/esk610_spark_streaming/data/v0/dstream/wordcount.

All available examples

Every subpackage of Eskapade contains links in its links/ subdirectory.

  • core_ops contains links pertaining to the core functionality of Eskapade, where the core package is the core framework of Eskapade.
  • analysis contains pandas links.
  • visualization contains plotter links.
  • root_analysis contains ROOT links for data generation, fitting, and plotting.
  • data_quality contains links for fixing messy data.
  • spark_analysis contains spark related analysis links.

The name of every link indicates its basic function. If you want to know explicitly you can read the API-docs. If that does not help, read and try to understand the example macros in tutorials/, which show the basic usage of most Eskapade functionality. (See also the Examples section right below.) If still unclear, go into the link's code to find out how it exactly works.

Many Eskapade example macros can be found in the tutorials directory. The numbering of the example macros follows the package structure:

  • esk100+: basic macros describing the chains, links, and datastore functionality of Eskapade.
  • esk200+: macros describing links to do basic processing of pandas dataframes.
  • esk300+: visualization macros for making histograms, plots and reports of datasets.
  • esk400+: macros for processing ROOT datasets and performing fits to data using RooFit.
  • esk500+: macros for doing data quality assessment and cleaning.
  • esk600+: macros describing links to do basic processing of data and rdds with spark.

The basic Eskapade macros (esk100+) are briefly described below. They explain the basic architecture of Eskapade, i.e. how the chains, links, datastore, and process manager interact.

Hopefully you now have enough knowledge to run and explore Eskapade by yourself. You are encouraged to run all examples to see what Eskapade can do for you!

Example esk101: Hello World!

Macro 101 runs the Hello World Link. It runs the Link twice using a repeat kwarg, showing how to use kwargs in Links.

Example esk102: Multiple chains

Macro 102 uses multiple chains to print different kinds of output from one Link. This link is initialized multiple times with different kwargs and names. There are if-statements in the macro to control the usage of the chains.

Example esk103: Print the DataStore

Macro 103 has some objects in the DataStore. The contents of the DataStore are printed in the standard output.

Example esk104: Basic DataStore operations

Macro 104 adds some objects from a dictionary to the DataStore and then moves or deletes some of the items. Next it adds more items and prints some of the objects.

Example esk105: DataStore Pickling

Macro 105 has 3 versions: A, B and C. These are built on top of the basic macro esk105. Each of these 3 macros does something slightly different:

  • A does not store any output pickles,
  • B stores all output pickles,
  • C starts at the 3rd chain of the macro.

Using these examples one can see how to control the way macros are run and what they save to disk.

Example esk106: Command line arguments

Macro 106 shows us how command line arguments can be used to control the chains in a macro. By adding the arguments from the message inside of the macro we can see that the chains are not run.

Example esk107: Chain loop

Example 107 adds a chain to the macro and using a repeater Link it repeats the chain 10 times in a row.

Example esk108: Event loop

Example 108 processes a textual data set, to loop through every word and do a Map and Reduce operation on the data set. Finally a line printer prints out the result.

Example esk109: Debugging tips

This macro illustrates basic debugging features of Eskapade. The macro shows how to start interactive ipython sessions while running through the chains, and also how to break out of a chain.

Example esk110: Code profiling

This macro demonstrates how to run Eskapade with code profiling turned on.

Example esk201: Read data

Macro 201 reads a file into the DataStore. The first chain reads one csv into the DataStore, the second chain reads multiple files (actually the same file multiple times) into the DataStore. (Looping over data is shown in example esk209.)

Example esk202: Write data

Macro 202 reads a DataFrame into the data store and then writes the DataFrame to csv format on the disk.

Tips on coding

This section contains a general description on how to use Eskapade in combination with other tools, in particular for the purpose of developing code.

Eskapade in PyCharm

PyCharm is a very handy IDE for debugging Python source code. It can be used to run Eskapade stand-alone (i.e. like from the command line) and with an API.

Stand-alone
  • Install PyCharm on your machine.
  • Open project and point to the Eskapade source code
  • Configuration, in 'Preferences', check the following desired values:
    • Under 'Project: eskapade' / 'Project Interpreter':
      • The correct Python version (currently 3.5.2 of Anaconda, use the interpreter of your conda environment)
    • Under 'Build, Execution & Deployment' / 'Console' / 'Python Console':
      • The correct Python version (currently 3.5.2 of Anaconda, use the interpreter of your conda environment)
  • Run/Debug Configuration:
    • Under 'Python' add new configuration
    • Script: scripts/run_eskapade.py
    • Script parameters: -w ../tutorials/tutorial_1.py
    • Working directory: $ESKAPADE
    • Python interpreter: check if it is the correct Python version (currently 3.5.2 of Anaconda, corresponding to your conda environment)
    • Environment variables: should contain those defined in $ESKAPADE/setup.sh.

You should now be able to press the 'play button' to run Eskapade with the specified parameters.