Miscellaneous

A collection of miscellaneous Eskapade-related items.

  • See Migration Tips to migrate between Eskapade versions.
  • See macOS to get started with Eskapade on a mac.
  • See Apache Spark for details on using Spark with Eskapade.

Migration Tips

From version 0.6 to 0.7

Below we list the API changes needed to migrate from Eskapade version 0.6 to version 0.7.

Macros

  • Process manager definition:

    • Change proc_mgr. to process_manager.
    • Change ProcessManager to process_manager
    • Delete the line: proc_mgr = ProcessManager()
  • Logger:

    • Change import logging to from eskapade.logger import Logger, LogLevel
    • Change log. to logger.
    • Change log = logging.getLogger('macro.cpf_analysis') to logger = Logger()
    • Change logging to LogLevel
  • Settings:

    Remove os.environ['WORKDIRROOT']. Since the environment variable WORKDIRROOT is no longer defined, either define the data and macro paths explicitly, or execute the macros and tests from the root directory of the project, resulting in something like:

    • settings['resultsDir'] = os.getcwd() + '/es_results'
    • settings['macrosDir']  = os.getcwd() + '/es_macros'
    • settings['dataDir']    = os.getcwd() + '/data'
  • Chain definition in macros:

    • To import the Chain object, add from eskapade import Chain
    • Change process_manager.add_chain('chain_name') to <chain_name> = Chain('chain_name')
    • Change process_manager.get_chain('ReadCSV').add_link to <chain_name>.add (see the sketch below)
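
Putting these changes together, a minimal v0.7-style macro might start as follows. This is a hedged sketch; the chain name, log message and link are purely illustrative:

from eskapade import Chain
from eskapade.logger import Logger

logger = Logger()
logger.info('Starting macro.')  # replaces log = logging.getLogger(...) and log.info(...)

# process_manager is used directly; no ProcessManager() instantiation is needed
read_csv = Chain('ReadCSV')     # replaces process_manager.add_chain('ReadCSV')
# read_csv.add(some_link)       # replaces process_manager.get_chain('ReadCSV').add_link(some_link)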

Tests

  • Process manager definition:

    • Change ProcessManager() to process_manager
    • Change process_manager.get_chain to process_manager.get
  • Settings:

    Remove os.environ['WORKDIRROOT']. Since the environment variable WORKDIRROOT is no longer defined, either define the data and macro paths explicitly, or execute the macros and tests from the root directory of the project, resulting in something like:

    • settings['macrosDir'] = os.getcwd() + '/es_macros'
    • settings['dataDir']   = os.getcwd() + '/data'
  • StatusCode:

    • Change status.isSkipChain() to status.is_skip_chain()
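
For example, a v0.6-style test that fetched a chain from a ProcessManager() instance would, after migration, look roughly like this (a hedged sketch; the chain name is illustrative):

from eskapade import process_manager

chain = process_manager.get('ReadCSV')  # replaces ProcessManager().get_chain('ReadCSV')
# and in status checks: status.is_skip_chain() replaces status.isSkipChain()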

macOS

Installing Eskapade on macOS involves four steps:

  • Setting up Python 3.6
  • Setting up ROOT 6.10.08
  • Setting up Apache Spark 2.x
  • Setting up Eskapade

Note

This installation guide is written for macOS High Sierra, using Homebrew and the fish shell.

Setting up Python 3.6

Homebrew provides Python 3.6 for macOS:

$ brew install python3

To create an isolated Python installation, use virtualenv:

$ virtualenv ~/venv/eskapade --python=python3 --system-site-packages

Each time a new terminal is started, set up the virtual Python environment:

$ . ~/venv/eskapade/bin/activate.fish
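
To verify that the activated environment indeed provides Python 3.6, a quick check from Python itself:

import sys
print(sys.version_info)  # expected to report major=3, minor=6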

Setting up ROOT 6

Clone ROOT from the git repository:

$ git clone http://root.cern.ch/git/root.git
$ cd root
$ git checkout -b v6-10-08 v6-10-08

Then compile it with additional flags to enable the desired functionality:

$ mkdir ~/root_v06-10-08_p36m && cd ~/root_v06-10-08_p36m
$ cmake -Dfftw3=ON -Dmathmore=ON -Dminuit2=ON -Droofit=ON -Dtmva=ON -Dsoversion=ON -Dthread=ON -Dpython3=ON -DPYTHON_EXECUTABLE=/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/bin/python3.6m -DPYTHON_INCLUDE_DIR=/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/include/python3.6m/ -DPYTHON_LIBRARY=/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/libpython3.6m.dylib $HOME/root
$ cmake --build . -- -j7

PS: make sure all the flags are picked up correctly (for example, -Dfftw3 requires fftw to be installed with Homebrew).

To set up the ROOT environment each time a new shell is started, set the following environment variables:

set -xg ROOTSYS "$HOME/root_v06-10-08_p36m"
set -xg PATH $ROOTSYS/bin $PATH
set -xg LD_LIBRARY_PATH "$ROOTSYS/lib:$LD_LIBRARY_PATH"
set -xg DYLD_LIBRARY_PATH "$ROOTSYS/lib:$DYLD_LIBRARY_PATH"
set -xg LIBPATH "$ROOTSYS/lib:$LIBPATH"
set -xg SHLIB_PATH "$ROOTSYS/lib:$SHLIB_PATH"
set -xg PYTHONPATH "$ROOTSYS/lib:$PYTHONPATH"

Note that for bash shells this can be done by sourcing the script in root_v06-10-08_p36m/bin/thisroot.sh.

Finally, install the Python packages for ROOT bindings:

$ pip install rootpy==1.0.1 root-numpy==4.7.3
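
Once the bindings are installed, a quick sanity check from the virtual environment could look like this (a sketch, assuming the ROOT environment variables above have been set):

import ROOT
print(ROOT.gROOT.GetVersion())  # should report 6.10/08

import root_numpy
print(root_numpy.__version__)   # should report 4.7.3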

Setting up Apache Spark 2.x

Apache Spark is provided through Homebrew:

$ brew install apache-spark

The py4j package is needed to support access to Java objects from Python:

$ pip install py4j==0.10.4

To set up the Spark environment each time a new terminal is started, set:

set -xg SPARK_HOME (brew --prefix apache-spark)/libexec
set -xg SPARK_LOCAL_HOSTNAME "localhost"
set -xg PYTHONPATH "$SPARK_HOME/python:$PYTHONPATH"
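
With these variables set, pyspark should be importable and a local session should start. A minimal check (hedged sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('spark_check').getOrCreate()
print(spark.version)
spark.stop()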

Setting up Eskapade

The Eskapade source code can be obtained from git:

$ git clone git@github.com:KaveIO/Eskapade.git eskapade

To set up the Eskapade environment (Python, Spark, ROOT) each time a new terminal is started, source a shell script (e.g. setup_eskapade.fish) that sets the environment variables as described above:

# --- setup Python
. ~/venv/eskapade/bin/activate.fish

# --- setup ROOT
set -xg ROOTSYS "${HOME}/root_v06-10-08_p36m"
set -xg PATH $ROOTSYS/bin $PATH
set -xg LD_LIBRARY_PATH "$ROOTSYS/lib:$LD_LIBRARY_PATH"
set -xg DYLD_LIBRARY_PATH "$ROOTSYS/lib:$DYLD_LIBRARY_PATH"
set -xg LIBPATH "$ROOTSYS/lib:$LIBPATH"
set -xg SHLIB_PATH "$ROOTSYS/lib:$SHLIB_PATH"
set -xg PYTHONPATH "$ROOTSYS/lib:$PYTHONPATH"

# --- setup Spark
set -xg SPARK_HOME (brew --prefix apache-spark)/libexec
set -xg SPARK_LOCAL_HOSTNAME "localhost"
set -xg PYTHONPATH "$SPARK_HOME/python:$PYTHONPATH"

# --- setup Eskapade
cd /path/to/eskapade

Finally, install Eskapade (and its dependencies) by simply running:

$ pip install -e /path/to/eskapade
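
To confirm that the installation plays well with the ROOT and Spark setup above, importing the main Eskapade entry points from the activated environment should succeed:

from eskapade import process_manager, Chain
from eskapade.spark_analysis import SparkManager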

Apache Spark

Eskapade supports the use of Apache Spark for parallel processing of large data volumes. Jobs can run on a single laptop using Spark libraries, as well as on a Spark/Hadoop cluster in combination with YARN. This section describes how to set up and configure Spark for use with Eskapade. For examples of running Spark jobs with Eskapade, see the Spark tutorial.

Note

Eskapade supports both batch and streaming processing with Apache Spark.

Requirements

A working setup of the Apache Spark libraries is included in both the Eskapade docker and vagrant image (see section Installation). For installation of Spark libraries in a custom setup, please refer to the Spark documentation.

Spark installation

The environment variables SPARK_HOME and PYTHONPATH need to be set, pointing to the location of the Spark installation and to the Python libraries of Spark and its py4j dependency. In the Eskapade docker, for example, they are set to:

$ echo $SPARK_HOME
/opt/spark/pro/
$ echo $PYTHONPATH
/opt/spark/pro/python:/opt/spark/pro/python/lib/py4j-0.10.4-src.zip:...
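
If it is unclear whether these variables are visible to Python, they can be inspected directly; the pyspark module location should then resolve to a path under SPARK_HOME:

import os
print(os.environ.get('SPARK_HOME'))

import pyspark
print(pyspark.__file__)  # should point into $SPARK_HOME/python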

Configuration

The Spark configuration can be set in two ways:

  1. an Eskapade macro (preferred)
  2. an Eskapade link

This is demonstrated in the following tutorial macro:

$ eskapade_run python/eskapade/tutorials/esk601_spark_configuration.py

Both methods are described below. For a full explanation of Spark configuration settings, see Spark Configuration. In case configuration settings do not seem to be picked up correctly, please check the Notes at the end of this section.

Eskapade macro (preferred)

This method allows settings to be specified per macro, i.e. per analysis, and is therefore the preferred way of bookkeeping analysis-specific settings.

The easiest way to start a Spark session is:

from eskapade import process_manager
from eskapade.spark_analysis import SparkManager

# obtain the SparkManager service from the process manager
sm = process_manager.service(SparkManager)
spark = sm.create_session(eskapade_settings=settings)
sc = spark.sparkContext

The default Spark configuration file python/eskapade/config/spark/spark.cfg will be picked up. It contains the following settings:

[spark]
spark.app.name=es_spark
spark.jars.packages=org.diana-hep:histogrammar-sparksql_2.11:1.0.4
spark.master=local[*]
spark.driver.host=localhost

The default Spark settings can be adapted here for all macros at once. In case alternative settings are only relevant for a single analysis, they can also be specified in the macro using the arguments of the create_session method of the SparkManager:

from eskapade import process_manager
from eskapade.spark_analysis import SparkManager

sm = process_manager.service(SparkManager)

# override individual settings for this analysis only
spark = sm.create_session(spark_settings=[('spark.app.name', 'es_spark_alt_config'), ('spark.master', 'local[42]')])

# the full set of (optional) arguments of create_session
spark = sm.create_session(eskapade_settings=settings,
                          spark_settings=spark_settings,
                          config_path='/path/to/alternative/spark.cfg',
                          enable_hive_support=False,
                          include_eskapade_modules=False
                         )

All arguments are optional:

  • eskapade_settings: default configuration file as specified by the sparkCfgFile key in ConfigObject (i.e. spark.cfg)
  • config_path: alternative path to the configuration file
  • spark_settings: list of key-value pairs specifying additional Spark settings
  • enable_hive_support: switch to disable/enable Spark Hive support
  • include_eskapade_modules: switch to include/exclude Eskapade modules in Spark job submission (e.g. for user-defined functions)
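
After the session has been created, the effective configuration can be inspected to verify that the intended settings were actually picked up, for example:

sc = spark.sparkContext
for key, value in sorted(sc.getConf().getAll()):
    print(key, value)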

Parameters

The most important parameters to tune for optimal performance are:

  • num-executors
  • executor-cores
  • executor-memory
  • driver-memory
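
When running through Eskapade rather than spark-submit, the corresponding SparkConf keys can be passed via spark_settings in the macro. A hedged sketch (the values are illustrative, and sm and settings are as in the configuration examples above):

spark_settings = [('spark.executor.instances', '4'),  # corresponds to --num-executors
                  ('spark.executor.cores', '2'),      # corresponds to --executor-cores
                  ('spark.executor.memory', '4g'),    # corresponds to --executor-memory
                  ('spark.driver.memory', '2g')]      # corresponds to --driver-memory
spark = sm.create_session(eskapade_settings=settings, spark_settings=spark_settings)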

Dynamic allocation

Since version 2.1, Spark allows for dynamic resource allocation. This requires the following settings:

  • spark.dynamicAllocation.enabled=true
  • spark.shuffle.service.enabled=true

Depending on the mode (standalone, YARN, Mesos), an additional shuffle service needs to be set up. See the documentation for details.
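
Within an Eskapade macro, these keys can be passed along in the same way (a sketch, assuming sm and settings as above):

spark = sm.create_session(eskapade_settings=settings,
                          spark_settings=[('spark.dynamicAllocation.enabled', 'true'),
                                          ('spark.shuffle.service.enabled', 'true')])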

Logging

The logging level of Spark can be controlled in two ways:

  1. through $SPARK_HOME/conf/log4j.properties:

log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO

  2. through the SparkContext in Python:

spark = process_manager.service(SparkManager).get_session()
spark.sparkContext.setLogLevel('INFO')

PS: the loggers in Python can be controlled through:

import logging
print(logging.Logger.manager.loggerDict) # obtain list of all registered loggers
logging.getLogger('py4j').setLevel('INFO')
logging.getLogger('py4j.java_gateway').setLevel('INFO')

However, not all Spark-related loggers are available here (as they are Java-based).

Notes

There are a few pitfalls w.r.t. setting up Spark correctly:

1. If the environment variable PYSPARK_SUBMIT_ARGS is defined, its settings may override those specified in the macro/link. This can be prevented by unsetting the variable:

$ unset PYSPARK_SUBMIT_ARGS

or in the macro:

import os
del os.environ['PYSPARK_SUBMIT_ARGS']

The former will clear the variable from the shell session, whereas the latter will only clear it in the Python session.

2. In client mode, not all driver options set via SparkConf are picked up at job submission, because the JVM has already been started. Those settings should therefore be passed through the SPARK_OPTS environment variable instead of using SparkConf in an Eskapade macro or link:

SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --driver-memory 2g"

3. In case a Spark machine is not connected to a network, setting the SPARK_LOCAL_HOSTNAME environment variable or the spark.driver.host key in SparkConf to the value localhost may fix DNS resolution timeouts which prevent Spark from starting jobs.