eskapade.analysis package

Submodules

eskapade.analysis.correlation module

Project: Eskapade - A python-based package for data analysis.

Created: 2018/06/23

Description:

Correlation related util functions.

Convert a Pearson correlation value into the chi2 value of a contingency test matrix of a bivariate Gaussian, and vice versa. The calculation uses scipy’s mvn library. Calculate correlation coefficients based on the mutual_information, correlation_ratio, pearson, kendall or spearman methods.

Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

eskapade.analysis.correlation.calculate_correlations(df, method)

Calculates correlation coefficients between every column pair.

Parameters:
  • df (pd.DataFrame) – input data frame
  • method (str) – one of mutual_information, correlation_ratio, pearson, kendall, spearman, phik or significance
Returns:

pd.DataFrame
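
As a rough illustration of what "between every column pair" means for the pearson method, the sketch below computes the full coefficient matrix for a small table in plain Python. It does not call eskapade: the helper names pearson and pairwise_pearson and the sample columns are hypothetical, and calculate_correlations itself may compute the result differently.

```python
import math
from itertools import product


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_pearson(columns):
    """Return {(col_a, col_b): coefficient} for every pair of columns."""
    return {(a, b): pearson(columns[a], columns[b])
            for a, b in product(columns, repeat=2)}


# hypothetical sample data, one list per column
columns = {
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],   # rises linearly with x
    'z': [4.0, 3.0, 2.0, 1.0],   # falls linearly with x
}
matrix = pairwise_pearson(columns)
```

The result is keyed by column pairs, which corresponds to the square data frame the function is documented to return.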

eskapade.analysis.correlation.chi2_from_rho(rho, n, subtract_from_chi2=0, corr0=None, sx=None, sy=None, nx=-1, ny=-1)

Calculate the chi2 value of a bivariate Gaussian with correlation value rho.

Calculate the no-noise chi2 value of a bivariate Gaussian with correlation rho, with respect to a bivariate Gaussian without any correlation.

Returns float:chi2 value
eskapade.analysis.correlation.rho_from_chi2(chi2, n, nx, ny, sx=None, sy=None)

Correlation coefficient of a bivariate Gaussian, derived from a chi2 value.

The chi2 value is converted into the correlation coefficient rho of a bivariate Gaussian, assuming the given binning and number of records. The correlation coefficient is between 0 and 1.

The bivariate Gaussian’s range is set to [-5,5] by construction.

Returns float:correlation coefficient
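
The chi2 value in both functions compares a binned (contingency) distribution with the one expected when the two variables are uncorrelated. The sketch below shows that comparison for an arbitrary table of counts in plain Python. It is not the library's implementation, which according to the module description evaluates a bivariate Gaussian with scipy’s mvn library; the function name chi2_vs_independence and the example tables are hypothetical.

```python
def chi2_vs_independence(table):
    """Pearson chi2 of a 2D table of counts against the no-correlation expectation."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count if rows and columns were independent
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


uncorrelated = [[10, 20], [20, 40]]   # table factorizes into its marginals
correlated = [[30, 0], [0, 60]]       # all counts on the diagonal
chi2_uncorrelated = chi2_vs_independence(uncorrelated)
chi2_correlated = chi2_vs_independence(correlated)
```

A table that factorizes exactly gives a chi2 of zero, while the diagonal table gives a large value; a larger chi2 at fixed binning and number of records corresponds to a stronger correlation.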

eskapade.analysis.datetime module

Project: Eskapade - A python-based package for data analysis.

Classes: TimePeriod, FreqTimePeriod, UniformTsTimePeriod

Created: 2017/03/14

Description:
Time period and time period with frequency.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.datetime.FreqTimePeriod(**kwargs)

Bases: eskapade.analysis.datetime.TimePeriod

Time period with frequency.

__init__(**kwargs)

Initialize TimePeriod instance.

dt_string(period_index)

Convert period index into date/time string (start of period).

Parameters:period_index (int) – specified period index value.
freq

Return frequency.

period_index(dt)

Return number of periods until date/time “dt” since 1970-01-01.

Parameters:dt – specified date/time parameter
class eskapade.analysis.datetime.TimePeriod(**kwargs)

Bases: escore.core.mixin.ArgumentsMixin

Time period.

__init__(**kwargs)

Initialize TimePeriod instance.

logger

A logger that emits log messages to an observer.

The logger can be instantiated as a module or class attribute, e.g.

>>> logger = Logger()
>>> logger.info("I'm a module logger attribute.")
>>>
>>> class Point(object):
>>>     logger = Logger()
>>>
>>>     def __init__(self, x = 0.0, y = 0.0):
>>>         Point.logger.debug('Initializing {point} with x = {x}  y = {y}', point=Point, x=x, y=y)
>>>         self._x = x
>>>         self._y = y
>>>
>>>     @property
>>>     def x(self):
>>>         self.logger.debug('Getting property x = {point._x}', point=self)
>>>         return self._x
>>>
>>>     @x.setter
>>>     def x(self, x):
>>>         self.logger.debug('Setting property x = {point._x}', point=self)
>>>         self._x = x
>>>
>>>     @property
>>>     def y(self):
>>>         self.logger.debug('Getting property y = {point._y}', point=self)
>>>         return self._y
>>>
>>>     @y.setter
>>>     def y(self, y):
>>>         self.logger.debug('Setting property y = {point._y}', point=self)
>>>         self._y = y
>>>
>>> a_point = Point(1, 2)
>>>
>>> logger.info('p_x = {point.x} p_y = {point.y}', point=a_point)
>>> logger.log_level = LogLevel.DEBUG
>>> logger.info('p_x = {point.x} p_y = {point.y}', point=a_point)

The logger uses PEP-3101 (Advanced String Formatting) with named placeholders, see <https://www.python.org/dev/peps/pep-3101/> and <https://pyformat.info/> for more details and examples.

Furthermore, logging events are only formatted and evaluated for logging levels that are enabled, so there is no need to check the logging level before logging. This also keeps disabled log calls cheap.
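
The deferred formatting described above can be illustrated with a toy logger. This is a sketch of the general idea only, not the Logger class itself: LazyLogger, Expensive and the numeric level values are hypothetical names chosen for the example.

```python
class LazyLogger:
    """Toy logger that formats a message only if its level is enabled."""

    LEVELS = {'DEBUG': 10, 'INFO': 20}

    def __init__(self, log_level='INFO'):
        self.log_level = log_level
        self.messages = []

    def log(self, level, template, **kwargs):
        if self.LEVELS[level] < self.LEVELS[self.log_level]:
            return  # disabled level: str.format() is never called
        self.messages.append(template.format(**kwargs))


class Expensive:
    """Object that counts how often it is rendered into a message."""

    renders = 0

    def __format__(self, spec):
        Expensive.renders += 1
        return 'expensive value'


lazy_logger = LazyLogger(log_level='INFO')
lazy_logger.log('DEBUG', 'value = {v}', v=Expensive())   # skipped, never rendered
lazy_logger.log('INFO', 'value = {v}', v=Expensive())    # rendered once
```

Because the template and its arguments are passed separately, the caller does not pay for string formatting when the message is dropped.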

classmethod parse_date_time(dt)

Try to parse specified date/time.

Parameters:dt – specified date/time
classmethod parse_time_period(period)

Try to parse specified time period.

Parameters:period – specified period
period_index(dt)

Get number of periods until date/time “dt”.

Parameters:dt – specified date/time
class eskapade.analysis.datetime.UniformTsTimePeriod(**kwargs)

Bases: eskapade.analysis.datetime.TimePeriod

Time period with offset.

__init__(**kwargs)

Initialize TimePeriod instance.

offset

Get offset parameter.

period

Get period parameter.

period_index(dt)

Get number of periods until date/time “dt” since “offset”, given specified “period”.

Parameters:dt – specified date/time
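
A minimal sketch of the period-index arithmetic, assuming the index is the number of whole periods elapsed since the offset (floor division). The standalone function and the example offset and period are hypothetical; the class may handle rounding and input parsing differently.

```python
from datetime import datetime, timedelta


def period_index(dt, period, offset):
    """Number of whole periods between offset and dt (negative before offset)."""
    return (dt - offset) // period


offset = datetime(2017, 1, 2)   # hypothetical start of period 0
period = timedelta(days=7)      # hypothetical period length
idx = period_index(datetime(2017, 1, 20), period, offset)
idx_before = period_index(datetime(2017, 1, 1), period, offset)
```

Here 18 days after the offset falls in the third weekly period (index 2), and a date one day before the offset gets index -1.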

eskapade.analysis.histogram module

Project: Eskapade - A python-based package for data analysis.

Classes: ValueCounts, BinningUtil, Histogram

Created: 2017/03/14

Description:
Generic 1D Histogram class.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.histogram.BinningUtil(**kwargs)

Bases: object

Helper for interpreting bin specifications.

BinningUtil is a helper class used for interpreting bin specification dictionaries. It is a base class for the Histogram class.

__init__(**kwargs)

Initialize BinningUtil instance.

A bin_specs dictionary needs to be provided as input. bin_specs is a dict containing ‘bin_width’ and ‘bin_offset’ keys. If the bin widths are not equal, bin_specs contains ‘bin_edges’ (array) instead of ‘bin_width’ and ‘bin_offset’. ‘bin_width’ and ‘bin_offset’ can be numeric or numpy timestamps.

Alternatively, bin_edges can be provided directly as input instead of bin_specs.

Example bin_specs dictionaries are:

>>> import numpy as np
>>> bin_specs = {'bin_width': 1, 'bin_offset': 0}
>>> bin_specs = {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}
>>> bin_specs = {'bin_width': np.timedelta64(30, 'D'),
...              'bin_offset': np.datetime64('2010-01-04')}
Parameters:
  • bin_specs (dict) – dictionary contains ‘bin_width’ and ‘bin_offset’ numbers or ‘bin_edges’ array. Default is None.
  • bin_edges (list) – array with numpy histogram style bin_edges. Default is None.
bin_specs

Get bin_specs dictionary.

Returns:bin_specs dictionary
Return type:dict
get_bin_center(bin_label)

Return bin center for a given bin index.

Parameters:bin_label – bin label for which to find the bin center
Returns:bin center, can be float, int, timestamp
get_bin_edges()

Return bin edges.

Returns:bin edges
Return type:array
get_bin_edges_range()

Return bin range determined from bin edges.

Returns:bin range
Return type:tuple
get_left_bin_edge(bin_label)

Return left bin edge for a given bin index.

Parameters:bin_label – bin label for which to find the left bin edge
Returns:bin edge, can be float, int, timestamp
get_right_bin_edge(bin_label)

Return right bin edge for a given bin index.

Parameters:bin_label – bin label for which to find the right bin edge.
Returns:bin edge, can be float, int, timestamp
truncated_bin_edges(variable_range=None)

Bin edges corresponding to a given variable range.

Parameters:variable_range (list) – variable range used for finding the right bin edges array. Optional.
Returns:truncated bin edges
Return type:array
value_to_bin_label(var_value, greater_equal=False)

Return bin index for given bin value.

Parameters:
  • var_value – variable value for which to find the bin index
  • greater_equal (bool) – for float, int, timestamp, return index of bin for which value falls in range [lower edge, upper edge). If set to True, return index of bin for which value falls in range [lower edge, upper edge]. Default is False.
Returns:

bin index

Return type:

int
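
The relation between a value, its bin label and the bin edges can be sketched for both kinds of bin_specs as follows. This is an illustration in plain Python under the default [lower edge, upper edge) convention; it does not cover the greater_equal option, and the standalone functions are hypothetical stand-ins for the BinningUtil methods of similar name.

```python
import math
from bisect import bisect_right


def value_to_bin_label(value, bin_specs):
    """Bin index of value; bins are treated as [lower edge, upper edge)."""
    if 'bin_edges' in bin_specs:
        # index of the last edge that is <= value
        return bisect_right(bin_specs['bin_edges'], value) - 1
    return math.floor((value - bin_specs['bin_offset']) / bin_specs['bin_width'])


def left_bin_edge(label, bin_specs):
    """Lower edge of the bin with the given index."""
    if 'bin_edges' in bin_specs:
        return bin_specs['bin_edges'][label]
    return bin_specs['bin_offset'] + label * bin_specs['bin_width']


def bin_center(label, bin_specs):
    """Midpoint between the lower edge of a bin and that of the next bin."""
    return 0.5 * (left_bin_edge(label, bin_specs) + left_bin_edge(label + 1, bin_specs))


uniform = {'bin_width': 2, 'bin_offset': 1}
variable = {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}
```

With uniform binning the edges follow from bin_offset and bin_width alone, so only the label needs to be stored; with explicit bin_edges the label is an index into the edges array.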

class eskapade.analysis.histogram.Histogram(counts, **kwargs)

Bases: eskapade.analysis.histogram.BinningUtil, escore.core.mixin.ArgumentsMixin

Generic 1D Histogram class.

Histogram holds bin labels (the name of each bin), value_counts (the values of the histogram) and a variable name. The bins can be categorical or numeric, where numeric includes timestamps. For numeric bins, bin_specs is set. bin_specs is a dict containing bin_width and bin_offset. If the bin widths are not equal, bin_specs contains bin_edges instead of bin_width and bin_offset.

__init__(counts, **kwargs)

Initialize Histogram instance.

A bin_specs dictionary can be provided as input. bin_specs is a dict containing ‘bin_width’ and ‘bin_offset’ keys. If the bin widths are not equal, bin_specs contains ‘bin_edges’ (array) instead of ‘bin_width’ and ‘bin_offset’. ‘bin_width’ and ‘bin_offset’ can be numeric or numpy timestamps.

Histogram counts can be specified as a ValueCounts object, a dictionary or a tuple:

  • tuple: Histogram((bin_values, bin_edges), variable=<your_variable_name>)
  • dict: a dictionary as returned by pandas.Series.value_counts() or pandas.DataFrame.groupby().size() over one variable.
  • ValueCounts: a ValueCounts object, which contains a value_counts dictionary.

Example bin_specs dictionaries are:

>>> import numpy as np
>>> bin_specs = {'bin_width': 1, 'bin_offset': 0}
>>> bin_specs = {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}
>>> bin_specs = {'bin_width': np.timedelta64(30, 'D'),
...              'bin_offset': np.datetime64('2010-01-04')}
Parameters:
  • counts – histogram counts
  • bin_specs (dict) – dictionary contains ‘bin_width’ and ‘bin_offset’ numbers or ‘bin_edges’ array (default is None)
  • variable (str) – name of the variable represented by the histogram
  • datatype (type) – data type of the variable represented by the histogram (optional)
bin_centers()

Return bin centers.

Returns:array of the bin centers
Return type:array
bin_edges()

Return numpy style bin_edges array with uniform binning.

Returns:array of all bin edges
Return type:array
bin_entries()

Return number of bin entries.

Return the bin counts of the known bins in the value_counts object.

Returns:array of the bin counts
Return type:array
bin_labels()

Return bin labels.

Returns:array of all bin labels
Return type:array
classmethod combine_hists(hists, labels=False, rel_bin_width_tol=1e-06, **kwargs)

Combine a set of histograms.

Parameters:
  • hists (array) – array of Histograms to add up.
  • labels (bool) – whether the histograms to add up have labels (otherwise their bins are numeric). Default is False.
  • variable (str) – name of variable described by the summed-up histogram
  • rel_bin_width_tol (float) – relative tolerance between numeric bin edges.
Returns:

summed up histogram

Return type:

Histogram
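
Combining sparse-bin histograms amounts to adding up the counts per bin label once the binnings are known to agree. The sketch below shows this on plain dictionaries. It assumes uniform bins identified by a bin width, and it assumes rel_bin_width_tol is applied as a relative tolerance on that width; the (bin_width, counts) representation and the function combine_counts are hypothetical and not the Histogram API.

```python
import math
from collections import Counter


def combine_counts(hists, rel_bin_width_tol=1e-6):
    """Sum (bin_width, counts) pairs after checking that the widths agree."""
    ref_width = hists[0][0]
    total = Counter()
    for width, counts in hists:
        if not math.isclose(width, ref_width, rel_tol=rel_bin_width_tol):
            raise ValueError('bin widths differ by more than the tolerance')
        total.update(counts)   # adds counts bin by bin
    return ref_width, dict(total)


hists = [(1.0, {0: 3, 1: 2}),
         (1.0 + 1e-9, {1: 5, 4: 1})]   # width equal within tolerance
width, summed = combine_counts(hists)
```

Bins that occur in only one of the inputs are kept as they are, which is what makes the sparse representation convenient for summing.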

copy(**kwargs)

Return a copy of this histogram.

Parameters:variable (str) – assign new variable name
datatype

Data type of the variable represented by the histogram.

Returns:data type
Return type:type
get_bin_count(bin_label)

Get bin count for specific bin label.

Parameters:bin_label – a specific key to find corresponding bin.
Returns:bin counter value
Return type:int
get_bin_labels()

Return all bin labels.

Returns:array of all bin labels
Return type:array
get_bin_range()

Return the bin range.

Returns:tuple of the bin range found
Return type:tuple
get_bin_vals(variable_range=None, combine_values=True)

Get bin labels/edges and corresponding bin counts.

Bin values corresponding to a given variable range.

Parameters:
  • variable_range (list) – variable range used for finding the right bins to get values from. Optional.
  • combine_values (bool) – if bin_specs is not set, combine existing bin labels with variable range.
Returns:

two arrays of bin values and bin edges

Return type:

array

get_hist_val(var_value)

Get bin count for bin by value of histogram variable.

Parameters:var_value – a specific value to find corresponding bin.
Returns:bin counter value
Return type:int
get_nonone_bin_centers()

Return bin centers.

Returns:array of the bin centers
Return type:array
get_nonone_bin_counts()

Return bin counts.

Returns:array of the bin counts
Return type:array
get_nonone_bin_edges()

Return numpy style bin-edges array.

Returns:array of the bin edges
Return type:array
get_nonone_bin_range()

Return the bin range.

Returns:tuple of the bin range found
Return type:tuple
get_uniform_bin_edges()

Return numpy style bin-edges array with uniform binning.

Returns:array of all bin edges
Return type:array
logger

A logger that emits log messages to an observer.

n_bins

Number of bins in the ValueCounts object.

Returns:number of bins
Return type:int
n_dim

Number of histogram dimensions.

The number of histogram dimensions, which is equal to one by construction.

Returns:number of dimensions
Return type:int
num_bins

Number of bins.

Returns:number of bins
Return type:int
remove_keys_of_inconsistent_type(prefered_key_type=None)

Remove all keys that have inconsistent data type(s).

Parameters:prefered_key_type (tuple) – the preferred key type to keep. Can be a tuple, list, or single type. E.g. str or (int, str, float). If None is provided, the most common key type found is kept.
simulate(size, *args)

Simulate data using self (Histogram instance) as PDF.

see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html

Parameters:size (int) – number of data points to generate
Return numpy.array generated_data:
 the generated data
Returns:Histogram of the generated data
Return type:Histogram
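
One simple way to draw data from a histogram is to pick a bin with probability proportional to its count and then draw uniformly inside that bin. The sketch below does this with the standard library. It is a swapped-in technique for illustration, not the scipy rv_continuous approach linked above, and it assumes the counts represent probability mass per bin; the function name simulate_from_bins and the example binning are hypothetical.

```python
import random


def simulate_from_bins(bin_edges, bin_counts, size, seed=None):
    """Draw `size` values from a piecewise-uniform PDF defined by a histogram."""
    rng = random.Random(seed)
    # pick a bin index with probability proportional to its count
    picks = rng.choices(range(len(bin_counts)), weights=bin_counts, k=size)
    # draw uniformly between the edges of each picked bin
    return [rng.uniform(bin_edges[i], bin_edges[i + 1]) for i in picks]


bin_edges = [0.0, 1.0, 2.0, 4.0]
bin_counts = [5, 0, 15]          # the middle bin is empty
data = simulate_from_bins(bin_edges, bin_counts, size=1000, seed=42)
```

No generated value falls strictly inside the empty bin, and refilling a histogram with the same binning reproduces the original shape up to statistical fluctuations.
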
surface()

Calculate surface of the histogram.

Returns:surface
to_normalized(**kwargs)

Return a normalized copy of this histogram.

Parameters:
  • new_var_name (str) – assign new variable name
  • variable_range (list) – variable range used for finding the right bins to get values from.
  • combine_values (bool) – if bin_specs is not set, combine existing bin labels with variable range.
variable

Name of variable represented by the histogram.

Returns:variable name
Return type:string
class eskapade.analysis.histogram.ValueCounts(key, subkey=None, counts=None, sel=None)

Bases: object

A dictionary of value counts.

The dictionary of value counts is the output of pandas.Series.value_counts() for one variable, or of pandas.DataFrame.groupby().size() performed over one or multiple variables.

__init__(key, subkey=None, counts=None, sel=None)

Initialize ValueCounts instance.

Parameters:
  • key (list) – key is a tuple, list or string of (the) variable name(s), matching those and the structure of the keys in the value_counts dictionary.
  • subkey (list) – subset of key. If provided, the value_counts dictionary will be projected from key onto the (subset of) subkey. E.g. use this to map a two dimensional value_counts dictionary onto one specified dimension. Default is None. Optional.
  • counts (dict) – the value_counts dictionary.
  • sel (dict) – Apply selections to value_counts dictionary. Default is {}. Optional.
count(value_bin)

Get the bin count for a specific bin key.

Parameters:value_bin (tuple) – a specific bin key; can be a list or tuple.
Returns:specific bin counter value
Return type:int
counts

Process and return value-counts dictionary.

Returns:after processing, returns the value_counts dictionary
Return type:dict
create_sub_counts(subkey, sel=None)

Project existing value counts onto a subset of keys.

E.g. map variables x,y onto single dimension x, so for each bin in x integrate over y.

Parameters:
  • subkey (tuple) – input sub-key, is a tuple, list, or string. This is the new key of variables for the returned ValueCounts object.
  • sel (dict) – dictionary with selection. Optional.
Returns:

value_counts object where subkey has become the new key.

Return type:

ValueCounts
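
The projection can be pictured as summing the counts of all bins that share the same values for the sub-key variables. The following sketch works on a plain dictionary keyed by tuples. The function project_counts is hypothetical, and the format assumed for sel (variable name mapped to a list of accepted values) is an assumption made for the example, not taken from the class.

```python
from collections import Counter


def project_counts(counts, key, subkey, sel=None):
    """Sum counts keyed by tuples of `key` values onto the variables in `subkey`."""
    sel = sel or {}
    positions = [key.index(name) for name in subkey]
    projected = Counter()
    for bin_key, n in counts.items():
        # skip bins that fail the selection, e.g. sel={'y': ['a']}
        if any(bin_key[key.index(name)] not in allowed
               for name, allowed in sel.items()):
            continue
        projected[tuple(bin_key[p] for p in positions)] += n
    return dict(projected)


counts = {(1, 'a'): 4, (1, 'b'): 6, (2, 'a'): 3}   # keyed by (x, y)
on_x = project_counts(counts, ('x', 'y'), ('x',))
on_x_selected = project_counts(counts, ('x', 'y'), ('x',), sel={'y': ['a']})
```

Projecting onto x integrates over y, so the two bins with x = 1 are merged into one.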

get_values(val_keys=())

Get all key-values of a subset of keys.

E.g. give all x values of the keys, when the value_counts object has keys (x, y).

Parameters:val_keys (tuple) – a specific sub-key to get key values for.
Returns:all key-values of a subset of keys.
Return type:tuple
key

Process and return current value-counts key.

Returns:the key
Return type:tuple
nononecounts

Return value-counts dictionary without None keys.

Returns:after processing, returns the value_counts dictionary without None keys
Return type:dict
num_bins

Number of value-counts bins.

Returns:number of bins
Return type:int
num_nonone_bins

Number of not-none value-counts bins.

Returns:number of not-none bins
Return type:int
process_counts(accept_equiv=True)

Project value counts onto the existing subset of keys.

E.g. map variables x,y onto single dimension x, so for each bin in x integrate over y.

Parameters:accept_equiv (bool) – accept equivalence of key and subkey if subkey is in a different order than key. Default is True.
Returns:successful projection or not
Return type:bool
remove_keys_of_inconsistent_type(prefered_key_type=None)

Remove keys with inconsistent data type(s).

Parameters:prefered_key_type (tuple) – the preferred key type to keep. Can be a tuple, list, or single type. E.g. str or (int, str, float). If None is provided, the most common key type found is kept.
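
A minimal sketch of this kind of cleaning on a plain counts dictionary. It assumes that "most common" is judged by the number of keys of each type rather than by their counts, which is an assumption; the standalone function is a hypothetical stand-in for the method.

```python
from collections import Counter


def keep_keys_of_type(counts, preferred_key_type=None):
    """Keep only the keys whose type matches the preferred (or most common) type."""
    if preferred_key_type is None:
        # most common key type, judged by the number of keys of each type
        preferred_key_type = Counter(type(k) for k in counts).most_common(1)[0][0]
    elif isinstance(preferred_key_type, list):
        preferred_key_type = tuple(preferred_key_type)   # isinstance needs a tuple
    return {k: n for k, n in counts.items() if isinstance(k, preferred_key_type)}


counts = {1: 10, 2: 7, 'n/a': 1, 3: 4}
cleaned = keep_keys_of_type(counts)
only_str = keep_keys_of_type(counts, str)
```

Here the stray string key is dropped by default because integer keys are in the majority.
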
skey

Current value-counts subkey.

Returns:the subkey
Return type:tuple
sum_counts

Sum of counts of all value-counts bins.

Returns:the sum of counts of all bins
Return type:float
sum_nonone_counts

Sum of not-none counts of all value-counts bins.

Returns:the sum of not-none counts of all bins
Return type:float

eskapade.analysis.histogram_filling module

Project: Eskapade - A python-based package for data analysis.

Class: HistogramFillerBase

Created: 2017/03/21

Description:
Algorithm to fill histogrammar sparse-bin histograms. It is possible to do cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.histogram_filling.HistogramFillerBase(**kwargs)

Bases: escore.core.element.Link

Base class link to fill histograms.

It is possible to do after-filling cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied. Final histograms are stored in the datastore.

__init__(**kwargs)

Initialize link instance.

Store and do basic check on the attributes of link HistogramFillerBase.

Parameters:
  • name (str) – name of link
  • read_key (str) – key of input data to read from data store
  • store_key (str) – key of output data to store histograms in data store
  • columns (list) – columns to pick up from input data. (default is all columns)
  • bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns

Example bin_specs dictionary is:

>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
...              'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}}
Parameters:
  • var_dtype (dict) – dict of datatypes of the columns to study from dataframe. If not provided, try to determine datatypes directly from dataframe.
  • store_at_finalize (bool) – Store histograms in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
  • drop_keys (dict) – dictionary used for dropping specific keys from bins dictionaries of histograms

Example drop_keys dictionary is:

>>> drop_keys = {'x': [1, 4, 8, 19],
...              'y': ['apple', 'pear', 'tomato'],
...              'x:y': [(1, 'apple'), (19, 'tomato')]}
assert_dataframe(df)

Check that input data is a filled pandas data frame.

Parameters:df – input (pandas) data frame
categorize_columns(df)

Categorize columns of dataframe by data type.

Parameters:df – input (pandas) data frame
drop_requested_keys(name, counts)

Drop requested keys from counts dictionary.

Parameters:
  • name (string) – key of drop_keys dict to get array of keys to be dropped
  • counts (dict) – counts dictionary to drop specific keys from
Returns:

count dict without dropped keys
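
In essence, dropping keys is a dictionary filter. The sketch below uses the drop_keys layout from the example above. Passing drop_keys as an explicit third argument is an assumption made to keep the function self-contained; the link method takes only name and counts.

```python
def drop_requested_keys(name, counts, drop_keys):
    """Return a copy of counts without the keys listed under drop_keys[name]."""
    to_drop = set(drop_keys.get(name, []))
    return {k: v for k, v in counts.items() if k not in to_drop}


drop_keys = {'x': [1, 4, 8, 19],
             'x:y': [(1, 'apple'), (19, 'tomato')]}
kept_x = drop_requested_keys('x', {1: 5, 2: 3, 19: 1}, drop_keys)
kept_xy = drop_requested_keys('x:y', {(1, 'apple'): 2, (2, 'pear'): 1}, drop_keys)
```

A histogram name without an entry in drop_keys is returned unchanged.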

execute()

Execute the link.

execute() does four things:

  • check presence and data type of requested columns
  • timestamp variables are converted to nanosec (integers)
  • do the actual value counting based on categories and created indices
  • then convert to histograms and add to datastore
fill_histogram(idf, c)

Fill input histogram with column(s) of input dataframe.

Parameters:
  • idf – input data frame used for filling histogram
  • c (list) – histogram column(s)
finalize()

Finalize the link.

Store Histograms here, if requested.

get_all_columns(data)

Retrieve all columns / keys from input data.

Parameters:data – input data sample (pandas dataframe or dict)
Returns:list of columns
Return type:list
get_data_type(df, col)

Get data type of dataframe column.

Parameters:
  • df – input data frame
  • col (str) – column
initialize()

Initialize the link.

process_and_store()

Store (and possibly process) histogram objects.

process_columns(df)

Process columns before histogram filling.

Specifically, convert timestamp columns to integers.

Parameters:df – input (pandas) data frame
Returns:output (pandas) data frame with converted timestamp columns
Return type:pandas DataFrame
var_bin_specs(c, idx=0)

Determine bin_specs to use for variable c.

Parameters:
  • c (list) – list of variables, or string variable
  • idx (int) – index of the variable in c, for which to return the bin specs. default is 0.
Returns:

selected bin_specs of variable

eskapade.analysis.histogram_filling.only_bool(val)

Pass input value or array only if it is a bool.

Parameters:val – value to be evaluated
Returns:evaluated value
Return type:np.bool or np.ndarray
eskapade.analysis.histogram_filling.only_float(val)

Pass input value or array only if it is a float.

Parameters:val – value to be evaluated
Returns:evaluated value
Return type:np.float64 or np.ndarray
eskapade.analysis.histogram_filling.only_int(val)

Pass input value or array only if it is an integer.

Parameters:val – value to be evaluated
Returns:evaluated value
Return type:np.int64 or np.ndarray
eskapade.analysis.histogram_filling.only_str(val)

Pass input value or array only if it is a string.

Parameters:val – value to be evaluated
Returns:evaluated value
Return type:str or np.ndarray
eskapade.analysis.histogram_filling.to_ns(x)

Convert input timestamps to nanoseconds (integers).

Parameters:x – value to be converted
Returns:converted value
Return type:int
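
For reference, the same conversion can be sketched for standard-library datetime objects. This is an illustration only: the real function presumably operates on pandas or numpy timestamps, and treating naive datetimes as UTC is an assumption of this sketch. The name datetime_to_ns is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_ns(dt):
    """Nanoseconds since 1970-01-01 UTC as an integer (naive input is taken as UTC)."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # integer division avoids float rounding; datetime resolves microseconds only
    return ((dt - EPOCH) // timedelta(microseconds=1)) * 1000


ns = datetime_to_ns(datetime(2010, 1, 4))
```

Working with integer nanoseconds lets timestamp columns be binned with the same bin_width and bin_offset arithmetic as numeric columns.
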
eskapade.analysis.histogram_filling.to_str(val)

Convert input to (array of) string(s).

Parameters:val – value to be converted
Returns:converted value
Return type:str or np.ndarray
eskapade.analysis.histogram_filling.value_to_bin_center(val, **kwargs)

Convert value to bin center.

Convert a numeric or timestamp column to a common bin center value.

Parameters:
  • bin_width – bin_width value needed to convert column to a common bin center value
  • bin_offset – bin_offset value needed to convert column to a common bin center value
eskapade.analysis.histogram_filling.value_to_bin_index(val, **kwargs)

Convert value to bin index.

Convert a numeric or timestamp column to an integer bin index.

Parameters:
  • bin_width – bin_width value needed to convert column to an integer bin index
  • bin_offset – bin_offset value needed to convert column to an integer bin index
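
The two conversions, and the way a bin index leads to sparse-bin counts, can be sketched as follows. The explicit bin_width and bin_offset arguments replace the **kwargs of the documented functions, and the floor convention is an assumption; treat the code as an illustration rather than the module's implementation.

```python
import math
from collections import Counter


def value_to_bin_index(val, bin_width, bin_offset=0):
    """Integer index of the uniform bin that contains val."""
    return math.floor((val - bin_offset) / bin_width)


def value_to_bin_center(val, bin_width, bin_offset=0):
    """Center of the uniform bin that contains val."""
    idx = value_to_bin_index(val, bin_width, bin_offset)
    return bin_offset + (idx + 0.5) * bin_width


values = [0.2, 0.7, 1.1, 3.9, 3.5]
# sparse-bin counts: only bins that actually contain values get an entry
sparse_counts = Counter(value_to_bin_index(v, bin_width=1.0) for v in values)
```

Bin 2 has no entry in sparse_counts because no value falls into it, which is the point of a sparse-bin histogram.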

eskapade.analysis.statistics module

Project: Eskapade - A python-based package for data analysis.

Classes: ArrayStats, GroupByStats

Created: 2017/03/21

Description:
Summary of an array.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.statistics.ArrayStats(data, col_name, weights=None, unit='', label='')

Bases: object

Create summary of an array.

Class to calculate statistics (mean, standard deviation, percentiles, etc.) and create a histogram of values in an array. The statistics can be returned as values in a dictionary, a printable string, or as a LaTeX string.

__init__(data, col_name, weights=None, unit='', label='')

Initialize for a single column in data frame.

Parameters:
  • data ((keys of) dict) – Input array
  • col_name – column name
  • weights (string (column of data)) – Input array (default None)
  • unit – Unit of column
  • label (str) – Label to describe column variable
Raises:

TypeError

create_mpv_stat()

Compute most probable value from histogram.

This function computes the most probable value based on the histogram from make_histogram(), and adds it to the statistics.
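
One straightforward definition of the most probable value is the center of the fullest bin. The sketch below uses that definition as an assumption; the method may determine the value differently, and the function name and example histogram are hypothetical.

```python
def most_probable_value(bin_edges, bin_counts):
    """Center of the fullest bin (the first one in case of ties)."""
    i = max(range(len(bin_counts)), key=bin_counts.__getitem__)
    return 0.5 * (bin_edges[i] + bin_edges[i + 1])


mpv = most_probable_value([0, 10, 20, 30], [3, 8, 5])
```

With this definition the result depends on the binning, which is why it is computed from the histogram rather than from the raw values.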

create_stats()

Compute statistical properties of column variable.

This function computes the statistical properties of values in the specified column. It is called by other functions that use the resulting figures to create a statistical overview.

get_col_props()

Get column properties.

Returns dict:Column properties
get_latex_table(get_stats=None, latex=True)

Get LaTeX code string for table of stats values.

Parameters:
  • get_stats (list) – List of statistics that you want to filter on. (default None (all stats)) Available stats are: ‘count’, ‘filled’, ‘distinct’, ‘mean’, ‘std’, ‘min’, ‘max’, ‘p05’, ‘p16’, ‘p50’, ‘p84’, ‘p95’, ‘p99’
  • latex (bool) – LaTeX output or list output (default True)
Returns str:

LaTeX code snippet

get_print_stats(to_output=False)

Get statistics in printable form.

Parameters:to_output (bool) – Print statistics to output stream?
Returns str:Printable statistics string
get_x_label()

Get x label.

logger

A logger that emits log messages to an observer.

make_histogram(var_bins=30, var_range=None, bin_edges=None, create_mpv_stat=True)

Create histogram of column values.

Parameters:
  • var_bins (int) – Number of histogram bins
  • var_range (tuple) – Range of histogram variable
  • bin_edges (list) – predefined bin edges to use for histogram. Overrules var_bins.
class eskapade.analysis.statistics.GroupByStats(data, col_name, groupby=None, weights=None, unit='', label='')

Bases: eskapade.analysis.statistics.ArrayStats

Create summary of an array in groups.

__init__(data, col_name, groupby=None, weights=None, unit='', label='')

Initialize for a single column in dataframe.

Parameters:
  • data ((keys of) dict) – Input array
  • col_name – column name
  • weights (string (column of data)) – Input array (default None)
  • unit – Unit of column
  • label (str) – Label to describe column variable
  • groupby – column name
Raises:

TypeError

get_latex_table(get_stats=None)

Get LaTeX code string for group-by table of stats values.

Parameters:get_stats (list) – same as ArrayStats.get_latex_table get_stats key word.
Returns str:LaTeX code snippet
eskapade.analysis.statistics.get_col_props(var_type)

Get column properties.

Returns dict:Column properties
eskapade.analysis.statistics.weighted_quantile(data, weights=None, probability=0.5)

Compute the weighted quantile of a 1D numpy array.

Weighted quantiles, inspired by https://github.com/nudomarinero/wquantiles/blob/master/wquantiles.py, written by Jose Sabater. Updated here to return multiple quantiles in one go. It now also works when weights is None.

Parameters:
  • data (ndarray) – input array (one dimension).
  • weights (ndarray) – array with the weights of the same size of data.
  • probability (ndarray) – array of quantiles to compute. Each probability must have a value between 0 and 1.
Returns:

list of the output value(s).
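
A pure-Python sketch of one common way to compute weighted quantiles: sort the data, place each point at the midpoint of its share of the cumulative weight, and interpolate linearly. This interpolation scheme is an assumption chosen for illustration, and eskapade's function may differ in details. The name weighted_quantile_sketch is hypothetical.

```python
def weighted_quantile_sketch(data, weights=None, probability=0.5):
    """Weighted quantile(s) of a 1D sequence, linearly interpolated."""
    if weights is None:
        weights = [1.0] * len(data)
    pairs = sorted(zip(data, weights))
    values = [v for v, _ in pairs]
    total = float(sum(w for _, w in pairs))
    # cumulative probability at the midpoint of each sorted weight
    cum, positions = 0.0, []
    for _, w in pairs:
        cum += w
        positions.append((cum - 0.5 * w) / total)

    def interpolate(p):
        if p <= positions[0]:
            return values[0]
        if p >= positions[-1]:
            return values[-1]
        for k in range(1, len(positions)):
            if p <= positions[k]:
                span = positions[k] - positions[k - 1]
                frac = (p - positions[k - 1]) / span if span > 0 else 0.0
                return values[k - 1] + frac * (values[k] - values[k - 1])

    probs = probability if isinstance(probability, (list, tuple)) else [probability]
    return [interpolate(p) for p in probs]


median = weighted_quantile_sketch([1, 2, 3, 4, 5])
weighted = weighted_quantile_sketch([1, 2, 3], weights=[1, 1, 2],
                                    probability=[0.125, 0.75])
```

As in the documented function, a single probability and a list of probabilities both yield a list of output values.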

Module contents