eskapade.analysis package¶

Subpackages¶

eskapade.analysis.links package

Submodules¶

eskapade.analysis.correlation module¶

Project: Eskapade - A python-based package for data analysis.

Created: 2018/06/23

Description:

Correlation related util functions.

Convert a Pearson correlation value into a chi2 value of a contingency test matrix of a bivariate Gaussian, and vice versa. The calculation uses scipy’s mvn library. Calculates correlation coefficients based on the mutual_information, correlation_ratio, pearson, kendall or spearman methods.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

eskapade.analysis.correlation.calculate_correlations(df, method)¶

Calculates correlation coefficients between every column pair.

Parameters:

df (pd.DataFrame) – input data frame
method (str) – one of mutual_information, correlation_ratio, pearson, kendall, spearman, phik or significance
Returns:	pd.DataFrame
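
As a rough illustration of what "between every column pair" means, the sketch below computes a Pearson coefficient for each pair of columns held in a plain dict. It is not Eskapade's implementation: `pearson` and `pairwise_correlations` are hypothetical helper names, only the pearson method is covered, and the result is a dict rather than a pd.DataFrame.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Return {(col_a, col_b): rho} for every unordered pair of columns."""
    return {(a, b): pearson(columns[a], columns[b])
            for a, b in combinations(columns, 2)}


# Hypothetical toy data: y rises with x, z falls with x.
columns = {'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [4, 3, 2, 1]}
correlations = pairwise_correlations(columns)
```

With three columns this yields three pairs; a full correlation matrix would mirror these values and put 1 on the diagonal.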

eskapade.analysis.correlation.chi2_from_rho(rho, n, subtract_from_chi2=0, corr0=None, sx=None, sy=None, nx=-1, ny=-1)¶

Calculate the chi2 value of a bivariate Gaussian with correlation value rho.

Calculate the no-noise chi2 value of a bivariate Gaussian with correlation rho, with respect to a bivariate Gaussian without any correlation.

Returns float:	chi2 value

eskapade.analysis.correlation.rho_from_chi2(chi2, n, nx, ny, sx=None, sy=None)¶

Correlation coefficient of a bivariate Gaussian derived from a chi2 value.

The chi2 value is converted into the correlation coefficient rho of a bivariate Gaussian, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Returns float:	correlation coefficient

eskapade.analysis.datetime module¶

Project: Eskapade - A python-based package for data analysis.

Classes: TimePeriod, FreqTimePeriod, UniformTsTimePeriod

Created: 2017/03/14

Description: Time period and time period with frequency.
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.datetime.FreqTimePeriod(**kwargs)¶

Bases: eskapade.analysis.datetime.TimePeriod

Time period with frequency.

__init__(**kwargs)¶: Initialize TimePeriod instance.

dt_string(period_index)¶

Convert period index into date/time string (start of period).

Parameters:	period_index (int) – specified period index value.

freq¶: Return frequency.

period_index(dt)¶

Return number of periods until date/time “dt” since 1970-01-01.

Parameters:	dt – specified date/time parameter

class eskapade.analysis.datetime.TimePeriod(**kwargs)¶

Bases: escore.core.mixin.ArgumentsMixin

Time period.

__init__(**kwargs)¶: Initialize TimePeriod instance.

logger¶

A logger that emits log messages to an observer.

The logger can be instantiated as a module or class attribute, e.g.

>>> logger = Logger()
>>> logger.info("I'm a module logger attribute.")
>>>
>>> class Point(object):
...     logger = Logger()
...
...     def __init__(self, x=0.0, y=0.0):
...         Point.logger.debug('Initializing {point} with x = {x}  y = {y}', point=Point, x=x, y=y)
...         self._x = x
...         self._y = y
...
...     @property
...     def x(self):
...         self.logger.debug('Getting property x = {point._x}', point=self)
...         return self._x
...
...     @x.setter
...     def x(self, x):
...         self.logger.debug('Setting property x = {point._x}', point=self)
...         self._x = x
...
...     @property
...     def y(self):
...         self.logger.debug('Getting property y = {point._y}', point=self)
...         return self._y
...
...     @y.setter
...     def y(self, y):
...         self.logger.debug('Setting property y = {point._y}', point=self)
...         self._y = y
...
>>> a_point = Point(1, 2)
>>>
>>> logger.info('p_x = {point.x} p_y = {point.y}', point=a_point)
>>> logger.log_level = LogLevel.DEBUG
>>> logger.info('p_x = {point.x} p_y = {point.y}', point=a_point)

The logger uses PEP-3101 (Advanced String Formatting) with named placeholders, see <https://www.python.org/dev/peps/pep-3101/> and <https://pyformat.info/> for more details and examples.

Furthermore, logging events are only formatted and evaluated for logging levels that are enabled. So, there’s no need to check the logging level before logging. It’s also efficient.

classmethod parse_date_time(dt)¶

Try to parse specified date/time.

Parameters:	dt – specified date/time

classmethod parse_time_period(period)¶

Try to parse specified time period.

Parameters:	period – specified period

period_index(dt)¶

Get number of periods until date/time “dt”.

Parameters:	dt – specified date/time

class eskapade.analysis.datetime.UniformTsTimePeriod(**kwargs)¶

Bases: eskapade.analysis.datetime.TimePeriod

Time period with offset.

__init__(**kwargs)¶: Initialize TimePeriod instance.

offset¶: Get offset parameter.

period¶: Get period parameter.

period_index(dt)¶

Get number of periods until date/time “dt” since “offset”, given specified “period”.

Parameters:	dt – specified date/time
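
The idea of counting periods since an offset can be sketched with the standard library alone. This is only an assumption about the arithmetic involved (floor division of the elapsed time by the period); `uniform_period_index` is a hypothetical name, and how the real class parses or rounds its inputs is not shown in this reference.

```python
from datetime import datetime, timedelta


def uniform_period_index(dt, offset, period):
    """Number of whole periods between offset and dt (floor division)."""
    # timedelta // timedelta returns an int and floors towards minus infinity,
    # so dates before the offset get negative indices.
    return (dt - offset) // period


offset = datetime(2010, 1, 4)
period = timedelta(days=7)
index = uniform_period_index(datetime(2010, 1, 20), offset, period)
index_before = uniform_period_index(datetime(2010, 1, 3), offset, period)
```

Here 16 days have elapsed since the offset, which is two full weekly periods; the day before the offset falls in period -1.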

eskapade.analysis.histogram module¶

Project: Eskapade - A python-based package for data analysis.

Classes: ValueCounts, BinningUtil, Histogram

Created: 2017/03/14

Description: Generic 1D Histogram class.
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.histogram.BinningUtil(**kwargs)¶

Bases: object

Helper for interpreting bin specifications.

BinningUtil is a helper class used for interpreting bin specification dictionaries. It is a base class for the Histogram class.

__init__(**kwargs)¶

Initialize BinningUtil instance.

A bin_specs dictionary needs to be provided as input. bin_specs is a dict containing ‘bin_width’ and ‘bin_offset’ keys. If the bin widths are not equal, bin_specs contains ‘bin_edges’ (array) instead of ‘bin_width’ and ‘bin_offset’. ‘bin_width’ and ‘bin_offset’ can be numeric or numpy timestamps.

Alternatively, bin_edges can be provided as input to bin_specs.

Example bin_specs dictionaries are:

>>> import numpy as np
>>> bin_specs = {'bin_width': 1, 'bin_offset': 0}
>>> bin_specs = {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}
>>> bin_specs = {'bin_width': np.timedelta64(30, 'D'),
...              'bin_offset': np.datetime64('2010-01-04')}

Parameters:

bin_specs (dict) – dictionary contains ‘bin_width’ and ‘bin_offset’ numbers or ‘bin_edges’ array. Default is None.
bin_edges (list) – array with numpy histogram style bin_edges. Default is None.

bin_specs¶

Get bin_specs dictionary.

Returns:	bin_specs dictionary
Return type:	dict

get_bin_center(bin_label)¶

Return bin center for a given bin index.

Parameters:	bin_label – bin label for which to find the bin center
Returns:	bin center, can be float, int, timestamp

get_bin_edges()¶

Return bin edges.

Returns:	bin edges
Return type:	array

get_bin_edges_range()¶

Return bin range determined from bin edges.

Returns:	bin range
Return type:	tuple

get_left_bin_edge(bin_label)¶

Return left bin edge for a given bin index.

Parameters:	bin_label – bin label for which to find the left bin edge
Returns:	bin edge, can be float, int, timestamp

get_right_bin_edge(bin_label)¶

Return right bin edge for a given bin index.

Parameters:	bin_label – bin label for which to find the right bin edge.
Returns:	bin edge, can be float, int, timestamp

truncated_bin_edges(variable_range=None)¶

Bin edges corresponding to a given variable range.

Parameters:	variable_range (list) – variable range used for finding the right bin edges array. Optional.
Returns:	truncated bin edges
Return type:	array

value_to_bin_label(var_value, greater_equal=False)¶

Return bin index for given bin value.

Parameters:

var_value – variable value for which to find the bin index
greater_equal (bool) – for float, int, timestamp, return index of bin for which value falls in range [lower edge, upper edge). If set to True, return index of bin for which value falls in range [lower edge, upper edge]. Default is False.
Returns:	bin index
Return type:	int
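
For non-uniform binning given as ‘bin_edges’, the default half-open lookup [lower edge, upper edge) can be sketched with a binary search. This is an illustrative assumption rather than the class's own code: `edges_bin_index` is a hypothetical name, the greater_equal option is not covered, and raising on out-of-range values is a choice made for the sketch.

```python
from bisect import bisect_right


def edges_bin_index(value, bin_edges):
    """Index i such that bin_edges[i] <= value < bin_edges[i + 1]."""
    if not bin_edges[0] <= value < bin_edges[-1]:
        raise ValueError('value lies outside the bin edges')
    # bisect_right counts the edges that are <= value; the bin starts at the last of them.
    return bisect_right(bin_edges, value) - 1


bin_edges = [0, 2, 3, 4, 5, 7, 8]
index_on_edge = edges_bin_index(2, bin_edges)     # an edge belongs to the bin on its right
index_inside = edges_bin_index(6.5, bin_edges)    # falls in the bin [5, 7)
```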

class eskapade.analysis.histogram.Histogram(counts, **kwargs)¶

Bases: eskapade.analysis.histogram.BinningUtil, escore.core.mixin.ArgumentsMixin

Generic 1D Histogram class.

Histogram holds bin labels (name of each bin), value_counts (values of the histogram) and a variable name. The bins can be categoric or numeric, where numeric includes timestamps. In case of numeric bins, bin_specs is set. bin_specs is a dict containing bin_width and bin_offset. If the bin widths are not equal, bin_specs contains bin_edges instead of bin_width and bin_offset.

__init__(counts, **kwargs)¶

Initialize Histogram instance.

A bin_specs dictionary can be provided as input. bin_specs is a dict containing ‘bin_width’ and ‘bin_offset’ keys. If the bin widths are not equal, bin_specs contains ‘bin_edges’ (array) instead of ‘bin_width’ and ‘bin_offset’. ‘bin_width’ and ‘bin_offset’ can be numeric or numpy timestamps.

Histogram counts can be specified as a ValueCounts object, a dictionary or a tuple:

tuple: Histogram((bin_values, bin_edges), variable=<your_variable_name>)
dict: a dictionary as returned by pandas.Series.value_counts() or pandas.DataFrame.groupby(...).size() over one variable.
ValueCounts: a ValueCounts object, which contains a value_counts dictionary.

Example bin_specs dictionaries are:

>>> import numpy as np
>>> bin_specs = {'bin_width': 1, 'bin_offset': 0}
>>> bin_specs = {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}
>>> bin_specs = {'bin_width': np.timedelta64(30, 'D'),
...              'bin_offset': np.datetime64('2010-01-04')}

Parameters:

counts – histogram counts
bin_specs (dict) – dictionary contains ‘bin_width’ and ‘bin_offset’ numbers or ‘bin_edges’ array (default is None)
variable (str) – name of the variable represented by the histogram
datatype (type) – data type of the variable represented by the histogram (optional)

bin_centers()¶

Return bin centers.

Returns:	array of the bin centers
Return type:	array

bin_edges()¶

Return numpy style bin_edges array with uniform binning.

Returns:	array of all bin edges
Return type:	array

bin_entries()¶

Return number of bin entries.

Return the bin counts of the known bins in the value_counts object.

Returns:	array of the bin counts
Return type:	array

bin_labels()¶

Return bin labels.

Returns:	array of all bin labels
Return type:	array

classmethod combine_hists(hists, labels=False, rel_bin_width_tol=1e-06, **kwargs)¶

Combine a set of histograms.

Parameters:

hists (array) – array of Histograms to add up.
labels (bool) – whether the histograms to add up have labels (otherwise they are numeric). Default is False.
variable (str) – name of variable described by the summed-up histogram
rel_bin_width_tol (float) – relative tolerance between numeric bin edges.
Returns:	summed up histogram
Return type:	Histogram
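
For histograms with labelled (categorical) bins, adding up amounts to summing the counts per bin label. The sketch below shows that idea on plain dictionaries; `combine_counts` is a hypothetical helper, and the matching of numeric bin edges within rel_bin_width_tol is not attempted here.

```python
from collections import Counter


def combine_counts(count_dicts):
    """Sum several {bin_label: count} dictionaries label by label."""
    total = Counter()
    for counts in count_dicts:
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return dict(total)


combined = combine_counts([
    {'apple': 2, 'pear': 1},
    {'apple': 1, 'tomato': 4},
])
```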

copy(**kwargs)¶

Return a copy of this histogram.

Parameters:	variable (str) – assign new variable name

datatype¶

Data type of the variable represented by the histogram.

Returns:	data type
Return type:	type

get_bin_count(bin_label)¶

Get bin count for specific bin label.

Parameters:	bin_label – a specific key to find corresponding bin.
Returns:	bin counter value
Return type:	int

get_bin_labels()¶

Return all bin labels.

Returns:	array of all bin labels
Return type:	array

get_bin_range()¶

Return the bin range.

Returns:	tuple of the bin range found
Return type:	tuple

get_bin_vals(variable_range=None, combine_values=True)¶

Get bin labels/edges and corresponding bin counts.

Bin values corresponding to a given variable range.

Parameters:

variable_range (list) – variable range used for finding the right bins to get values from. Optional.
combine_values (bool) – if bin_specs is not set, combine existing bin labels with variable range.
Returns:	two arrays of bin values and bin edges
Return type:	array

get_hist_val(var_value)¶

Get bin count for bin by value of histogram variable.

Parameters:	var_value – a specific value to find corresponding bin.
Returns:	bin counter value
Return type:	int

get_nonone_bin_centers()¶

Return bin centers.

Returns:	array of the bin centers
Return type:	array

get_nonone_bin_counts()¶

Return bin counts.

Returns:	array of the bin counts
Return type:	array

get_nonone_bin_edges()¶

Return numpy style bin-edges array.

Returns:	array of the bin edges
Return type:	array

get_nonone_bin_range()¶

Return the bin range.

Returns:	tuple of the bin range found
Return type:	tuple

get_uniform_bin_edges()¶

Return numpy style bin-edges array with uniform binning.

Returns:	array of all bin edges
Return type:	array

logger¶

A logger that emits log messages to an observer.

n_bins¶

Number of bins in the ValueCounts object.

Returns:	number of bins
Return type:	int

n_dim¶

Number of histogram dimensions.

The number of histogram dimensions, which is equal to one by construction.

Returns:	number of dimensions
Return type:	int

num_bins¶

Number of bins.

Returns:	number of bins
Return type:	int

remove_keys_of_inconsistent_type(prefered_key_type=None)¶

Remove all keys that have inconsistent data type(s).

Parameters:	prefered_key_type (tuple) – the preferred key type to keep. Can be a tuple, list, or single type. E.g. str or (int, str, float). If None is provided, the most common key type found is kept.

simulate(size, *args)¶

Simulate data using self (Histogram instance) as PDF.

see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html

Parameters:	size (int) – number of data points to generate
Returns:	generated_data (numpy.array) – the generated data; Histogram of the generated data
Return type:	Histogram
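
Treating a histogram as a PDF can be sketched without scipy: pick a bin with probability proportional to its count, then draw uniformly inside that bin. This is a stand-in technique using the standard library's random module, not the rv_continuous-based approach referenced above; `simulate_from_histogram` and its arguments are hypothetical.

```python
import random


def simulate_from_histogram(bin_edges, bin_counts, size, seed=None):
    """Draw `size` values distributed according to a 1D histogram."""
    rng = random.Random(seed)
    # Step 1: choose bins, weighted by their counts (empty bins are never chosen).
    chosen = rng.choices(range(len(bin_counts)), weights=bin_counts, k=size)
    # Step 2: spread each draw uniformly over the width of its bin.
    return [rng.uniform(bin_edges[i], bin_edges[i + 1]) for i in chosen]


samples = simulate_from_histogram([0, 1, 2, 4], [5, 0, 10], size=1000, seed=42)
```

Because the middle bin has a count of zero, no sample falls strictly between 1 and 2.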

surface()¶

Calculate surface of the histogram.

Returns:	surface
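
One natural reading of "surface" is the area under the histogram, i.e. the sum of bin count times bin width. The reference does not state the formula, so the sketch below is an assumption; `histogram_surface` is a hypothetical name.

```python
def histogram_surface(bin_edges, bin_counts):
    """Area under a 1D histogram: sum over bins of count * width."""
    widths = [high - low for low, high in zip(bin_edges, bin_edges[1:])]
    return sum(count * width for count, width in zip(bin_counts, widths))


# 5 * 1 + 0 * 1 + 10 * 2
surface = histogram_surface([0, 1, 2, 4], [5, 0, 10])
```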

to_normalized(**kwargs)¶

Return a normalized copy of this histogram.

Parameters:

new_var_name (str) – assign new variable name
variable_range (list) – variable range used for finding the right bins to get values from.
combine_values (bool) – if bin_specs is not set, combine existing bin labels with variable range.

variable¶

Name of variable represented by the histogram.

Returns:	variable name
Return type:	string

class eskapade.analysis.histogram.ValueCounts(key, subkey=None, counts=None, sel=None)¶

Bases: object

A dictionary of value counts.

The dictionary of value counts is the output of pandas.Series.value_counts() for one variable, or of pandas.DataFrame.groupby(...).size() performed over one or multiple variables.

__init__(key, subkey=None, counts=None, sel=None)¶

Initialize ValueCounts instance.

Parameters:

key (list) – key is a tuple, list or string of (the) variable name(s), matching those and the structure of the keys in the value_counts dictionary.
subkey (list) – subset of key. If provided, the value_counts dictionary will be projected from key onto the (subset of) subkey. E.g. use this to map a two dimensional value_counts dictionary onto one specified dimension. Default is None. Optional.
counts (dict) – the value_counts dictionary.
sel (dict) – Apply selections to value_counts dictionary. Default is {}. Optional.

count(value_bin)¶

Get the bin count for a specific bin key.

Parameters:	value_bin (tuple) – a specific key, and can be a list or tuple.
Returns:	specific bin counter value
Return type:	int

counts¶

Process and return value-counts dictionary.

Returns:	after processing, returns the value_counts dictionary
Return type:	dict

create_sub_counts(subkey, sel=None)¶

Project existing value counts onto a subset of keys.

E.g. map variables x,y onto single dimension x, so for each bin in x integrate over y.

Parameters:

subkey (tuple) – input sub-key, is a tuple, list, or string. This is the new key of variables for the returned ValueCounts object.
sel (dict) – dictionary with selection. Optional.
Returns:	value_counts object where subkey has become the new key.
Return type:	ValueCounts
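
Projecting value counts onto a subset of keys can be pictured as summing over the dropped dimensions. The sketch below works on a plain dictionary keyed by tuples; `project_counts` is a hypothetical function, and the sel argument is left out.

```python
from collections import Counter


def project_counts(counts, key, subkey):
    """Project {(x, y, ...): n} onto the variables named in subkey."""
    positions = [key.index(name) for name in subkey]
    projected = Counter()
    for bin_key, n in counts.items():
        # Keep only the requested dimensions and add up everything else.
        projected[tuple(bin_key[p] for p in positions)] += n
    return dict(projected)


counts = {(1, 'a'): 2, (1, 'b'): 3, (2, 'a'): 4}
x_counts = project_counts(counts, key=('x', 'y'), subkey=('x',))
y_counts = project_counts(counts, key=('x', 'y'), subkey=('y',))
```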

get_values(val_keys=())¶

Get all key-values of a subset of keys.

E.g. give all x values of the keys, when the value_counts object has keys (x, y).

Parameters:	val_keys (tuple) – a specific sub-key to get key values for.
Returns:	all key-values of a subset of keys.
Return type:	tuple

key¶

Process and return current value-counts key.

Returns:	the key
Return type:	tuple

nononecounts¶

Return value-counts dictionary without None keys.

Returns:	after processing, returns the value_counts dictionary without None keys
Return type:	dict

num_bins¶

Number of value-counts bins.

Returns:	number of bins
Return type:	int

num_nonone_bins¶

Number of not-none value-counts bins.

Returns:	number of not-none bins
Return type:	int

process_counts(accept_equiv=True)¶

Project value counts onto the existing subset of keys.

E.g. map variables x,y onto single dimension x, so for each bin in x integrate over y.

Parameters:	accept_equiv (bool) – accept equivalence of key and subkey if subkey is in a different order than key. Default is True.
Returns:	successful projection or not
Return type:	bool

remove_keys_of_inconsistent_type(prefered_key_type=None)¶

Remove keys with inconsistent data type(s).

Parameters:	prefered_key_type (tuple) – the preferred key type to keep. Can be a tuple, list, or single type. E.g. str or (int, str, float). If None is provided, the most common key type found is kept.
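
The "most common key type" rule can be sketched as follows. This is an illustration on a plain dictionary, not the method itself: `keep_consistent_keys` is a hypothetical name, and edge cases such as bool keys counting as int are ignored.

```python
from collections import Counter


def keep_consistent_keys(counts, preferred_key_type=None):
    """Keep only the keys of the preferred (or most common) type."""
    if not counts:
        return {}
    if preferred_key_type is None:
        type_frequency = Counter(type(key) for key in counts)
        preferred_key_type = type_frequency.most_common(1)[0][0]
    elif isinstance(preferred_key_type, list):
        preferred_key_type = tuple(preferred_key_type)  # isinstance needs a tuple
    return {key: n for key, n in counts.items()
            if isinstance(key, preferred_key_type)}


counts = {'apple': 3, 'pear': 1, 7: 2, None: 5}
cleaned = keep_consistent_keys(counts)                 # str is the most common key type
cleaned_mixed = keep_consistent_keys(counts, (int, str))
```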

skey¶

Current value-counts subkey.

Returns:	the subkey
Return type:	tuple

sum_counts¶

Sum of counts of all value-counts bins.

Returns:	the sum of counts of all bins
Return type:	float

sum_nonone_counts¶

Sum of not-none counts of all value-counts bins.

Returns:	the sum of not-none counts of all bins
Return type:	float

eskapade.analysis.histogram_filling module¶

Project: Eskapade - A python-based package for data analysis.

Class: HistogramFillerBase

Created: 2017/03/21

Description: Algorithm to fill histogrammar sparse-bin histograms. It is possible to do cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied.
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.histogram_filling.HistogramFillerBase(**kwargs)¶

Bases: escore.core.element.Link

Base class link to fill histograms.

It is possible to do after-filling cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied. Final histograms are stored in the datastore.

__init__(**kwargs)¶

Initialize link instance.

Store and do basic check on the attributes of link HistogramFillerBase.

Parameters:

name (str) – name of link
read_key (str) – key of input data to read from data store
store_key (str) – key of output data to store histograms in data store
columns (list) – columns to pick up from input data. (default is all columns)
bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns

Example bin_specs dictionary is:

>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
...              'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}}

Parameters:

var_dtype (dict) – dict of datatypes of the columns to study from dataframe. If not provided, try to determine datatypes directly from dataframe.
store_at_finalize (bool) – Store histograms in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
drop_keys (dict) – dictionary used for dropping specific keys from bins dictionaries of histograms

Example drop_keys dictionary is:

>>> drop_keys = {'x': [1, 4, 8, 19],
...              'y': ['apple', 'pear', 'tomato'],
...              'x:y': [(1, 'apple'), (19, 'tomato')]}

assert_dataframe(df)¶

Check that input data is a filled pandas data frame.

Parameters:	df – input (pandas) data frame

categorize_columns(df)¶

Categorize columns of dataframe by data type.

Parameters:	df – input (pandas) data frame

drop_requested_keys(name, counts)¶

Drop requested keys from counts dictionary.

Parameters:

name (string) – key of drop_keys dict to get array of keys to be dropped
counts (dict) – counts dictionary to drop specific keys from
Returns:	count dict without dropped keys

execute()¶

Execute the link.

execute() does four things:

check the presence and data type of the requested columns
convert timestamp variables to nanoseconds (integers)
do the actual value counting based on categories and created indices
convert the counts to histograms and add them to the datastore

fill_histogram(idf, c)¶

Fill input histogram with column(s) of input dataframe.

Parameters:

idf – input data frame used for filling histogram
c (list) – histogram column(s)

finalize()¶

Finalize the link.

Store Histograms here, if requested.

get_all_columns(data)¶

Retrieve all columns / keys from input data.

Parameters:	data – input data sample (pandas dataframe or dict)
Returns:	list of columns
Return type:	list

get_data_type(df, col)¶

Get data type of dataframe column.

Parameters:

df – input data frame
col (str) – column

initialize()¶: Initialize the link.

process_and_store()¶: Store (and possibly process) histogram objects.

process_columns(df)¶

Process columns before histogram filling.

Specifically, convert timestamp columns to integers

Parameters:	df – input (pandas) data frame
Returns:	output (pandas) data frame with converted timestamp columns
Return type:	pandas DataFrame

var_bin_specs(c, idx=0)¶

Determine bin_specs to use for variable c.

Parameters:

c (list) – list of variables, or string variable
idx (int) – index of the variable in c, for which to return the bin specs. Default is 0.
Returns:	selected bin_specs of variable

eskapade.analysis.histogram_filling.only_bool(val)¶

Pass input value or array only if it is a bool.

Parameters:	val – value to be evaluated
Returns:	evaluated value
Return type:	np.bool or np.ndarray

eskapade.analysis.histogram_filling.only_float(val)¶

Pass input value or array only if it is a float.

Parameters:	val – value to be evaluated
Returns:	evaluated value
Return type:	np.float64 or np.ndarray

eskapade.analysis.histogram_filling.only_int(val)¶

Pass input value or array only if it is an integer.

Parameters:	val – value to be evaluated
Returns:	evaluated value
Return type:	np.int64 or np.ndarray

eskapade.analysis.histogram_filling.only_str(val)¶

Pass input value or array only if it is a string.

Parameters:	val – value to be evaluated
Returns:	evaluated value
Return type:	str or np.ndarray

eskapade.analysis.histogram_filling.to_ns(x)¶

Convert input timestamps to nanoseconds (integers).

Parameters:	x – value to be converted
Returns:	converted value
Return type:	int
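
Converting a timestamp to an integer number of nanoseconds since the Unix epoch can be sketched with the standard library. The real function presumably operates on pandas or numpy timestamps; this stand-in uses datetime, which only has microsecond resolution, and `datetime_to_ns` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_ns(dt):
    """Nanoseconds since 1970-01-01 UTC; naive datetimes are treated as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division keeps the arithmetic exact (no float rounding).
    microseconds = (dt - EPOCH) // timedelta(microseconds=1)
    return microseconds * 1000


one_day_ns = datetime_to_ns(datetime(1970, 1, 2))
```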

eskapade.analysis.histogram_filling.to_str(val)¶

Convert input to (array of) string(s).

Parameters:	val – value to be converted
Returns:	converted value
Return type:	str or np.ndarray

eskapade.analysis.histogram_filling.value_to_bin_center(val, **kwargs)¶

Convert value to bin center.

Convert a numeric or timestamp column to a common bin center value.

Parameters:

bin_width – bin_width value needed to convert column to a common bin center value
bin_offset – bin_offset value needed to convert column to a common bin center value

eskapade.analysis.histogram_filling.value_to_bin_index(val, **kwargs)¶

Convert value to bin index.

Convert a numeric or timestamp column to an integer bin index.

Parameters:

bin_width – bin_width value needed to convert column to an integer bin index
bin_offset – bin_offset value needed to convert column to an integer bin index
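
For uniform binning, both conversions follow from bin_width and bin_offset. The sketch below assumes that bin i covers [bin_offset + i * bin_width, bin_offset + (i + 1) * bin_width); that convention, the default values and the names `to_bin_index` and `to_bin_center` are illustrative assumptions rather than the documented behaviour.

```python
import math


def to_bin_index(val, bin_width=1, bin_offset=0):
    """Integer index of the uniform bin that contains val."""
    return math.floor((val - bin_offset) / bin_width)


def to_bin_center(val, bin_width=1, bin_offset=0):
    """Center of the uniform bin that contains val."""
    index = to_bin_index(val, bin_width, bin_offset)
    return bin_offset + (index + 0.5) * bin_width


index = to_bin_index(7.3, bin_width=2, bin_offset=1)    # bin [7, 9)
center = to_bin_center(7.3, bin_width=2, bin_offset=1)
```

Values below the offset get negative indices, because the index is floored rather than truncated.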

eskapade.analysis.statistics module¶

Project: Eskapade - A python-based package for data analysis.

Classes: ArrayStats, GroupByStats

Created: 2017/03/21

Description: Summary of an array.
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.statistics.ArrayStats(data, col_name, weights=None, unit='', label='')¶

Bases: object

Create summary of an array.

Class to calculate statistics (mean, standard deviation, percentiles, etc.) and create a histogram of values in an array. The statistics can be returned as values in a dictionary, a printable string, or as a LaTeX string.

__init__(data, col_name, weights=None, unit='', label='')¶

Initialize for a single column in data frame.

Parameters:

data ((keys of) dict) – Input array
col_name – column name
weights (string (column of data)) – Input array (default None)
unit – Unit of column
label (str) – Label to describe column variable
Raises:	TypeError

create_mpv_stat()¶

Compute most probable value from histogram.

This function computes the most probable value based on the histogram from make_histogram(), and adds it to the statistics.
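
A common way to read a most probable value off a histogram is to take the center of the fullest bin. The reference does not spell out the definition used here, so the sketch below is an assumption; `most_probable_value` is a hypothetical name.

```python
def most_probable_value(bin_edges, bin_counts):
    """Center of the bin with the highest count (first such bin on ties)."""
    fullest = max(range(len(bin_counts)), key=bin_counts.__getitem__)
    return 0.5 * (bin_edges[fullest] + bin_edges[fullest + 1])


mpv = most_probable_value([0, 1, 2, 4], [5, 3, 10])   # fullest bin is [2, 4)
```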

create_stats()¶

Compute statistical properties of column variable.

This function computes the statistical properties of values in the specified column. It is called by other functions that use the resulting figures to create a statistical overview.

get_col_props()¶

Get column properties.

Returns dict:	Column properties

get_latex_table(get_stats=None, latex=True)¶

Get LaTeX code string for table of stats values.

Parameters:

get_stats (list) – List of statistics that you want to filter on. (default None (all stats)) Available stats are: ‘count’, ‘filled’, ‘distinct’, ‘mean’, ‘std’, ‘min’, ‘max’, ‘p05’, ‘p16’, ‘p50’, ‘p84’, ‘p95’, ‘p99’
latex (bool) – LaTeX output or list output (default True)
Returns str:	LaTeX code snippet

get_print_stats(to_output=False)¶

Get statistics in printable form.

Parameters:	to_output (bool) – Print statistics to output stream?
Returns str:	Printable statistics string

get_x_label()¶: Get x label.

logger¶

A logger that emits log messages to an observer.

make_histogram(var_bins=30, var_range=None, bin_edges=None, create_mpv_stat=True)¶

Create histogram of column values.

Parameters:

var_bins (int) – Number of histogram bins
var_range (tuple) – Range of histogram variable
bin_edges (list) – predefined bin edges to use for histogram. Overrules var_bins.

class eskapade.analysis.statistics.GroupByStats(data, col_name, groupby=None, weights=None, unit='', label='')¶

Bases: eskapade.analysis.statistics.ArrayStats

Create summary of an array in groups.

__init__(data, col_name, groupby=None, weights=None, unit='', label='')¶

Initialize for a single column in dataframe.

Parameters:

data ((keys of) dict) – Input array
col_name – column name
weights (string (column of data)) – Input array (default None)
unit – Unit of column
label (str) – Label to describe column variable
groupby – column name
Raises:	TypeError

get_latex_table(get_stats=None)¶

Get LaTeX code string for group-by table of stats values.

Parameters:	get_stats (list) – same as ArrayStats.get_latex_table get_stats key word.
Returns str:	LaTeX code snippet

eskapade.analysis.statistics.get_col_props(var_type)¶

Get column properties.

Returns dict:	Column properties

eskapade.analysis.statistics.weighted_quantile(data, weights=None, probability=0.5)¶

Compute the weighted quantile of a 1D numpy array.

Weighted quantiles, inspired by https://github.com/nudomarinero/wquantiles/blob/master/wquantiles.py, written by Jose Sabater. Updated here to return multiple quantiles in one go. Also works when weights is None.

Parameters:

data (ndarray) – input array (one dimension).
weights (ndarray) – array with the weights, of the same size as data.
probability (ndarray) – array of quantiles to compute. Each probability must have a value between 0 and 1.
Returns:	list of the output value(s).
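
The general idea of a weighted quantile can be sketched in pure Python: sort the data, place each value at the midpoint of its cumulative weight interval, and interpolate linearly. This follows the approach of the wquantiles script linked above as far as I recall it, but it is a sketch under that assumption, not Eskapade's code; `weighted_quantile_sketch` and `_interpolate` are hypothetical names.

```python
def _interpolate(p, xs, ys):
    """Piecewise-linear interpolation, clamped to the end values."""
    if p <= xs[0]:
        return ys[0]
    if p >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if p <= xs[i]:
            fraction = (p - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + fraction * (ys[i] - ys[i - 1])


def weighted_quantile_sketch(data, weights=None, probability=0.5):
    """Return a list with one weighted quantile per requested probability."""
    if weights is None:
        weights = [1.0] * len(data)
    if isinstance(probability, (int, float)):
        probability = [probability]
    pairs = sorted(zip(data, weights))
    values = [value for value, _ in pairs]
    total = sum(weight for _, weight in pairs)
    # Position of each sorted value: midpoint of its cumulative weight interval.
    positions, running = [], 0.0
    for _, weight in pairs:
        running += weight
        positions.append((running - 0.5 * weight) / total)
    return [_interpolate(p, positions, values) for p in probability]


median = weighted_quantile_sketch([1, 2, 3, 4, 5])
weighted_median = weighted_quantile_sketch([1, 2, 3], weights=[1, 1, 2])
extremes = weighted_quantile_sketch([3, 1, 2], probability=[0.0, 1.0])
```

Giving the value 3 twice the weight of the others pulls the median up from 2 towards 3.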
