eskapade.analysis package¶
Subpackages¶
- eskapade.analysis.links package
- Submodules
- eskapade.analysis.links.apply_func_to_df module
- eskapade.analysis.links.apply_selection_to_df module
- eskapade.analysis.links.basic_generator module
- eskapade.analysis.links.df_concatenator module
- eskapade.analysis.links.df_merger module
- eskapade.analysis.links.histogrammar_filler module
- eskapade.analysis.links.random_sample_splitter module
- eskapade.analysis.links.read_to_df module
- eskapade.analysis.links.record_factorizer module
- eskapade.analysis.links.record_vectorizer module
- eskapade.analysis.links.value_counter module
- eskapade.analysis.links.write_from_df module
- Module contents
Submodules¶
eskapade.analysis.correlation module¶
Project: Eskapade - A python-based package for data analysis.
Created: 2018/06/23
- Description:
Correlation related util functions.
Convert Pearson correlation value into a chi2 value of a contingency test matrix of a bivariate Gaussian, and vice versa. The calculation uses scipy’s mvn library. Calculates correlation coefficients based on the mutual_information, correlation_ratio, pearson, kendall or spearman methods.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
-
eskapade.analysis.correlation.
calculate_correlations
(df, method)¶ Calculates correlation coefficients between every column pair.
Parameters: - df (pd.DataFrame) – input data frame
- method (str) – one of mutual_information, correlation_ratio, pearson, kendall, spearman, phik or significance
Returns: pd.DataFrame
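To make "every column pair" concrete, here is a minimal standard-library sketch of the pearson case only. It is not the eskapade implementation: the helper names pearson and pairwise_pearson and the toy columns are hypothetical, and the result is a plain dict rather than the pd.DataFrame that calculate_correlations returns.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_pearson(columns):
    """Return {(col_a, col_b): rho} for every unordered pair of columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Hypothetical toy data: column name -> list of values.
columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}
result = pairwise_pearson(columns)
```

In this toy data y rises linearly with x and z falls linearly with x, so the ("x", "y") and ("x", "z") entries sit at the two ends of the Pearson range.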
-
eskapade.analysis.correlation.
chi2_from_rho
(rho, n, subtract_from_chi2=0, corr0=None, sx=None, sy=None, nx=-1, ny=-1)¶ Calculate chi2-value of bivariate gauss having correlation value rho
Calculate no-noise chi2 value of bivar gauss with correlation rho, with respect to bivariate gauss without any correlation.
Returns float: chi2 value
-
eskapade.analysis.correlation.
rho_from_chi2
(chi2, n, nx, ny, sx=None, sy=None)¶ correlation coefficient of bivariate gaussian derived from chi2-value
Chi2-value gets converted into correlation coefficient of bivariate gauss with correlation value rho, assuming the given binning and number of records. Correlation coefficient value is between 0 and 1.
Bivariate gaussian’s range is set to [-5,5] by construction.
Returns float: correlation coefficient
eskapade.analysis.datetime module¶
Project: Eskapade - A python-based package for data analysis.
Classes: TimePeriod, FreqTimePeriod, UniformTsTimePeriod
Created: 2017/03/14
- Description:
- Time period and time period with frequency.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
-
class
eskapade.analysis.datetime.
FreqTimePeriod
(**kwargs)¶ Bases:
eskapade.analysis.datetime.TimePeriod
Time period with frequency.
-
__init__
(**kwargs)¶ Initialize TimePeriod instance.
-
dt_string
(period_index)¶ Convert period index into date/time string (start of period).
Parameters: period_index (int) – specified period index value.
-
freq
¶ Return frequency.
-
period_index
(dt)¶ Return number of periods until date/time “dt” since 1970-01-01.
Parameters: dt – specified date/time parameter
-
-
class
eskapade.analysis.datetime.
TimePeriod
(**kwargs)¶ Bases:
escore.core.mixin.ArgumentsMixin
Time period.
-
__init__
(**kwargs)¶ Initialize TimePeriod instance.
-
logger
¶ A logger that emits log messages to an observer.
The logger can be instantiated as a module or class attribute, e.g.
>>> logger = Logger()
>>> logger.info("I'm a module logger attribute.")
>>>
>>> class Point(object):
>>>     logger = Logger()
>>>
>>>     def __init__(self, x = 0.0, y = 0.0):
>>>         Point.logger.debug('Initializing {point} with x = {x} y = {y}', point=Point, x=x, y=y)
>>>         self._x = x
>>>         self._y = y
>>>
>>>     @property
>>>     def x(self):
>>>         self.logger.debug('Getting property x = {point._x}', point=self)
>>>         return self._x
>>>
>>>     @x.setter
>>>     def x(self, x):
>>>         self.logger.debug('Setting property x = {point._x}', point=self)
>>>         self._x = x
>>>
>>>     @property
>>>     def y(self):
>>>         self.logger.debug('Getting property y = {point._y}', point=self)
>>>         return self._y
>>>
>>>     @y.setter
>>>     def y(self, y):
>>>         self.logger.debug('Setting property y = {point._y}', point=self)
>>>         self._y = y
>>>
>>> a_point = Point(1, 2)
>>>
>>> logger.info('p_x = {point.x} p_y = {point.y}', point=a_point)
>>> logger.log_level = LogLevel.DEBUG
>>> logger.info('p_x = {point.x} p_y = {point.y}', point=a_point)
The logger uses PEP-3101 (Advanced String Formatting) with named placeholders, see <https://www.python.org/dev/peps/pep-3101/> and <https://pyformat.info/> for more details and examples.
Furthermore, logging events are only formatted and evaluated for logging levels that are enabled. So, there’s no need to check the logging level before logging. It’s also efficient.
-
classmethod
parse_date_time
(dt)¶ Try to parse specified date/time.
Parameters: dt – specified date/time
-
classmethod
parse_time_period
(period)¶ Try to parse specified time period.
Parameters: period – specified period
-
period_index
(dt)¶ Get number of periods until date/time “dt”.
Parameters: dt – specified date/time
-
-
class
eskapade.analysis.datetime.
UniformTsTimePeriod
(**kwargs)¶ Bases:
eskapade.analysis.datetime.TimePeriod
Time period with offset.
-
__init__
(**kwargs)¶ Initialize TimePeriod instance.
-
offset
¶ Get offset parameter.
-
period
¶ Get period parameter.
-
period_index
(dt)¶ Get number of periods until date/time “dt” since “offset”, given specified “period”.
Parameters: dt – specified date/time
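As an illustration of "number of periods since offset", the sketch below uses only the standard datetime module. It is not the UniformTsTimePeriod code: the free function period_index, its explicit offset and period arguments, and the flooring behaviour for dates before the offset are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def period_index(dt, offset, period):
    """Number of whole periods between offset and dt (floor division)."""
    # Dividing two timedeltas with // yields an int and rounds towards
    # minus infinity, so dates before the offset get negative indices.
    return (dt - offset) // period


# Hypothetical settings: weekly periods counted from 1 January 2017.
offset = datetime(2017, 1, 1)
period = timedelta(days=7)

idx = period_index(datetime(2017, 1, 20), offset, period)
idx_before = period_index(datetime(2016, 12, 31), offset, period)
```

Here 20 January lies 19 days after the offset, which is two full weeks, and the day before the offset falls in the period with index -1.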
-
eskapade.analysis.histogram module¶
Project: Eskapade - A python-based package for data analysis.
Classes: ValueCounts, BinningUtil, Histogram
Created: 2017/03/14
- Description:
- Generic 1D Histogram class.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
-
class
eskapade.analysis.histogram.
BinningUtil
(**kwargs)¶ Bases:
object
Helper for interpreting bin specifications.
BinningUtil is a helper class used for interpreting bin specification dictionaries. It is a base class for the Histogram class.
-
__init__
(**kwargs)¶ Initialize link instance.
A bin_specs dictionary needs to be provided as input. bin_specs is a dict containing ‘bin_width’ and ‘bin_offset’ keys. If the bin widths are not equal, bin_specs contains ‘bin_edges’ (array) instead of ‘bin_width’ and ‘bin_offset’. ‘bin_width’ and ‘bin_offset’ can be numeric or numpy timestamps.
Alternatively, bin_edges can be provided as input to bin_specs.
Example bin_specs dictionaries are:
>>> import numpy as np
>>> bin_specs = {'bin_width': 1, 'bin_offset': 0}
>>> bin_specs = {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}
>>> bin_specs = {'bin_width': np.timedelta64(30, 'D'), 'bin_offset': np.datetime64('2010-01-04')}
Parameters: - bin_specs (dict) – dictionary contains ‘bin_width’ and ‘bin_offset’ numbers or ‘bin_edges’ array. Default is None.
- bin_edges (list) – array with numpy histogram style bin_edges. Default is None.
-
bin_specs
¶ Get bin_specs dictionary.
Returns: bin_specs dictionary Return type: dict
-
get_bin_center
(bin_label)¶ Return bin center for a given bin index.
Parameters: bin_label – bin label for which to find the bin center Returns: bin center, can be float, int, timestamp
-
get_bin_edges
()¶ Return bin edges.
Returns: bin edges Return type: array
-
get_bin_edges_range
()¶ Return bin range determined from bin edges.
Returns: bin range Return type: tuple
-
get_left_bin_edge
(bin_label)¶ Return left bin edge for a given bin index.
Parameters: bin_label – bin label for which to find the left bin edge Returns: bin edge, can be float, int, timestamp
-
get_right_bin_edge
(bin_label)¶ Return right bin edge for a given bin index.
Parameters: bin_label – bin label for which to find the right bin edge. Returns: bin edge, can be float, int, timestamp
-
truncated_bin_edges
(variable_range=None)¶ Bin edges corresponding to a given variable range.
Parameters: variable_range (list) – variable range used for finding the right bin edges array. Optional. Returns: truncated bin edges Return type: array
-
value_to_bin_label
(var_value, greater_equal=False)¶ Return bin index for given bin value.
Parameters: - var_value – variable value for which to find the bin index
- greater_equal (bool) – for float, int, timestamp, return index of bin for which value falls in range [lower edge, upper edge). If set to True, return index of bin for which value falls in range [lower edge, upper edge]. Default is False.
Returns: bin index
Return type: int
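For the non-uniform ‘bin_edges’ case, the lookup from a value to a bin index can be sketched with the standard bisect module. This is an illustration under assumptions, not BinningUtil itself: the function names and the choice to return None outside the covered range are hypothetical, and only the default half-open convention [lower edge, upper edge) is shown.

```python
from bisect import bisect_right


def value_to_bin_index(value, bin_edges):
    """Index of the half-open bin [lower edge, upper edge) holding value.

    Returns None when value lies outside the range covered by bin_edges.
    """
    if value < bin_edges[0] or value >= bin_edges[-1]:
        return None
    # bisect_right counts the edges that are <= value; the bin index is
    # one less than that count.
    return bisect_right(bin_edges, value) - 1


def bin_center(index, bin_edges):
    """Midpoint between the left and right edge of the bin at index."""
    return 0.5 * (bin_edges[index] + bin_edges[index + 1])


# Same edges as in the bin_specs example above.
bin_edges = [0, 2, 3, 4, 5, 7, 8]
```

A value that coincides with an edge, such as 2, is assigned to the bin that starts at that edge, which is what the half-open convention implies.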
-
-
class
eskapade.analysis.histogram.
Histogram
(counts, **kwargs)¶ Bases:
eskapade.analysis.histogram.BinningUtil
,escore.core.mixin.ArgumentsMixin
Generic 1D Histogram class.
Histogram holds bin labels (name of each bin), value_counts (values of the histogram) and a variable name. The bins can be categoric or numeric, where numeric includes timestamps. In case of numeric bins, bin_specs is set. bin_specs is a dict containing bin_width and bin_offset. If the bin widths are not equal, bin_specs contains bin_edges instead of bin_width and bin_offset.
-
__init__
(counts, **kwargs)¶ Initialize Histogram instance.
A bin_specs dictionary can be provided as input. bin_specs is a dict containing ‘bin_width’ and ‘bin_offset’ keys. If the bin widths are not equal, bin_specs contains ‘bin_edges’ (array) instead of ‘bin_width’ and ‘bin_offset’. ‘bin_width’ and ‘bin_offset’ can be numeric or numpy timestamps.
Histogram counts can be specified as a ValueCounts object, a dictionary or a tuple:
- tuple: Histogram((bin_values, bin_edges), variable=<your_variable_name>)
- dict: a dictionary as returned by pandas.Series.value_counts() or pandas.DataFrame.groupby().size() over one variable.
- ValueCounts: a ValueCounts object contains a value_counts dictionary.
Example bin_specs dictionaries are:
>>> import numpy as np
>>> bin_specs = {'bin_width': 1, 'bin_offset': 0}
>>> bin_specs = {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}
>>> bin_specs = {'bin_width': np.timedelta64(30, 'D'), 'bin_offset': np.datetime64('2010-01-04')}
Parameters: - counts – histogram counts
- bin_specs (dict) – dictionary contains ‘bin_width’ and ‘bin_offset’ numbers or ‘bin_edges’ array (default is None)
- variable (str) – name of the variable represented by the histogram
- datatype (type) – data type of the variable represented by the histogram (optional)
-
bin_centers
()¶ Return bin centers.
Returns: array of the bin centers Return type: array
-
bin_edges
()¶ Return numpy style bin_edges array with uniform binning.
Returns: array of all bin edges Return type: array
-
bin_entries
()¶ Return number of bin entries.
Return the bin counts of the known bins in the value_counts object.
Returns: array of the bin counts Return type: array
-
bin_labels
()¶ Return bin labels.
Returns: array of all bin labels Return type: array
-
classmethod
combine_hists
(hists, labels=False, rel_bin_width_tol=1e-06, **kwargs)¶ Combine a set of histograms.
Parameters: - hists (array) – array of Histograms to add up.
- labels (bool) – whether the histograms to add up have labels (otherwise they are numeric). Default is False.
- variable (str) – name of variable described by the summed-up histogram
- rel_bin_width_tol (float) – relative tolerance between numeric bin edges.
Returns: summed up histogram
Return type: Histogram
-
copy
(**kwargs)¶ Return a copy of this histogram.
Parameters: variable (str) – assign new variable name
-
datatype
¶ Data type of the variable represented by the histogram.
Returns: data type Return type: type
-
get_bin_count
(bin_label)¶ Get bin count for specific bin label.
Parameters: bin_label – a specific key to find corresponding bin. Returns: bin counter value Return type: int
-
get_bin_labels
()¶ Return all bin labels.
Returns: array of all bin labels Return type: array
-
get_bin_range
()¶ Return the bin range.
Returns: tuple of the bin range found Return type: tuple
-
get_bin_vals
(variable_range=None, combine_values=True)¶ Get bin labels/edges and corresponding bin counts.
Bin values corresponding to a given variable range.
Parameters: - variable_range (list) – variable range used for finding the right bins to get values from. Optional.
- combine_values (bool) – if bin_specs is not set, combine existing bin labels with variable range.
Returns: two arrays of bin values and bin edges
Return type: array
-
get_hist_val
(var_value)¶ Get bin count for bin by value of histogram variable.
Parameters: var_value – a specific value to find corresponding bin. Returns: bin counter value Return type: int
-
get_nonone_bin_centers
()¶ Return bin centers.
Returns: array of the bin centers Return type: array
-
get_nonone_bin_counts
()¶ Return bin counts.
Returns: array of the bin counts Return type: array
-
get_nonone_bin_edges
()¶ Return numpy style bin-edges array.
Returns: array of the bin edges Return type: array
-
get_nonone_bin_range
()¶ Return the bin range.
Returns: tuple of the bin range found Return type: tuple
-
get_uniform_bin_edges
()¶ Return numpy style bin-edges array with uniform binning.
Returns: array of all bin edges Return type: array
-
logger
¶ A logger that emits log messages to an observer.
-
n_bins
¶ Number of bins in the ValueCounts object.
Returns: number of bins Return type: int
-
n_dim
¶ Number of histogram dimensions.
The number of histogram dimensions, which is equal to one by construction.
Returns: number of dimensions Return type: int
-
num_bins
¶ Number of bins.
Returns: number of bins Return type: int
-
remove_keys_of_inconsistent_type
(prefered_key_type=None)¶ Remove all keys that have inconsistent data type(s).
Parameters: prefered_key_type (tuple) – the preferred key type to keep. Can be a tuple, list, or single type. E.g. str or (int, str, float). If None is provided, the most common key type found is kept.
-
simulate
(size, *args)¶ Simulate data using self (Histogram instance) as PDF.
see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html
Parameters: size (int) – number of data points to generate Returns: generated_data (numpy.array), the generated data; Histogram of the generated data Return type: Histogram
-
surface
()¶ Calculate surface of the histogram.
Returns: surface
-
to_normalized
(**kwargs)¶ Return a normalized copy of this histogram.
Parameters: - new_var_name (str) – assign new variable name
- variable_range (list) – variable range used for finding the right bins to get values from.
- combine_values (bool) – if bin_specs is not set, combine existing bin labels with variable range.
-
variable
¶ Name of variable represented by the histogram.
Returns: variable name Return type: string
-
-
class
eskapade.analysis.histogram.
ValueCounts
(key, subkey=None, counts=None, sel=None)¶ Bases:
object
A dictionary of value counts.
The dictionary of value counts comes out of pandas.Series.value_counts() for one variable or pandas.DataFrame.groupby().size() performed over one or multiple variables.
-
__init__
(key, subkey=None, counts=None, sel=None)¶ Initialize link instance.
Parameters: - key (list) – key is a tuple, list or string of (the) variable name(s), matching those and the structure of the keys in the value_counts dictionary.
- subkey (list) – subset of key. If provided, the value_counts dictionary will be projected from key onto the (subset of) subkey. E.g. use this to map a two dimensional value_counts dictionary onto one specified dimension. Default is None. Optional.
- counts (dict) – the value_counts dictionary.
- sel (dict) – Apply selections to value_counts dictionary. Default is {}. Optional.
-
count
(value_bin)¶ Get bin count for specific bin-key value bin.
Parameters: value_bin (tuple) – a specific key, and can be a list or tuple. Returns: specific bin counter value Return type: int
-
counts
¶ Process and return value-counts dictionary.
Returns: after processing, returns the value_counts dictionary Return type: dict
-
create_sub_counts
(subkey, sel=None)¶ Project existing value counts onto a subset of keys.
E.g. map variables x,y onto single dimension x, so for each bin in x integrate over y.
Parameters: - subkey (tuple) – input sub-key, is a tuple, list, or string. This is the new key of variables for the returned ValueCounts object.
- sel (dict) – dictionary with selection. Optional.
Returns: value_counts object where subkey has become the new key.
Return type: ValueCounts
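The projection described for create_sub_counts, where each bin in x sums over all values of y, can be sketched on a plain dictionary. The function project_counts and its arguments are hypothetical names for this illustration and do not reproduce the ValueCounts interface or its sel option.

```python
from collections import defaultdict


def project_counts(counts, key, subkey):
    """Sum counts over every variable in key that is not in subkey."""
    # Positions of the subkey variables inside each tuple-valued bin key.
    positions = [key.index(name) for name in subkey]
    projected = defaultdict(int)
    for bin_key, n in counts.items():
        projected[tuple(bin_key[p] for p in positions)] += n
    return dict(projected)


# Hypothetical two-dimensional value counts keyed by (x, y).
counts = {(1, "apple"): 3, (1, "pear"): 2, (19, "tomato"): 5}
x_counts = project_counts(counts, key=("x", "y"), subkey=("x",))
```

The two entries with x equal to 1 collapse into a single bin whose count is their sum, while the total number of counted records stays the same.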
-
get_values
(val_keys=())¶ Get all key-values of a subset of keys.
E.g. give all x values of the keys, when the value_counts object has keys (x, y).
Parameters: val_keys (tuple) – a specific sub-key to get key values for. Returns: all key-values of a subset of keys. Return type: tuple
-
key
¶ Process and return current value-counts key.
Returns: the key Return type: tuple
-
nononecounts
¶ Return value-counts dictionary without None keys.
Returns: after processing, returns the value_counts dictionary without None keys Return type: dict
-
num_bins
¶ Number of value-counts bins.
Returns: number of bins Return type: int
-
num_nonone_bins
¶ Number of not-none value-counts bins.
Returns: number of not-none bins Return type: int
-
process_counts
(accept_equiv=True)¶ Project value counts onto the existing subset of keys.
E.g. map variables x,y onto single dimension x, so for each bin in x integrate over y.
Parameters: accept_equiv (bool) – accept equivalence of key and subkey if subkey is in a different order than key. Default is True. Returns: successful projection or not Return type: bool
-
remove_keys_of_inconsistent_type
(prefered_key_type=None)¶ Remove keys with inconsistent data type(s).
Parameters: prefered_key_type (tuple) – the preferred key type to keep. Can be a tuple, list, or single type. E.g. str or (int, str, float). If None is provided, the most common key type found is kept.
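The rule "keep the most common key type" can be illustrated on a plain dictionary. This is a sketch under assumptions: keep_consistent_keys is a hypothetical helper, it counts how many keys have each type (not how many records they hold), and it returns a new dictionary instead of modifying a ValueCounts object in place.

```python
from collections import Counter


def keep_consistent_keys(counts, preferred_key_type=None):
    """Return a copy of counts holding only keys of the preferred type(s)."""
    if not counts:
        return {}
    if preferred_key_type is None:
        # Assumption: "most common" is judged by the number of keys per type.
        type_freq = Counter(type(k) for k in counts)
        preferred_key_type = type_freq.most_common(1)[0][0]
    elif isinstance(preferred_key_type, list):
        # isinstance() accepts a single type or a tuple of types, not a list.
        preferred_key_type = tuple(preferred_key_type)
    return {k: v for k, v in counts.items() if isinstance(k, preferred_key_type)}


# Hypothetical counts with two string keys and one stray integer key.
counts = {"a": 1, "b": 2, 3: 4}
cleaned = keep_consistent_keys(counts)
```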
-
skey
¶ Current value-counts subkey.
Returns: the subkey Return type: tuple
-
sum_counts
¶ Sum of counts of all value-counts bins.
Returns: the sum of counts of all bins Return type: float
-
sum_nonone_counts
¶ Sum of not-none counts of all value-counts bins.
Returns: the sum of not-none counts of all bins Return type: float
-
eskapade.analysis.histogram_filling module¶
Project: Eskapade - A python-based package for data analysis.
Class: HistogramFillerBase
Created: 2017/03/21
- Description:
- Algorithm to fill histogrammar sparse-bin histograms. It is possible to do cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
-
class
eskapade.analysis.histogram_filling.
HistogramFillerBase
(**kwargs)¶ Bases:
escore.core.element.Link
Base class link to fill histograms.
It is possible to do after-filling cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied. Final histograms are stored in the datastore.
-
__init__
(**kwargs)¶ Initialize link instance.
Store and do basic check on the attributes of link HistogramFillerBase.
Parameters: - name (str) – name of link
- read_key (str) – key of input data to read from data store
- store_key (str) – key of output data to store histograms in data store
- columns (list) – columns to pick up from input data. (default is all columns)
- bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns
Example bin_specs dictionary is:
>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0}, 'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}}
Parameters: - var_dtype (dict) – dict of datatypes of the columns to study from dataframe. If not provided, try to determine datatypes directly from dataframe.
- store_at_finalize (bool) – Store histograms in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
- drop_keys (dict) – dictionary used for dropping specific keys from bins dictionaries of histograms
Example drop_keys dictionary is:
>>> drop_keys = {'x': [1,4,8,19], 'y': ['apple', 'pear', 'tomato'], 'x:y': [(1, 'apple'), (19, 'tomato')]}
-
assert_dataframe
(df)¶ Check that input data is a filled pandas data frame.
Parameters: df – input (pandas) data frame
-
categorize_columns
(df)¶ Categorize columns of dataframe by data type.
Parameters: df – input (pandas) data frame
-
drop_requested_keys
(name, counts)¶ Drop requested keys from counts dictionary.
Parameters: - name (string) – key of drop_keys dict to get array of keys to be dropped
- counts (dict) – counts dictionary to drop specific keys from
Returns: count dict without dropped keys
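A minimal sketch of dropping requested keys from a counts dictionary follows. The helper drop_keys_from_counts is a hypothetical stand-in, not the link method: it takes the list of keys directly instead of looking it up by histogram name inside the link.

```python
def drop_keys_from_counts(counts, keys_to_drop):
    """Return a copy of counts without the keys listed in keys_to_drop."""
    dropped = set(keys_to_drop)
    return {k: v for k, v in counts.items() if k not in dropped}


# drop_keys in the style of the example above; x_counts is hypothetical data.
drop_keys = {"x": [1, 4, 8, 19], "x:y": [(1, "apple"), (19, "tomato")]}
x_counts = {1: 10, 2: 7, 19: 1}
cleaned = drop_keys_from_counts(x_counts, drop_keys.get("x", []))
```

Keys that are listed for dropping but absent from the counts, such as 4 and 8 here, are simply ignored.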
-
execute
()¶ Execute the link.
Execute() does four things:
- check presence and data type of requested columns
- timestamp variables are converted to nanosec (integers)
- do the actual value counting based on categories and created indices
- then convert to histograms and add to datastore
-
fill_histogram
(idf, c)¶ Fill input histogram with column(s) of input dataframe.
Parameters: - idf – input data frame used for filling histogram
- c (list) – histogram column(s)
-
finalize
()¶ Finalize the link.
Store Histograms here, if requested.
-
get_all_columns
(data)¶ Retrieve all columns / keys from input data.
Parameters: data – input data sample (pandas dataframe or dict) Returns: list of columns Return type: list
-
get_data_type
(df, col)¶ Get data type of dataframe column.
Parameters: - df – input data frame
- col (str) – column
-
initialize
()¶ Initialize the link.
-
process_and_store
()¶ Store (and possibly process) histogram objects.
-
process_columns
(df)¶ Process columns before histogram filling.
Specifically, convert timestamp columns to integers
Parameters: df – input (pandas) data frame Returns: output (pandas) data frame with converted timestamp columns Return type: pandas DataFrame
-
var_bin_specs
(c, idx=0)¶ Determine bin_specs to use for variable c.
Parameters: - c (list) – list of variables, or string variable
- idx (int) – index of the variable in c, for which to return the bin specs. default is 0.
Returns: selected bin_specs of variable
-
-
eskapade.analysis.histogram_filling.
only_bool
(val)¶ Pass input value or array only if it is a bool.
Parameters: val – value to be evaluated Returns: evaluated value Return type: np.bool or np.ndarray
-
eskapade.analysis.histogram_filling.
only_float
(val)¶ Pass input value or array only if it is a float.
Parameters: val – value to be evaluated Returns: evaluated value Return type: np.float64 or np.ndarray
-
eskapade.analysis.histogram_filling.
only_int
(val)¶ Pass input value or array only if it is an integer.
Parameters: val – value to be evaluated Returns: evaluated value Return type: np.int64 or np.ndarray
-
eskapade.analysis.histogram_filling.
only_str
(val)¶ Pass input value or array only if it is a string.
Parameters: val – value to be evaluated Returns: evaluated value Return type: str or np.ndarray
-
eskapade.analysis.histogram_filling.
to_ns
(x)¶ Convert input timestamps to nanoseconds (integers).
Parameters: x – value to be converted Returns: converted value Return type: int
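The conversion of a timestamp to integer nanoseconds can be sketched with the standard datetime module. This is not the module's to_ns, which operates on the timestamp types found in data frames. The epoch of 1970-01-01 UTC and the treatment of naive datetimes as UTC are assumptions of this example.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns(dt):
    """Nanoseconds between 1970-01-01 UTC and dt, as an exact integer."""
    if dt.tzinfo is None:
        # Assumption for this sketch: naive datetimes are taken as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    delta = dt - EPOCH
    # Integer arithmetic avoids the rounding of float total_seconds().
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000
```

Working with integers means that the uniform binning applied afterwards does not suffer from floating-point rounding at bin boundaries.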
-
eskapade.analysis.histogram_filling.
to_str
(val)¶ Convert input to (array of) string(s).
Parameters: val – value to be converted Returns: converted value Return type: str or np.ndarray
-
eskapade.analysis.histogram_filling.
value_to_bin_center
(val, **kwargs)¶ Convert value to bin center.
Convert a numeric or timestamp column to a common bin center value.
Parameters: - bin_width – bin_width value needed to convert column to a common bin center value
- bin_offset – bin_offset value needed to convert column to a common bin center value
-
eskapade.analysis.histogram_filling.
value_to_bin_index
(val, **kwargs)¶ Convert value to bin index.
Convert a numeric or timestamp column to an integer bin index.
Parameters: - bin_width – bin_width value needed to convert column to an integer bin index
- bin_offset – bin_offset value needed to convert column to an integer bin index
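With uniform binning, a bin index and bin center follow from bin_width and bin_offset by simple arithmetic. The sketch below assumes that bin 0 starts at bin_offset and that bins are half-open on the right. The names uniform_bin_index and uniform_bin_center are hypothetical, and the functions handle scalars rather than whole columns.

```python
import math


def uniform_bin_index(val, bin_width, bin_offset=0):
    """Integer index of the bin [offset + i*width, offset + (i+1)*width)."""
    return math.floor((val - bin_offset) / bin_width)


def uniform_bin_center(val, bin_width, bin_offset=0):
    """Center of the bin that contains val."""
    index = uniform_bin_index(val, bin_width, bin_offset)
    return bin_offset + (index + 0.5) * bin_width
```

Values below bin_offset get negative indices, so no range has to be fixed in advance. This fits the sparse-bin histograms described above, where only bins that occur in the data are stored.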
eskapade.analysis.statistics module¶
Project: Eskapade - A python-based package for data analysis.
Classes: ArrayStats, GroupByStats
Created: 2017/03/21
- Description:
- Summary of an array.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
-
class
eskapade.analysis.statistics.
ArrayStats
(data, col_name, weights=None, unit='', label='')¶ Bases:
object
Create summary of an array.
Class to calculate statistics (mean, standard deviation, percentiles, etc.) and create a histogram of values in an array. The statistics can be returned as values in a dictionary, a printable string, or as a LaTeX string.
-
__init__
(data, col_name, weights=None, unit='', label='')¶ Initialize for a single column in data frame.
Parameters: - data ((keys of) dict) – Input array
- col_name – column name
- weights (string (column of data)) – Input array (default None)
- unit – Unit of column
- label (str) – Label to describe column variable
Raises: TypeError
-
create_mpv_stat
()¶ Compute most probable value from histogram.
This function computes the most probable value based on the histogram from make_histogram(), and adds it to the statistics.
-
create_stats
()¶ Compute statistical properties of column variable.
This function computes the statistical properties of values in the specified column. It is called by other functions that use the resulting figures to create a statistical overview.
-
get_col_props
()¶ Get column properties.
Returns dict: Column properties
-
get_latex_table
(get_stats=None, latex=True)¶ Get LaTeX code string for table of stats values.
Parameters: - get_stats (list) – List of statistics that you want to filter on. (default None (all stats)) Available stats are: ‘count’, ‘filled’, ‘distinct’, ‘mean’, ‘std’, ‘min’, ‘max’, ‘p05’, ‘p16’, ‘p50’, ‘p84’, ‘p95’, ‘p99’
- latex (bool) – LaTeX output or list output (default True)
Returns str: LaTeX code snippet
-
get_print_stats
(to_output=False)¶ Get statistics in printable form.
Parameters: to_output (bool) – Print statistics to output stream? Returns str: Printable statistics string
-
get_x_label
()¶ Get x label.
-
logger
¶ A logger that emits log messages to an observer.
-
make_histogram
(var_bins=30, var_range=None, bin_edges=None, create_mpv_stat=True)¶ Create histogram of column values.
Parameters: - var_bins (int) – Number of histogram bins
- var_range (tuple) – Range of histogram variable
- bin_edges (list) – predefined bin edges to use for histogram. Overrules var_bins.
-
-
class
eskapade.analysis.statistics.
GroupByStats
(data, col_name, groupby=None, weights=None, unit='', label='')¶ Bases:
eskapade.analysis.statistics.ArrayStats
Create summary of an array in groups.
-
__init__
(data, col_name, groupby=None, weights=None, unit='', label='')¶ Initialize for a single column in dataframe.
Parameters: - data ((keys of) dict) – Input array
- col_name – column name
- weights (string (column of data)) – Input array (default None)
- unit – Unit of column
- label (str) – Label to describe column variable
- groupby – column name
Raises: TypeError
-
get_latex_table
(get_stats=None)¶ Get LaTeX code string for group-by table of stats values.
Parameters: get_stats (list) – same as the get_stats keyword of ArrayStats.get_latex_table. Returns str: LaTeX code snippet
-
-
eskapade.analysis.statistics.
get_col_props
(var_type)¶ Get column properties.
Returns dict: Column properties
-
eskapade.analysis.statistics.
weighted_quantile
(data, weights=None, probability=0.5)¶ Compute the weighted quantile of a 1D numpy array.
Weighted quantiles, inspired by https://github.com/nudomarinero/wquantiles/blob/master/wquantiles.py, written by Jose Sabater. Updated here to return multiple quantiles in one go. It also works when weights is None.
Parameters: - data (ndarray) – input array (one dimension).
- weights (ndarray) – array with the weights of the same size of data.
- probability (ndarray) – array of quantiles to compute. Each probability must have a value between 0 and 1.
Returns: list of the output value(s).
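One common way to compute weighted quantiles is to sort the data, attach a mid-point cumulative weight fraction to each value, and interpolate linearly. The pure-Python sketch below follows that scheme. Whether eskapade uses exactly this interpolation is an assumption, and the helper _interp is a hypothetical stand-in for an array interpolation routine.

```python
def _interp(p, xs, ys):
    """Piecewise-linear interpolation of ys over increasing xs, clamped."""
    if p <= xs[0]:
        return ys[0]
    if p >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if p <= xs[i]:
            frac = (p - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])


def weighted_quantile(data, weights=None, probability=(0.5,)):
    """Weighted quantiles of a one-dimensional sequence, one per probability."""
    if weights is None:
        weights = [1.0] * len(data)  # unweighted case: equal weights
    pairs = sorted(zip(data, weights))
    values = [v for v, _ in pairs]
    sorted_w = [w for _, w in pairs]
    total = sum(sorted_w)
    # Cumulative weight up to the middle of each entry, as a fraction.
    cum = 0.0
    positions = []
    for w in sorted_w:
        cum += w
        positions.append((cum - 0.5 * w) / total)
    return [_interp(p, positions, values) for p in probability]
```

With equal weights the median of an even-sized sample lands halfway between the two middle values, and probabilities below the first or above the last position are clamped to the smallest and largest data value.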