eskapade.visualization.links package¶

Submodules¶

eskapade.visualization.links.correlation_summary module¶

Project: Eskapade - A python-based package for data analysis.

Class : CorrelationSummary

Created: 2017/03/13

Description: Algorithm to create correlation heatmaps.
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.visualization.links.correlation_summary.CorrelationSummary(**kwargs)¶

Bases: eskapade.core.element.Link

Create a heatmap of correlations between dataframe variables.

__init__(**kwargs)¶

Initialize link instance.

Parameters:

name (str) – name of link
read_key (str) – key of input dataframe to read from data store
store_key (str) – key of correlations dataframe in data store
results_path (str) – path to save correlation summary pdf
methods (list) – method(s) of computing correlations
pages_key (str) – data store key of existing report pages
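
As a rough illustration of what the methods parameter selects, the sketch below computes one correlation matrix per requested method for a small set of columns. It is not Eskapade's implementation: the helper names (pearson, spearman, correlation_matrices), the method strings "pearson" and "spearman", and the sample data are assumptions made for this example, and plain Python lists stand in for a dataframe.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical method registry; the names are assumptions for this sketch.
METHODS = {"pearson": pearson, "spearman": spearman}


def correlation_matrices(columns, methods):
    """Return {method: {col_a: {col_b: coefficient}}} for a dict of columns."""
    return {
        method: {
            a: {b: METHODS[method](columns[a], columns[b]) for b in columns}
            for a in columns
        }
        for method in methods
    }


data = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 4.0, 9.0, 16.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}
matrices = correlation_matrices(data, methods=["pearson", "spearman"])
```

Each nested dict in matrices holds the values that a heatmap for that method would display, one cell per pair of columns.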

execute()¶: Execute the link.

finalize()¶: Finalize the link.

initialize()¶: Initialize the link.

eskapade.visualization.links.df_boxplot module¶

Project: Eskapade - A python-based package for data analysis.

Class : DfBoxplot

Created: 2017/02/17

Description: Link to create a boxplot of data frame columns.
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.visualization.links.df_boxplot.DfBoxplot(**kwargs)¶

Bases: eskapade.core.element.Link

Create a boxplot of one column of a DataFrame that is grouped by values from a second column.

Creates a report page for each variable in DataFrame, containing:

a profile of the column dataset
a nicely scaled plot of the boxplots per group of the column

Example is available in: tutorials/esk304_df_boxplot.py

__init__(**kwargs)¶

Initialize link instance.

Parameters:

name (str) – name of link
read_key (str) – key of input data to read from data store
results_path (str) – output path of summary result files
column (str) – column to pick up from input data to use as boxplot input
cause_columns (list) – list of columns (str) to group by; a boxplot is plotted per unique value
statistics (list) – list of strings naming the statistics to generate for the boxplot. The full list is taken from statistics.ArrayStats.get_latex_table. Defaults to ['count', 'mean', 'min', 'max'].
pages_key (str) – data store key of existing report pages
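
To make the interplay of column, cause_columns and statistics concrete, here is a minimal sketch that groups one value column by one cause column and computes the documented default statistics per group. The function group_statistics, the row layout (a list of dicts) and the sample values are assumptions for illustration; Eskapade itself works on a dataframe and its own statistics objects.

```python
from collections import defaultdict

# Hypothetical mapping from statistic name to a function over a list of values.
AVAILABLE = {
    "count": len,
    "mean": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
}


def group_statistics(rows, column, cause_column,
                     statistics=("count", "mean", "min", "max")):
    """Group `column` by the values of `cause_column` and summarise each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cause_column]].append(row[column])
    return {
        group: {name: AVAILABLE[name](values) for name in statistics}
        for group, values in groups.items()
    }


rows = [
    {"city": "A", "income": 10.0},
    {"city": "A", "income": 30.0},
    {"city": "B", "income": 5.0},
]
table = group_statistics(rows, column="income", cause_column="city")
```

With several cause columns, the same call would simply be repeated once per cause column.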

execute()¶

Execute the link.

Creates a report page for each column that we group-by in the data frame.

create statistics object for group
create overview table of column variable
plot boxplot of column variable per group
store plot
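
The steps above can be sketched without any plotting backend by computing the numbers a boxplot draws for each group. This is an assumption-laden illustration, not the link's actual code: boxplot_summary is a hypothetical helper, quartiles use the "inclusive" method of the standard library, and whiskers follow the common 1.5 x IQR convention, which may differ from what Eskapade's plotting produces.

```python
import statistics


def boxplot_summary(values):
    """Quartiles, whisker ends and outliers for one group (needs >= 2 values)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# One list of values per group, as produced by a group-by on a cause column.
groups = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
}
summaries = {name: boxplot_summary(values) for name, values in groups.items()}
```

Each entry of summaries corresponds to one box in the plot for the grouped column; drawing and storing the figure are left out of the sketch.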

finalize()¶: Finalize the link.

initialize()¶: Initialize the link.

eskapade.visualization.links.df_summary module¶

Project: Eskapade - A python-based package for data analysis.

Class : DfSummary

Created: 2017/02/17

Description: Link to create a statistics summary of data frame columns or of a set of histograms.
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.visualization.links.df_summary.DfSummary(**kwargs)¶

Bases: eskapade.core.element.Link

Create a summary of a dataframe.

Creates a report page for each variable in data frame, containing:

a profile of the column dataset
a nicely scaled plot of the column dataset

Example 1 is available in: tutorials/esk301_dfsummary_plotter.py

Example 2 is available in: tutorials/esk303_histogram_filling_plotting.py

Empty histograms are automatically skipped from processing.

__init__(**kwargs)¶

Initialize link instance.

Parameters:

name (str) – name of link
read_key (str) – key of input dataframe (or histogram-dict) to read from data store
results_path (str) – output path of summary result files
columns (list) – columns (or histogram keys) to pick up from input data to make and plot summaries for
hist_keys (list) – alternative to columns (optional)
var_labels (dict) – dict of column names with a label per column
var_units (dict) – dict of column names with a unit per column
var_bins (dict) – dict of column names with the number of bins per column. Default per column is 30.
hist_y_label (str) – y-axis label to plot for all columns. Default is ‘Bin Counts’.
pages_key (str) – data store key of existing report pages
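
The effect of var_bins and its per-column default of 30 can be pictured with the following sketch, which fills an equal-width histogram for each column and falls back to the default when a column has no entry in var_bins. The histogram function, the equal-width binning and the sample columns are assumptions for this example and do not reproduce Eskapade's own binning logic.

```python
DEFAULT_BINS = 30  # documented default number of bins per column


def histogram(values, n_bins):
    """Equal-width histogram returned as (bin_entries, bin_edges)."""
    low, high = min(values), max(values)
    if low == high:
        high = low + 1.0  # avoid zero-width bins for constant columns
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    entries = [0] * n_bins
    for value in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((value - low) / width), n_bins - 1)
        entries[index] += 1
    return entries, edges


var_bins = {"age": 5}  # only "age" overrides the default
columns = {
    "age": [20.0, 25.0, 30.0, 35.0, 40.0],
    "income": [float(i) for i in range(60)],
}
histograms = {
    name: histogram(values, var_bins.get(name, DEFAULT_BINS))
    for name, values in columns.items()
}
```

Here "age" is binned into 5 bins while "income" uses the default of 30.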

assert_data_type(data)¶

Check type of input data.

Parameters:	data – input data sample (pandas dataframe or dict)

execute()¶

Execute the link.

Creates a report page for each variable in data frame.

create statistics object for column
create overview table of column variable
plot histogram of column variable
store plot
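
As a sketch of the per-variable loop listed above, the code below builds one plain-text "page" per column containing an overview table of basic statistics. The helpers column_overview and render_page, the chosen statistics and the text layout are hypothetical; the real link produces report pages with plots, which are omitted here.

```python
import statistics


def column_overview(values):
    """Basic statistics for one column of numeric values."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


def render_page(name, overview):
    """Format the overview of one variable as a small text table."""
    lines = [f"Variable: {name}"]
    lines += [f"  {key:<5} {value:>10.3f}" for key, value in overview.items()]
    return "\n".join(lines)


columns = {
    "age": [1.0, 2.0, 3.0, 4.0],
    "income": [10.0, 20.0, 30.0],
}

pages = []
for name, values in columns.items():
    overview = column_overview(values)          # statistics for the column
    pages.append(render_page(name, overview))   # overview table for the page
```

The pages list plays the role of the collection of report pages, with one entry appended per processed variable.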

Returns:	execution status code
Return type:	StatusCode

finalize()¶: Finalize the link.

get_all_columns(data)¶

Retrieve all columns / keys from input data.

Parameters:	data – input data sample (pandas dataframe or dict)
Returns:	list of columns
Return type:	list

get_length(data)¶

Get length of data set.

Parameters:	data – input data (pandas dataframe or dict)
Returns:	length of data set

get_sample(data, key)¶

Retrieve specific column or item from input data.

Parameters:

data – input data (pandas dataframe or dict)
key (str) – column key

Returns:	data series or item
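
The helper methods above all accept either a dataframe or a dict of histograms. A minimal sketch of that dual handling follows; the function names (check_data_type, list_columns, pick_sample) are deliberately different from the link's methods, and the use of a `.columns` attribute as a stand-in for a pandas dataframe check is an assumption that keeps the example free of third-party imports.

```python
def check_data_type(data):
    """Accept a dict or any table-like object that exposes `.columns`."""
    if not (isinstance(data, dict) or hasattr(data, "columns")):
        raise TypeError(f"unsupported input type: {type(data).__name__}")


def list_columns(data):
    """Column names of a table-like object, or keys of a dict."""
    check_data_type(data)
    if isinstance(data, dict):
        return list(data.keys())
    return list(data.columns)


def pick_sample(data, key):
    """Return one column or one dict item; both support `data[key]`."""
    check_data_type(data)
    return data[key]


# Hypothetical dict of (bin_entries, bin_edges) histograms keyed by variable.
histograms = {
    "age": ([3, 5, 2], [0, 10, 20, 30]),
    "income": ([1, 4], [0.0, 50.0, 100.0]),
}
names = list_columns(histograms)
age_hist = pick_sample(histograms, "age")
```

Because both input kinds support indexing by key, only the column listing needs a type-specific branch in this sketch.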

initialize()¶: Initialize the link.

process_1d_histogram(name, hist)¶

Create statistics of and plot input 1d histogram.

Parameters:

name (str) – name of the histogram
hist – input histogram object

process_2d_histogram(name, hist)¶

Create statistics of and plot input 2d histogram.

Parameters:

name (str) – name of the histogram
hist – input histogram object

process_nan_histogram(nphist, n_data)¶

Process NaNs histogram.

Add the NaNs histogram to the pdf list.

Parameters:

nphist – numpy-style input histogram, consisting of comma-separated bin_entries, bin_edges
n_data (int) – number of entries in the processed data set

process_sample(name, sample)¶

Process various possible data samples.

Parameters:

name (str) – name of sample
sample – input pandas series object or histogram

process_series(col, sample)¶

Create statistics of and plot input pandas series.

Parameters:

col (str) – name of the series
sample – input pandas series object

Module contents¶

class eskapade.visualization.links.CorrelationSummary(**kwargs)¶

Bases: eskapade.core.element.Link

Create a heatmap of correlations between dataframe variables.

__init__(**kwargs)¶

Initialize link instance.

Parameters:

name (str) – name of link
read_key (str) – key of input dataframe to read from data store
store_key (str) – key of correlations dataframe in data store
results_path (str) – path to save correlation summary pdf
methods (list) – method(s) of computing correlations
pages_key (str) – data store key of existing report pages

execute()¶: Execute the link.

finalize()¶: Finalize the link.

initialize()¶: Initialize the link.

class eskapade.visualization.links.DfBoxplot(**kwargs)¶

Bases: eskapade.core.element.Link

Create a boxplot of one column of a DataFrame that is grouped by values from a second column.

Creates a report page for each variable in DataFrame, containing:

a profile of the column dataset
a nicely scaled plot of the boxplots per group of the column

Example is available in: tutorials/esk304_df_boxplot.py

__init__(**kwargs)¶

Initialize link instance.

Parameters:

name (str) – name of link
read_key (str) – key of input data to read from data store
results_path (str) – output path of summary result files
column (str) – column to pick up from input data to use as boxplot input
cause_columns (list) – list of columns (str) to group by; a boxplot is plotted per unique value
statistics (list) – list of strings naming the statistics to generate for the boxplot. The full list is taken from statistics.ArrayStats.get_latex_table. Defaults to ['count', 'mean', 'min', 'max'].
pages_key (str) – data store key of existing report pages

execute()¶

Execute the link.

Creates a report page for each column that we group-by in the data frame.

create statistics object for group
create overview table of column variable
plot boxplot of column variable per group
store plot

finalize()¶: Finalize the link.

initialize()¶: Initialize the link.

class eskapade.visualization.links.DfSummary(**kwargs)¶

Bases: eskapade.core.element.Link

Create a summary of a dataframe.

Creates a report page for each variable in data frame, containing:

a profile of the column dataset
a nicely scaled plot of the column dataset

Example 1 is available in: tutorials/esk301_dfsummary_plotter.py

Example 2 is available in: tutorials/esk303_histogram_filling_plotting.py

Empty histograms are automatically skipped from processing.

__init__(**kwargs)¶

Initialize link instance.

Parameters:

name (str) – name of link
read_key (str) – key of input dataframe (or histogram-dict) to read from data store
results_path (str) – output path of summary result files
columns (list) – columns (or histogram keys) to pick up from input data to make and plot summaries for
hist_keys (list) – alternative to columns (optional)
var_labels (dict) – dict of column names with a label per column
var_units (dict) – dict of column names with a unit per column
var_bins (dict) – dict of column names with the number of bins per column. Default per column is 30.
hist_y_label (str) – y-axis label to plot for all columns. Default is ‘Bin Counts’.
pages_key (str) – data store key of existing report pages

assert_data_type(data)¶

Check type of input data.

Parameters:	data – input data sample (pandas dataframe or dict)

execute()¶

Execute the link.

Creates a report page for each variable in data frame.

create statistics object for column
create overview table of column variable
plot histogram of column variable
store plot

Returns:	execution status code
Return type:	StatusCode

finalize()¶: Finalize the link.

get_all_columns(data)¶

Retrieve all columns / keys from input data.

Parameters:	data – input data sample (pandas dataframe or dict)
Returns:	list of columns
Return type:	list

get_length(data)¶

Get length of data set.

Parameters:	data – input data (pandas dataframe or dict)
Returns:	length of data set

get_sample(data, key)¶

Retrieve specific column or item from input data.

Parameters:

data – input data (pandas dataframe or dict)
key (str) – column key

Returns:	data series or item

initialize()¶: Initialize the link.

process_1d_histogram(name, hist)¶

Create statistics of and plot input 1d histogram.

Parameters:

name (str) – name of the histogram
hist – input histogram object

process_2d_histogram(name, hist)¶

Create statistics of and plot input 2d histogram.

Parameters:

name (str) – name of the histogram
hist – input histogram object

process_nan_histogram(nphist, n_data)¶

Process NaNs histogram.

Add the NaNs histogram to the pdf list.

Parameters:

nphist – numpy-style input histogram, consisting of comma-separated bin_entries, bin_edges
n_data (int) – number of entries in the processed data set

process_sample(name, sample)¶

Process various possible data samples.

Parameters:

name (str) – name of sample
sample – input pandas series object or histogram

process_series(col, sample)¶

Create statistics of and plot input pandas series.

Parameters:

col (str) – name of the series
sample – input pandas series object