eskapade.visualization package

Submodules

eskapade.visualization.vis_utils module

Project: Eskapade - A python-based package for data analysis.

Created: 2017/02/28

Description:
Utility functions to collect Eskapade Python modules, e.g. functions to get the correct Eskapade file paths and environment variables
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

eskapade.visualization.vis_utils.box_plot(df, cause_col, result_col='cost', pdf_file_name='', ylim_quant=0.95, ylim_high=None, ylim_low=0, rot=90, statlim=400, label_dict=None, title_add='', top=20)

Make box plot.

Plot a boxplot of the column df[result_col], grouped by cause_col. The DataFrame is grouped by the cause column, and the distribution per group is drawn as a boxplot using the standard pandas functionality. Groups with fewer than statlim (default 400) entries are automatically removed.

Parameters:
  • df – pandas DataFrame
  • cause_col (str) – name of the column to group on. This can technically be a number, but that is uncommon.
  • result_col (str) – column to do the boxplot on
  • pdf_file_name (str) – if set, will store the plot in a pdf file
  • ylim_quant (float) – quantile used to determine the upper limit of the y-axis when ylim_high is not set
  • ylim_high (float) – upper limit of the y-axis. When set, this value is used; when left at None (the default), the upper limit is determined by ylim_quant
  • ylim_low (float) – lower bound of the y-axis, as passed to matplotlib set_ylim
  • rot (int) – rotation angle of the axis tick labels, in degrees (the rot plotting argument)
  • statlim (int) – minimum number of entries a group must have in order to be plotted
  • label_dict (dict) – dictionary with labels for the columns, usage example: label_dict={'col_x': 'Time'}
  • title_add (str) – string that is added to the automatic title (the y column name)
  • top (int) – maximum number of characters of the x-labels and y-labels that are printed (default is 20)
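
The interplay between ylim_quant and ylim_high described above can be sketched as follows. This is an illustration of the documented behaviour only, not the Eskapade source: the helper name resolve_upper_ylim is hypothetical, and it is an assumption that the quantile is taken over the full result_col column.

```python
import pandas as pd


def resolve_upper_ylim(df, result_col="cost", ylim_quant=0.95, ylim_high=None):
    """Hypothetical helper: choose the y-axis upper limit as the parameter docs describe."""
    # An explicit ylim_high always wins.
    if ylim_high is not None:
        return ylim_high
    # Otherwise fall back to the requested quantile of the result column
    # (assumption: computed over the whole column).
    return df[result_col].quantile(ylim_quant)


df = pd.DataFrame({"cost": range(101)})
from_quantile = resolve_upper_ylim(df)             # limit derived from ylim_quant
explicit = resolve_upper_ylim(df, ylim_high=50.0)  # limit given explicitly
```

With the default ylim_high=None, long tails in result_col are cut off at the chosen quantile, which keeps the boxes readable when a few entries are extreme.
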
eskapade.visualization.vis_utils.delete_smallstat(df, group_col, statlim=400)

Remove low-statistics groups from dataframe.

Create a new DataFrame from which all groups of group_col that have fewer than statlim entries have been removed.

Parameters:
  • df – pandas DataFrame
  • group_col (str) – name of the column to group on
  • statlim (int) – minimum number of entries a group must have to be considered to have sufficient statistics
Returns:

smaller DataFrame and the number of removed categories

Return type:

tuple
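
The filtering described for delete_smallstat can be approximated in plain pandas. The sketch below is a stand-in written from the description above, not the Eskapade implementation; the function name drop_small_groups is hypothetical, and rows whose group value is missing are dropped here because value_counts ignores NaN by default.

```python
import pandas as pd


def drop_small_groups(df, group_col, statlim=400):
    """Hypothetical stand-in for delete_smallstat, based on its documented behaviour."""
    # Number of entries per group (NaN group values are not counted).
    counts = df[group_col].value_counts()
    # Groups with at least statlim entries are kept.
    keep = counts[counts >= statlim].index
    n_removed = int((counts < statlim).sum())
    # Return the smaller DataFrame and the number of removed categories.
    return df[df[group_col].isin(keep)], n_removed


df = pd.DataFrame({
    "cause": ["a"] * 5 + ["b"] * 2 + ["c"],
    "cost": range(8),
})
smaller, n_removed = drop_small_groups(df, "cause", statlim=3)
```

In this toy input only group "a" has three or more entries, so groups "b" and "c" are the ones removed.
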

eskapade.visualization.vis_utils.plot_2d_histogram(hist, x_lim, y_lim, title, x_label, y_label, pdf_file_name)

Plot 2d histogram with matplotlib.

Parameters:
  • hist – input numpy histogram = x_bin_edges, y_bin_edges, bin_entries_2dgrid
  • x_lim (tuple) – range tuple of x-axis (min,max)
  • y_lim (tuple) – range tuple of y-axis (min,max)
  • title (str) – title of plot
  • x_label (str) – Label for histogram x-axis
  • y_label (str) – Label for histogram y-axis
  • pdf_file_name (str) – if set, will store the plot in a pdf file
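
A possible way to build the hist argument is with numpy.histogram2d. Note that numpy returns the bin entries first, whereas the parameter description above lists the bin edges first, so the sketch reorders the result. That the function expects exactly this tuple order, without transposing the grid, is an assumption based on the description.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# numpy.histogram2d returns (bin_entries, x_edges, y_edges).
bin_entries, x_edges, y_edges = np.histogram2d(x, y, bins=(10, 8))

# Reorder to the layout documented above:
# x_bin_edges, y_bin_edges, bin_entries_2dgrid (assumed tuple order).
hist = (x_edges, y_edges, bin_entries)

# Axis ranges taken from the outermost bin edges.
x_lim = (x_edges[0], x_edges[-1])
y_lim = (y_edges[0], y_edges[-1])
```

The resulting hist, x_lim and y_lim could then be passed to plot_2d_histogram together with a title and axis labels.
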
eskapade.visualization.vis_utils.plot_correlation_matrix(matrix_colors, x_labels, y_labels, pdf_file_name='', title='correlation', vmin=-1, vmax=1, color_map='RdYlGn', x_label='', y_label='', top=20, matrix_numbers=None, print_both_numbers=True)

Create and plot correlation matrix.

Parameters:
  • matrix_colors – input correlation matrix
  • x_labels (list) – Labels for histogram x-axis bins
  • y_labels (list) – Labels for histogram y-axis bins
  • pdf_file_name (str) – if set, will store the plot in a pdf file
  • title (str) – if set, title of the plot
  • vmin (float) – minimum value of color legend (default is -1)
  • vmax (float) – maximum value of color legend (default is +1)
  • x_label (str) – Label for histogram x-axis
  • y_label (str) – Label for histogram y-axis
  • color_map (str) – color map passed to matplotlib pcolormesh (default is 'RdYlGn')
  • top (int) – maximum number of characters of the x-labels and y-labels that are printed (default is 20)
  • matrix_numbers – input matrix used for plotting numbers (default is matrix_colors)
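
The inputs for plot_correlation_matrix could be prepared from a pandas DataFrame as sketched below. The toy data is invented for illustration, and it is an assumption that the labels should be supplied in the same order as the matrix columns and rows.

```python
import pandas as pd

# Toy data: b is perfectly correlated with a, c is perfectly anti-correlated.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

corr = df.corr()                 # Pearson correlation by default
matrix_colors = corr.values      # 2D array with values in [-1, 1]
x_labels = list(corr.columns)    # assumed: one label per matrix column
y_labels = list(corr.index)      # assumed: one label per matrix row
```

Because Pearson correlations lie between -1 and +1, the default vmin and vmax cover the full color range for such a matrix.
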
eskapade.visualization.vis_utils.plot_histogram(hist, x_label, y_label=None, is_num=True, is_ts=False, pdf_file_name='', top=20)

Create and plot histogram of column values.

Parameters:
  • hist – input numpy histogram = values, bin_edges
  • x_label (str) – Label for histogram x-axis
  • y_label (str) – Label for histogram y-axis
  • is_num (bool) – True if observable to plot is numeric
  • is_ts (bool) – True if observable to plot is a timestamp
  • pdf_file_name (str) – if set, will store the plot in a pdf file
  • top (int) – maximum number of characters of the x-labels and y-labels that are printed (default is 20)
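
For a numeric observable, the hist argument has the same layout as the return value of numpy.histogram, namely (values, bin_edges). The sketch below builds such a tuple from made-up data; how non-numeric or timestamp observables should be binned is not covered here.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 8.5, 9.0])

# numpy.histogram returns (values, bin_edges), matching the documented layout.
values, bin_edges = np.histogram(data, bins=4, range=(0.0, 10.0))
hist = (values, bin_edges)
```

There is always one more bin edge than there are bin values, since the edges include both the left edge of the first bin and the right edge of the last.
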

Module contents