eskapade.data_quality.links package

Submodules

eskapade.data_quality.links.fix_pandas_dataframe module

Project: Eskapade - A python-based package for data analysis.

Class: FixPandasDataFrame

Created: 2017/04/07

Description: Link for fixing a dirty pandas dataframe with inconsistent datatypes. See example in: tutorials/esk501_fix_pandas_dataframe.py
Authors: KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.data_quality.links.fix_pandas_dataframe.FixPandasDataFrame(**kwargs)

Bases: escore.core.element.Link

Fix dirty Pandas dataframe with inconsistent datatypes.

Default settings perform the following clean-up steps on an input dataframe:

Fix all column names. E.g. remove punctuation and strange characters, and convert spaces to underscores.
Check for various possible nans in the dataset, then make all nans consistent by turning them into numpy.nan (= float).
Per column, dynamically assess the most consistent datatype (ignoring all nans in that column), e.g. bool, int, float, datetime64, string.
Per column, make the data types of all rows consistent by using the identified (or imposed) data type (by default ignoring all nans).
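The first two steps above can be sketched in plain pandas. This is a minimal illustration, not the link's implementation: the helper names (fix_column_name, is_nan_like, normalize_nans) and the list of nan markers are assumptions made for the example.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical set of textual nan markers; the link's own checker may differ.
NAN_MARKERS = {"", "nan", "none", "null", "n/a", "na"}


def fix_column_name(name):
    """Drop punctuation and turn runs of whitespace into underscores."""
    name = re.sub(r"[^\w\s]", "", str(name).strip())
    return re.sub(r"\s+", "_", name)


def is_nan_like(value):
    """Return True for None, float nan, and common textual nan markers."""
    if value is None:
        return True
    if isinstance(value, float) and np.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in NAN_MARKERS


def normalize_nans(df):
    """Replace every nan-like cell with numpy.nan."""
    return df.apply(lambda col: col.map(lambda v: np.nan if is_nan_like(v) else v))


df = pd.DataFrame({"my col!": [1, "N/A", 3], "b (x)": ["a", None, "nan"]})
df.columns = [fix_column_name(c) for c in df.columns]
df = normalize_nans(df)
```

After this, the columns are named my_col and b_x, and every nan-like cell holds numpy.nan, which makes the later dtype assessment simpler.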

Boolean columns with contamination get converted to string columns by default. Optionally, they can be converted to integer columns as well.

The FixPandasDataFrame link can be used in a dataframe loop, in which case any data type assessed per column in the first dataframe iteration will be used for the next dataframes as well.
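The idea of reusing the first iteration's data types can be sketched as follows. DtypeCache is a hypothetical helper written for this example, it only distinguishes float from string, and coercing unparseable values to nan in later chunks is a choice of the sketch rather than documented behaviour of the link.

```python
import pandas as pd


class DtypeCache:
    """Remember the dtype chosen per column on the first chunk (hypothetical helper)."""

    def __init__(self):
        self.dtypes = {}

    def fix(self, chunk):
        out = chunk.copy()
        for col in out.columns:
            if col not in self.dtypes:
                # First time this column is seen: assess its dtype from this chunk.
                numeric = pd.to_numeric(out[col], errors="coerce")
                all_numeric = numeric.notna().sum() == out[col].notna().sum()
                self.dtypes[col] = float if all_numeric else str
            if self.dtypes[col] is float:
                # Sketch assumption: values that cannot be parsed become nan.
                out[col] = pd.to_numeric(out[col], errors="coerce")
            else:
                out[col] = out[col].map(lambda v: v if pd.isna(v) else str(v))
        return out


cache = DtypeCache()
chunks = [pd.DataFrame({"x": ["1", "2"]}), pd.DataFrame({"x": ["3", "oops"]})]
fixed = [cache.fix(chunk) for chunk in chunks]
```

The second chunk on its own would look like a string column, but because the type was fixed as float on the first chunk, it is converted as float too.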

The default settings should work well in many circumstances, but can be configured flexibly. Optionally:

Instead of being dynamically assessed, the data type can be imposed per column.
All nans in a column can be converted to a value consistent with the data type of that column, e.g. for integer columns, nan -> -999.
An alternative nan can be set per column and datatype.
Modifications can be applied inplace, i.e. directly to the input dataframe.
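To illustrate the nan -> -999 conversion for integer columns, here is a small pandas sketch. The function fill_nans_for_dtype is hypothetical, and the str entry in the map is an assumption added for the example; only the int: -999 pairing comes from the text above.

```python
import numpy as np
import pandas as pd

# int: -999 follows the example in the text; the str entry is an assumption.
nan_dtype_map = {int: -999, str: ""}


def fill_nans_for_dtype(column, dtype, nan_map):
    """Replace nans by the value registered for dtype, then cast the column."""
    return column.fillna(nan_map[dtype]).astype(dtype)


ages = pd.Series([31.0, np.nan, 45.0])
ages_int = fill_nans_for_dtype(ages, int, nan_dtype_map)

names = pd.Series(["ann", np.nan])
names_str = fill_nans_for_dtype(names, str, nan_dtype_map)
```

Without the replacement value, a column containing nan could not be stored as a plain integer column, which is why a sentinel such as -999 is needed.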

__init__(**kwargs)

Initialize link instance.

Parameters:

name (str) – name of link
read_key (str) – key of input data to read from data store
copy_columns_from_df (bool) – if true, copy all columns from the dataframe (default is true)
original_columns (list) – original (unfixed) column names to pick up from input data (required if copy_columns_from_df is set to false)
contaminated_columns (list) – (original) columns that are known to have mistakes and that should be fixed (optional)
fix_column_names (bool) – if true, fix column names (default is true)
strip_hive_prefix (bool) – if true, strip table-name (hive) prefix from column names, e.g. table.bla -> bla (default is false)
convert_inconsistent_dtypes (bool) – fix column datatypes in case of data type inconsistencies in rows (default is true)
var_dtype (dict) – dict forcing columns to certain datatypes, e.g. {'A': int} (optional)
var_convert_inconsistent_dtypes (dict) – dict allowing one to overwrite if certain columns datatypes should be fixed, e.g. {'A': False} (optional)
var_convert_func (dict) – dict with datatype conversion functions for certain columns
check_nan_func – function returning a boolean, used to check for nans in columns (default is None, in which case a standard checker function gets picked up)
convert_inconsistent_nans (bool) – if true, convert all nans to data type consistent with rest of column (default is false)
var_convert_inconsistent_nans (dict) – dict allowing one to overwrite if certain column nans should be fixed, e.g. {'A': False} (optional)
var_nan (dict) – dict with nans for certain columns (optional)
nan_dtype_map (dict) – dictionary of nans for given data types, e.g. { int: -999 }
nan_default – default nan value to which all nans found get converted (default is numpy.nan)
var_bool_to_int (list) – boolean columns to convert to int (default is conversion of boolean to string)
inplace (bool) – replace original columns; overwrites store_key to read_key (default is False)
store_key (str) – key of output data to store in data store
drop_dup_rec (bool) – if true, drop duplicate records from data frame after other fixes (default is false)
strip_string_columns (bool) – if true, apply strip command to string columns (default is true)
cleanup_string_columns (bool or list) – apply clean-up to all string columns (if true) or to a list of selected string columns. More aggressive than strip. Default is empty (= false).
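A possible configuration might look like the sketch below. The keyword names are the parameters documented above; the values, data store keys, and column names are made up for illustration. The build_link helper is hypothetical and needs Eskapade installed; how the link is then added to an analysis chain is not shown here.

```python
# Keyword names follow the parameter list above; all values are illustrative.
link_kwargs = dict(
    name="fixer",
    read_key="raw_df",                # hypothetical data store key
    store_key="fixed_df",             # hypothetical data store key
    var_dtype={"age": int},           # impose int on a hypothetical column
    convert_inconsistent_nans=True,
    nan_dtype_map={int: -999},
    var_bool_to_int=["is_active"],    # hypothetical boolean column
    strip_hive_prefix=True,
)


def build_link(kwargs):
    """Create the link from keyword arguments (requires Eskapade)."""
    from eskapade.data_quality.links import FixPandasDataFrame

    return FixPandasDataFrame(**kwargs)
```

Keeping the settings in a plain dict makes it easy to reuse the same configuration for several dataframes.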

execute()

Execute the link.

Fixing the Pandas dataframe consists of four steps:

Fix all column names. E.g. remove punctuation and strange characters, and convert spaces to underscores.
Check existing nans in the dataset, and make all nans consistent, for easy conversion later on.
Assess the most consistent datatype for each column (ignoring all nans).
Make data types in each row consistent (by default ignoring all nans).
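The last two steps can be sketched in plain pandas as follows. This is an illustration under assumptions, not the link's code: assess_dtype and convert_value are hypothetical helpers, "most consistent" is taken to mean "most frequent Python type", and values that cannot be cast are turned into nan.

```python
from collections import Counter

import numpy as np
import pandas as pd


def assess_dtype(column):
    """Return the most frequent Python type among the non-nan values."""
    counts = Counter(type(v) for v in column if not pd.isna(v))
    return counts.most_common(1)[0][0] if counts else float


def convert_value(value, dtype):
    """Cast one value to dtype; keep nans and fall back to nan on failure."""
    if pd.isna(value):
        return value
    try:
        return dtype(value)
    except (TypeError, ValueError):
        return np.nan  # sketch assumption: unconvertible values become nan


df = pd.DataFrame({"n": [1, "2", 3, np.nan], "s": ["a", 5, "c", "d"]})
dtypes = {col: assess_dtype(df[col]) for col in df.columns}
for col, dtype in dtypes.items():
    df[col] = df[col].map(lambda v, d=dtype: convert_value(v, d))
```

Here column n is assessed as int, so the string "2" is cast to a number, while column s is assessed as str, so the stray 5 becomes "5". Because n still contains a nan, pandas stores it as a float column.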

initialize()

Initialize the link.

eskapade.data_quality.links.fix_pandas_dataframe.determine_preferred_dtype(dtype_cnt)

Determine preferred column data type.
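The selection rule is not spelled out here. As a rough idea of what choosing a preferred type from a count of types could look like, the sketch below assumes that dtype_cnt maps types to occurrence counts, that the most frequent type wins, and that ties go to the more general type. The function pick_preferred_dtype and the TIE_BREAK order are hypothetical and are not the library's implementation.

```python
from collections import Counter

# Hypothetical tie-break order: prefer the more general type.
TIE_BREAK = [str, float, int, bool]


def _rank(dtype):
    """Position in TIE_BREAK; unknown types sort last."""
    return TIE_BREAK.index(dtype) if dtype in TIE_BREAK else len(TIE_BREAK)


def pick_preferred_dtype(dtype_cnt):
    """Return the most frequent type in dtype_cnt, or None if it is empty."""
    if not dtype_cnt:
        return None
    top = max(dtype_cnt.values())
    candidates = [t for t, n in dtype_cnt.items() if n == top]
    return min(candidates, key=_rank)


dtype_cnt = Counter(type(v) for v in [1, 2, "x", 3.5, 4])
preferred = pick_preferred_dtype(dtype_cnt)
```

In this example the values contain three ints, one str, and one float, so the preferred type is int.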

Module contents

class eskapade.data_quality.links.FixPandasDataFrame(**kwargs)