eskapade.analysis.links package¶

Submodules¶

eskapade.analysis.links.apply_func_to_df module¶

Project: Eskapade - A python-based package for data analysis.

Class: ApplyFuncToDf

Created: 2016/11/08

Description:
Algorithm to apply one or more functions to a (grouped) dataframe column or to an entire dataframe.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

class eskapade.analysis.links.apply_func_to_df.ApplyFuncToDf(**kwargs)¶

Bases: escore.core.element.Link

Apply functions to data-frame.

Applies one or more functions to a (grouped) dataframe column or to an entire dataframe. In the latter case, this can be done row-wise or column-wise. The input dataframe will be overwritten.

__init__(**kwargs)¶

Initialize link instance.

Parameters:
  • read_key (str) – data-store input key
  • store_key (str) – data-store output key
  • apply_funcs (list) – functions to apply (list of dicts). Each dict can contain the following keys:
      • ‘func’: function to apply
      • ‘colout’ (string): output column
      • ‘colin’ (string, optional): input column
      • ‘entire’ (boolean, optional): apply to the entire dataframe?
      • ‘args’ (tuple, optional): args for ‘func’
      • ‘kwargs’ (dict, optional): kwargs for ‘func’
      • ‘groupby’ (list, optional): column names to group by
      • ‘groupbyColout’ (string): output column after the split-apply-combine combination
  • add_columns (dict) – columns to add to output (name, column)
add_apply_func(func, out_column, in_column='', *args, **kwargs)¶

Add function to be applied to dataframe.

execute()¶

Execute the link.

groupbyapply(df, groupby_columns, applyfunc, *args, **kwargs)¶

Apply groupby to dataframe.

initialize()¶

Initialize the link.
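
As a rough illustration of the operations this link wraps, the plain-pandas sketch below applies a function to a single column, to the entire dataframe row-wise, and to a grouped column. It does not use Eskapade itself. The dataframe, column names and functions are made up for the example, and the comments only indicate which apply_funcs keys each step would presumably correspond to.

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "x": [1.0, 4.0, 9.0]})

# Single column: compare an entry with 'func', 'colin': 'x', 'colout': 'x_sqrt'.
df["x_sqrt"] = df["x"].apply(lambda value: value ** 0.5)

# Entire dataframe, row-wise: compare an entry with 'entire': True.
df["x_total"] = df.apply(lambda row: row["x"] + row["x_sqrt"], axis=1)

# Grouped column: compare an entry with 'groupby': ['group'].
group_sums = df.groupby("group")["x"].sum()
```

How the real link stores grouped results (the ‘groupbyColout’ key) is not shown here.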

eskapade.analysis.links.apply_selection_to_df module¶

Project: Eskapade - A python-based package for data analysis.

Class: ApplySelectionToDf

Created: 2016/11/08

Description:
Algorithm to apply queries to input dataframe
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.apply_selection_to_df.ApplySelectionToDf(**kwargs)¶

Bases: escore.core.element.Link

Applies queries with sub-selections to a pandas dataframe.

__init__(**kwargs)¶

Initialize link instance.

Input dataframe is not overwritten, unless instructed to do so in kwargs.

Parameters:
  • name (str) – name of link
  • read_key (str) – key of data to read from data store
  • store_key (str) – key of data to store in data store. If not set, the data at read_key is overwritten.
  • query_set (list) – list of query expressions (strings), evaluated in the order given; see the pandas documentation
  • select_columns (list) – column names to select after querying
  • continue_if_failure (bool) – if True, continue with the next query after a failure (optional)
  • kwargs – all other keyword arguments are passed on to the pandas queries.
execute()¶

Execute the link.

Applies queries or column selection to a pandas DataFrame. Input dataframe is not overwritten, unless told to do so in kwargs.

  1. Apply queries, in order of provided query list.
  2. Select columns (if provided).
initialize()¶

Initialize the link.

Perform checks on provided attributes.
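
The two steps described above (queries in order, then column selection) can be sketched in plain pandas as follows. This is an assumption about the general approach, using DataFrame.query, and not the link's actual implementation. The example data and the variable name selected are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 5, 10, 20],
                   "y": [0.1, 0.9, 0.5, 0.7],
                   "z": ["a", "b", "c", "d"]})

query_set = ["x > 2", "y < 0.8"]   # applied one after the other
select_columns = ["x", "z"]        # applied after all queries

selected = df
for query in query_set:
    # DataFrame.query returns a new dataframe, so df itself is left untouched.
    selected = selected.query(query)
selected = selected[select_columns]
```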

eskapade.analysis.links.basic_generator module¶

Project: Eskapade - A python-based package for data analysis.

Class: BasicGenerator

Created: 2017/02/26

Description:
Link to generate random data with basic distributions
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.basic_generator.BasicGenerator(**kwargs)¶

Bases: escore.core.element.Link

Generate data with basic distributions.

__init__(**kwargs)¶

Initialize link instance.

Parameters:
  • key (str) – key of output data in data store
  • columns (list) – output column names
  • size (int) – number of variable values
  • gen_config (dict) – generator configuration for each variable
  • gen_seed (int) – generator random seed
execute()¶

Execute the link.

initialize()¶

Initialize the link.

eskapade.analysis.links.df_concatenator module¶

Project: Eskapade - A python-based package for data analysis.

Class: DfConcatenator

Created: 2016/11/08

Description:
Algorithm to concatenate multiple pandas dataframes
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.df_concatenator.DfConcatenator(**kwargs)¶

Bases: escore.core.element.Link

Concatenates multiple pandas dataframes.

__init__(**kwargs)¶

Initialize link instance.

Parameters:
  • name (str) – name of link
  • store_key (str) – key of data to store in data store
  • read_keys (list) – keys of pandas dataframes in the data store
  • ignore_missing_input (bool) – Skip missing input datasets. If all missing, store empty dataset. Default is false.
  • kwargs – all other keyword arguments are passed on to the pandas concat function.
execute()¶

Execute the link.

Perform concatenation of multiple pandas dataframes.

initialize()¶

Initialize the link.
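
A minimal sketch of the concatenation step, assuming the data store behaves like a dictionary of dataframes. The dictionary data_store, the key names and the error handling are hypothetical stand-ins; only pd.concat and its ignore_index keyword are real pandas API.

```python
import pandas as pd

# Stand-in for the data store: a plain dict of dataframes.
data_store = {"df1": pd.DataFrame({"x": [1, 2]}),
              "df2": pd.DataFrame({"x": [3]})}

read_keys = ["df1", "df2", "df_missing"]
ignore_missing_input = True

frames = []
for key in read_keys:
    if key in data_store:
        frames.append(data_store[key])
    elif not ignore_missing_input:
        raise KeyError(f"input dataframe '{key}' not found")

# If every input is missing, fall back to an empty dataframe.
result = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```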

eskapade.analysis.links.df_merger module¶

Project: Eskapade - A python-based package for data analysis.

Class: DfMerger

Created: 2016/11/08

Description:
Algorithm to merge two pandas DataFrames
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.df_merger.DfMerger(**kwargs)¶

Bases: escore.core.element.Link

Merges two pandas dataframes.

__init__(**kwargs)¶

Initialize link instance.

Store the configuration of the link.

Parameters:
  • name (str) – name of link
  • input_collection1 (str) – datastore key of the first pandas.DataFrame to merge
  • input_collection2 (str) – datastore key of the second pandas.DataFrame to merge
  • output_collection (str) – datastore key of the merged output pandas.DataFrame
  • how (str) – merge mode. See pandas documentation.
  • on (list) – column names. See pandas documentation.
  • columns1 (list) – column names of the first pandas.DataFrame. Only these columns are included in the merge. If not set, use all columns.
  • columns2 (list) – column names of the second pandas.DataFrame. Only these columns are included in the merge. If not set, use all columns.
  • remove_duplicate_cols2 (bool) – if True, duplicate columns are removed before the merge (default is True)
  • kwargs – all other keyword arguments are passed on to the pandas merge function.
execute()¶

Perform merging of input dataframes.

initialize()¶

Perform basic checks on provided attributes.
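
The parameters above map closely onto pandas.merge. The sketch below shows the presumed equivalent in plain pandas, with the column pre-selection done by hand. The dataframes and column names are invented for the example.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "a": [10, 20, 30], "unused": [0, 0, 0]})
right = pd.DataFrame({"id": [2, 3, 4], "b": ["x", "y", "z"]})

columns1 = ["id", "a"]   # only these columns of the first dataframe take part
columns2 = ["id", "b"]   # only these columns of the second dataframe take part

merged = pd.merge(left[columns1], right[columns2], how="inner", on=["id"])
```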

eskapade.analysis.links.histogrammar_filler module¶

eskapade.analysis.links.random_sample_splitter module¶

Project: Eskapade - A python-based package for data analysis.

Class: RandomSampleSplitter

Created: 2016/11/08

Description:
Algorithm to randomly assign records to a number of classes
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.random_sample_splitter.RandomSampleSplitter(**kwargs)¶

Bases: escore.core.element.Link

Link that randomly assigns records of an input dataframe to a number of classes.

After assigning classes, the link does one of the following:

  • splits the input dataframe into sub-dataframes according to the assigned classes and stores the sub-dataframes in the datastore;
  • adds a new column with the assigned classes to the dataframe.

Records are assigned randomly.

__init__(**kwargs)¶

Initialize link instance.

Parameters:
  • name (str) – name of link
  • read_key (str) – key of data to read from datastore
  • store_key (list) – keys of datasets to store in datastore. The number of sub-samples equals the length of the store_key list (optional, instead of ‘column’ and ‘nclasses’).
  • column (str) – name of the new column that specifies the randomly assigned class. Default is randomclass (optional, instead of ‘store_key’).
  • nclasses (int) – number of random classes. Needs to be set when ‘store_key’ is not used.
  • fractions (list) – list of fractions (0 < fraction < 1) of records assigned to the sub-samples. The list can be one shorter than the number of classes. The sum can be less than 1. Needs to be set unless ‘nevents’ is given.
  • nevents (list) – list of numbers of random records assigned to the sub-samples. The list can be one shorter than the number of classes (optional, instead of ‘fractions’).
execute()¶

Execute the link.

initialize()¶

Check and initialize attributes of the link.
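
To make the roles of ‘fractions’ and ‘nclasses’ concrete, here is a hedged sketch of one way to assign random classes with NumPy and pandas. It assumes that a fractions list one shorter than the number of classes leaves the remainder to the last class. The seed, the rounding and the variable names are choices made for this example, not the link's documented behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})
fractions = [0.6]   # one less than nclasses: the last class gets the remainder
nclasses = 2

rng = np.random.default_rng(42)

# Number of records per class; the final class takes whatever is left over.
counts = [int(round(fraction * len(df))) for fraction in fractions]
counts.append(len(df) - sum(counts))

# Build the class labels and shuffle them over the records.
labels = np.repeat(np.arange(nclasses), counts)
df["randomclass"] = rng.permutation(labels)

# Alternative output: one sub-dataframe per class.
samples = [df[df["randomclass"] == cls] for cls in range(nclasses)]
```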

eskapade.analysis.links.read_to_df module¶

Project: Eskapade - A python-based package for data analysis.

Class: ReadToDf

Created: 2016/11/08

Description:
Algorithm to read input file(s) into a pandas dataframe and put it in the datastore.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.read_to_df.ReadToDf(**kwargs)¶

Bases: escore.core.element.Link

Reads input file(s) to a pandas dataframe.

Give the link the path of the input file and, optionally, keyword arguments. The keyword arguments are passed on to the pandas file reader.

__init__(**kwargs)¶

Initialize link instance.

Store the configuration of link ReadToDf.

Parameters:
  • name (str) – Name given to the link
  • path (str) – path of your file to read into a pandas DataFrame.
  • key (str) – storage key for the DataStore.
  • reader – reader is determined automatically, but can be set by hand, e.g. csv, xlsx.

    To use the numpy reader, one of the following should be true:

      • reader is one of {‘numpy’, ‘np’, ‘npy’, ‘npz’}
      • path contains one of the extensions {‘npy’, ‘npz’}
      • param file_type is one of {‘npy’, ‘npz’}

    To use the feather reader, one of the following should be true:

      • reader is one of {‘feather’, ‘ft’}
      • path contains the extension ‘ft’

    For when to use feather or which numpy type, see the esk210_dataframe_restoration tutorial.
  • restore_index (bool) – whether to store the index in the metadata. Default is False when the index is numeric, True otherwise.
  • file_type (str) – {‘npy’, ‘npz’} when using the numpy reader. Optional, see reader for details.
  • itr_over_files (bool) – iterate over individual files. Default is false. If false, all files are collected in one dataframe. NB: chunksize takes priority!
  • chunksize (int) – default is None. If set to a positive integer, the link will always iterate. chunksize requires pd.read_csv or pd.read_table.
  • n_files_in_fork (int) – number of files to process if forked. Default is 1.
  • kwargs – all other keyword arguments are passed on to the pandas reader.

config_lock¶

Get lock status of configuration.

Default lock status is False.

Returns: lock status of configuration
Return type: bool
configure_paths(lock: bool = False) → None¶

Configure paths used during execute.

This is the final part of initialization, and needs to be redone in case of forked processing. Hence this function is split off into a separate function. The function can be locked once the configuration is final.

Parameters: lock (bool) – if True, lock this part of the configuration
execute()¶

Execute the link.

Reads the input file(s) and puts the dataframe in the datastore.

initialize()¶

Initialize the link.

is_finished() → bool¶

Try to assess if looper is done iterating over files.

Assess if looper is done or if a next dataset is still coming up.

latest_data_length()¶

Return length of current dataset.

set_chunk_size(size)¶

Set chunksize setting.

Parameters: size – chunk size
sum_data_length()¶

Return sum length of all datasets processed so far.
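
The chunked iteration that chunksize enables can be pictured with plain pandas. In the sketch below, pd.read_csv with chunksize returns an iterator of dataframes, and the per-chunk and summed lengths loosely mirror what latest_data_length() and sum_data_length() report. The in-memory CSV and the variable names are assumptions made for the example.

```python
import io

import pandas as pd

csv_text = "x,y\n1,a\n2,b\n3,c\n4,d\n5,e\n"

chunk_lengths = []
# With chunksize set, read_csv yields dataframes of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_lengths.append(len(chunk))

total_length = sum(chunk_lengths)
```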

eskapade.analysis.links.read_to_df.feather_reader(path, restore_index)¶

Read a feather file from disk into a DataFrame, restoring the metadata.

Parameters:
  • path (str) – target file location
  • restore_index (bool) – store index in DataFrame. Default is True.
Returns:

df, the DataFrame read from disk

Return type:

pd.DataFrame

eskapade.analysis.links.read_to_df.numpy_reader(path, restore_index, file_type)¶

Read a numpy file from disk into a DataFrame, restoring the metadata.

Parameters:
  • path (str) – target file location
  • restore_index (bool) – store index in DataFrame. Default is True.
  • file_type (str) – the file type used, one of {‘npy’, ‘npz’}
Raises:
  • AmbiguousFileType – when it cannot be determined whether the file type is npy or npz
  • UnhandledFileType – generic catch-all for when the file-type logic fails to handle a case
Returns:

df, the DataFrame read from disk

Return type:

pd.DataFrame

eskapade.analysis.links.read_to_df.set_reader(path, reader, *args, **kwargs)¶

Pick the correct reader.

Based on provided reader setting, or based on file extension.
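
The selection logic described here (explicit setting first, file extension otherwise) might look roughly like the sketch below. The function pick_reader and the READERS table are hypothetical and only cover a few pandas readers. The real set_reader also handles the numpy and feather readers, and its signature differs.

```python
import os

import pandas as pd

# Hypothetical lookup table from reader name / file extension to pandas reader.
READERS = {"csv": pd.read_csv, "xlsx": pd.read_excel, "json": pd.read_json}


def pick_reader(path, reader=None):
    """Return a pandas reader from an explicit setting or the file extension."""
    if reader is None:
        # No explicit setting: fall back to the extension, without the dot.
        reader = os.path.splitext(path)[1].lstrip(".").lower()
    if reader not in READERS:
        raise ValueError(f"no reader known for '{reader}'")
    return READERS[reader]
```

For example, pick_reader("data/input.csv") returns pd.read_csv, while pick_reader("input.dat", reader="json") returns pd.read_json regardless of the extension.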

eskapade.analysis.links.record_factorizer module¶

Project: Eskapade - A python-based package for data analysis.

Class: RecordFactorizer

Created: 2016/11/08

Description:
Algorithm to perform the factorization of an input column of an input dataframe. E.g. a column x with values ‘apple’, ‘tree’, ‘pear’, ‘apple’, ‘pear’ is transformed into a column x with values 0, 1, 2, 0, 2.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.record_factorizer.RecordFactorizer(**kwargs)¶

Bases: escore.core.element.Link

Factorize data-frame columns.

Perform factorization of an input column of an input dataframe. E.g. a column x with values ‘apple’, ‘tree’, ‘pear’, ‘apple’, ‘pear’ is transformed into a column x with values 0, 1, 2, 0, 2. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to their original format.

__init__(**kwargs)¶

Initialize link instance.

Store and do basic check on the attributes of link RecordFactorizer.

Parameters:
  • read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
  • columns (list) – list of columns that are to be factorized
  • inplace (bool) – replace original columns. Default is False. If True, store_key is set to read_key.
  • convert_all_categories (bool) – if true, convert all category observables. Default is false.
  • convert_all_booleans (bool) – if true, convert all boolean observables. Default is false.
  • map_to_original (dict) – dictionary, or key to dictionary, to map factorized columns back to original. map_to_original is a dict of dicts, one dict for each column.
  • store_key (str) – store key of output dataframe. Default is read_key + ‘_fact’. (optional)
  • sk_map_to_original (str) – store key of dictionary to map factorized columns to original. Default is ‘key’ + ‘_’ + store_key + ‘_to_original’. (optional)
  • sk_map_to_factorized (str) – store key of dictionary to map original to factorized columns. Default is ‘key’ + ‘_’ + read_key + ‘_to_factorized’. (optional)
execute()¶

Execute the link.

Perform factorization of the input columns ‘columns’ of the input dataframe. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to their original format.

initialize()¶

Initialize the link.

Initialize and (further) check the assigned attributes of the RecordFactorizer
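
The apple/tree/pear example from the description can be reproduced with pandas.factorize, which may or may not be what the link uses internally. The mapping dictionaries below are illustrative stand-ins for the objects stored under sk_map_to_original and sk_map_to_factorized, and their exact layout in Eskapade is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"x": ["apple", "tree", "pear", "apple", "pear"]})

# codes holds one integer per record; uniques holds the original values.
codes, uniques = pd.factorize(df["x"])

df_fact = df.copy()
df_fact["x"] = codes

# Mappings in both directions, for one column.
to_original = dict(enumerate(uniques))
to_factorized = {label: code for code, label in to_original.items()}

# Mapping the factorized column back recovers the original values.
restored = df_fact["x"].map(to_original)
```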

eskapade.analysis.links.record_vectorizer module¶

Project: Eskapade - A python-based package for data analysis.

Class: RecordVectorizer

Created: 2016/11/08

Description:
Algorithm to perform the vectorization of an input column of an input dataframe.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.record_vectorizer.RecordVectorizer(**kwargs)¶

Bases: escore.core.element.Link

Vectorize data-frame columns.

Perform vectorization of an input column of an input dataframe. E.g. a column x with values 1, 2 is transformed into columns x_1 and x_2, with values True or False assigned per record.

__init__(**kwargs)¶

Initialize link instance.

Store and do basic check on the attributes of link RecordVectorizer.

Parameters:
  • read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
  • columns (list) – list of columns that are to be vectorized
  • store_key (str) – store key of output dataFrame. Default is read_key + ‘_vectorized’. (optional)
  • column_compare_with (dict) – dict of unique items per column with which column values are compared. If not given, this is derived automatically from the column. (optional)
  • astype (type) – store answer of comparison of column with value as certain type. Default is bool. (optional)
execute()¶

Execute the link.

Perform vectorization of the input columns (‘columns’) of the input dataframe. The resulting dataset is stored as a new dataset.

initialize()¶

Initialize the link.

Initialize and (further) check the assigned attributes of RecordVectorizer.

eskapade.analysis.links.record_vectorizer.record_vectorizer(df, column_to_vectorize, column_compare_set, astype=<class 'bool'>)¶

Vectorize data-frame column.

Takes the new record that is already transformed and vectorizes the given columns.

Parameters:
  • df – dataframe of the new record to vectorize
  • column_to_vectorize (str) – string, column in the new record to vectorize.
  • column_compare_set (list) – list of values to compare the column with.
Returns:

dataframe of the new records.
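
A minimal sketch of the vectorization idea, assuming each value in the compare set yields one new column named column_value. The helper vectorize_column is hypothetical and is not the record_vectorizer function documented above. Whether the original column is kept or dropped by the real link is not stated, so it is kept here.

```python
import pandas as pd


def vectorize_column(df, column, compare_set, astype=bool):
    """Add one comparison column per value in compare_set (illustrative only)."""
    out = df.copy()
    for value in compare_set:
        # Each record is compared with the value and the result is cast to astype.
        out[f"{column}_{value}"] = (out[column] == value).astype(astype)
    return out


df = pd.DataFrame({"x": [1, 2, 1]})
vectorized = vectorize_column(df, "x", [1, 2])
```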

eskapade.analysis.links.value_counter module¶

Project: Eskapade - A python-based package for data analysis.

Class: ValueCounter

Created: 2017/03/02

Description:
Algorithm to do value_counts() on single columns of a pandas dataframe, or groupby().size() on multiple columns, both returned as dictionaries. It is possible to do cleaning of these dicts by rejecting certain keys or removing inconsistent data types. Numeric and timestamp columns are converted to bin indices before the binning is applied. Results are stored as 1D Histograms or as ValueCounts objects.
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.value_counter.ValueCounter(**kwargs)¶

Bases: eskapade.analysis.histogram_filling.HistogramFillerBase

Count values in Pandas data frame.

ValueCounter does value_counts() on single columns of a pandas dataframe, or groupby().size() on multiple columns. Results of both are returned as same-style dictionaries.

Numeric and timestamp columns are converted to bin indices before the binning is applied. The binning can be provided as input.

It is possible to do cleaning of these dicts by rejecting certain keys or removing inconsistent data types. Results are stored as 1D Histograms or as ValueCounts objects.

Example is available in: tutorials/esk302_histogram_filling_plotting.py

__init__(**kwargs)¶

Initialize link instance.

Parameters:
  • name (str) – name of link
  • read_key (str) – key of input data to read from data store
  • store_key_counts (str) – key of output data to store ValueCounts objects in data store
  • store_key_hists (str) – key of output data to store histograms in data store
  • columns (list) – columns to pick up from input data (default is all columns)
  • bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns

Example bin_specs dictionary is:

>>> import numpy as np
>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
...              'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]},
...              'date': {'bin_width': np.timedelta64(30, 'D'),
...                       'bin_offset': np.datetime64('2010-01-04')}}
Parameters:
  • var_dtype (dict) – dict of datatypes of the columns to study from dataframe. If not provided, try to determine datatypes directly from dataframe.
  • store_at_finalize (bool) – store histograms and/or ValueCounts objects in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
  • drop_inconsistent_key_types (bool) – clean up histograms and/or ValueCounts objects by removing all bins/keys with inconsistent datatypes. By default, compare with data types in var_dtype dictionary.
  • drop_keys (dict) – dictionary used for dropping specific keys from created value_counts dictionaries

Example drop_keys dictionary is:

>>> drop_keys = {'x': [1, 4, 8, 19],
...              'y': ['apple', 'pear', 'tomato'],
...              'x:y': [(1, 'apple'), (19, 'tomato')]}
drop_inconsistent_keys(columns, obj)¶

Drop inconsistent keys.

Drop inconsistent keys from a ValueCounts or Histogram object.

Parameters:
  • columns (list) – columns key to retrieve desired datatypes
  • obj (object) – ValueCounts or Histogram object to drop inconsistent keys from
fill_histogram(idf, columns)¶

Fill input histogram with column(s) of input dataframe.

Parameters:
  • idf – input data frame used for filling histogram
  • columns (list) – histogram column(s)
finalize()¶

Finalize ValueCounter.

initialize()¶

Initialize the link.

process_and_store()¶

Make, clean, and store ValueCount objects.

process_columns(df)¶

Process columns before histogram filling.

Specifically, timestamp columns are converted to integers and numeric variables are converted to indices.

Parameters: df – input (pandas) data frame
Returns: output (pandas) data frame with converted timestamp columns
Return type: pandas DataFrame
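
The counting operations named in the class description can be tried directly in pandas. The sketch below shows value_counts() on a single column, groupby().size() on two columns, and a simple bin-index conversion that assumes the index is floor((x - bin_offset) / bin_width). That formula is an assumption for illustration, as is the example data, and the real link stores its results as ValueCounts or histogram objects rather than plain dicts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pear", "apple"],
                   "x": [0.5, 1.2, 2.7]})

# Single column: counts per distinct value, as a dictionary.
single_counts = df["fruit"].value_counts().to_dict()

# Numeric column: convert values to bin indices (assumed formula).
bin_width, bin_offset = 1, 0
df["x_bin"] = np.floor((df["x"] - bin_offset) / bin_width).astype(int)

# Multiple columns: counts per combination of values, keyed by tuples.
pair_counts = df.groupby(["fruit", "x_bin"]).size().to_dict()
```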

eskapade.analysis.links.write_from_df module¶

Project: Eskapade - A python-based package for data analysis.

Class: WriteFromDf

Created: 2016/11/08

Description:
Algorithm to write a DataFrame from the DataStore to disk
Authors:
KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

class eskapade.analysis.links.write_from_df.WriteFromDf(**kwargs)¶

Bases: escore.core.element.Link

Write a DataFrame from the DataStore to disk.

__init__(**kwargs)¶

Store the configuration of the link.

Parameters:
  • name (str) – Name given to the link
  • key (str) – the DataStore key
  • path (str) – path where to save the DataFrame
  • writer – file extension that can be written by a pandas writer function of pd.DataFrame, or by the numpy or feather writers. For example, ‘csv’ will trigger DataFrame.to_csv. To use numpy_writer, specify one of the following:

{‘numpy’, ‘np’, ‘npy’, ‘npz’}

To use the feather writer, specify one of: {‘feather’, ‘ft’}. If writer is not passed, the path must contain a known file extension. Valid numpy extensions are {‘npy’, ‘npz’}; the valid feather extension is {‘ft’}.

Note:

The numpy and feather writers preserve metadata such as the dtype of each column, and the index if it is non-numeric.

Parameters:
  • dictionary (dict) – dict of keys (as in the key argument above) and paths (as in the path argument above). All keys are written out to their associated paths.
  • add_counter_to_name (bool) – if true, add an index to the output file name. Useful when running in loops. Default is false.
  • store_index (bool) – whether the index should be stored as metadata. Default is False unless the index is non-numeric.
  • kwargs – all other keyword arguments are passed on to the pandas writers.
execute()¶

Execute the link.

Pick up the dataframe and write to disk.

initialize()¶

Initialize the link.

eskapade.analysis.links.write_from_df.feather_writer(df, path, store_index)¶

Write df to disk in feather format, preserving the metadata.

Parameters:
  • df (DataFrame) – pandas DataFrame to write out
  • path (str) – target file location
  • store_index (bool) – store index in DataFrame. Default is True.
eskapade.analysis.links.write_from_df.get_writer(path, writer, *args, **kwargs)¶

Pick the correct writer.

Based on provided writer setting, or based on file extension.
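
As noted for the writer parameter, an extension such as ‘csv’ triggers DataFrame.to_csv. One way to picture that dispatch is a lookup of the to_<extension> method on the dataframe, as in the sketch below. The helper pick_writer is hypothetical, differs from get_writer in signature, and ignores the numpy and feather writers.

```python
import os
import tempfile

import pandas as pd


def pick_writer(df, path, writer=None):
    """Return the dataframe's to_<writer> method (illustrative only)."""
    if writer is None:
        # No explicit setting: use the file extension, without the dot.
        writer = os.path.splitext(path)[1].lstrip(".").lower()
    method = getattr(df, "to_" + writer, None)
    if method is None:
        raise ValueError(f"no pandas writer known for '{writer}'")
    return method


df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")
    # Extra keyword arguments go straight to the pandas writer.
    pick_writer(df, path)(path, index=False)
    restored = pd.read_csv(path)
```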

eskapade.analysis.links.write_from_df.numpy_writer(df, path, store_index)¶

Write df to disk in numpy format, preserving the metadata.

Parameters:
  • df (DataFrame) – pandas Dataframe to write out
  • path (str) – target file location
  • store_index (bool) – store index in DataFrame

Module contents¶

The package namespace also exposes the following link classes directly. Their documentation is identical to that of the corresponding submodule entries above:

  • class eskapade.analysis.links.ApplyFuncToDf(**kwargs)
  • class eskapade.analysis.links.ApplySelectionToDf(**kwargs)
  • class eskapade.analysis.links.BasicGenerator(**kwargs)
  • class eskapade.analysis.links.DfConcatenator(**kwargs)
  • class eskapade.analysis.links.DfMerger(**kwargs)
  • class eskapade.analysis.links.RandomSampleSplitter(**kwargs)

class eskapade.analysis.links.ReadToDf(**kwargs)¶

Bases: escore.core.element.Link

Reads input file(s) to a pandas dataframe.

Give the link the path of the input file and, optionally, keyword arguments. The keyword arguments are passed on to the pandas file reader.

__init__(**kwargs)¶

Initialize link instance.

Store the configuration of link ReadToDf.

Parameters:
  • name (str) – Name given to the link
  • path (str) – path of your file to read into a pandas DataFrame.
  • key (str) – storage key for the DataStore.
  • reader – determined automatically, but can be set by hand, e.g. csv, xlsx. To use the numpy reader, one of the following should be true:
  • reader is {‘numpy’, ‘np’, ‘npy’, ‘npz’}
  • path contains extensions {‘npy’, ‘npz’}
  • param file_type is {‘npy’, ‘npz’}

To use the feather reader one of the following should be true:

  • reader is {‘feather’, ‘ft’}
  • path contains extensions ‘ft’

For when to use feather or which numpy type, see the esk210_dataframe_restoration tutorial.

Parameters:
  • restore_index (bool) – whether to store the index in the metadata. Default is False when the index is numeric, True otherwise.
  • file_type (str) – {‘npy’, ‘npz’} when using the numpy reader. Optional, see reader for details.
  • itr_over_files (bool) – iterate over individual files. Default is False. If False, all files are collected in one dataframe. NB chunksize takes priority!
  • chunksize (int) – Default is None. If set to a positive integer, the link will always iterate. chunksize requires pd.read_csv or pd.read_table.
  • n_files_in_fork (int) – number of files to process if forked. Default is 1.
  • kwargs – all other key word arguments are passed on to the pandas reader.
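
The chunksize behaviour builds on plain pandas: with chunksize set, pd.read_csv returns an iterator of dataframes instead of one dataframe. The sketch below shows only that pandas mechanism on an in-memory CSV (the data are made up). How the link stores each chunk in the datastore is not shown here.

```python
import io
import pandas as pd

# Hypothetical CSV content with five records
csv_text = "a,b\n1,x\n2,y\n3,z\n4,w\n5,v\n"

# With chunksize set, read_csv yields dataframes of at most two rows each
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

chunk_lengths = []
for chunk in reader:
    chunk_lengths.append(len(chunk))  # comparable to latest_data_length()

total_length = sum(chunk_lengths)     # comparable to sum_data_length()
```
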

config_lock¶

Get lock status of configuration

Default lock status is False.

Returns:lock status of configuration
Return type:bool
configure_paths(lock: bool = False) → None¶

Configure paths used during execute.

This is the final part of initialization, and needs to be redone in case of forked processing. Hence this function is split off into a separate function. The function can be locked once the configuration is final.

Parameters:lock (bool) – if True, lock this part of the configuration
execute()¶

Execute the link.

Reads the input file(s) and puts the dataframe in the datastore.

initialize()¶

Initialize the link.

is_finished() → bool¶

Try to assess if looper is done iterating over files.

Assess if looper is done or if a next dataset is still coming up.

latest_data_length()¶

Return length of current dataset.

set_chunk_size(size)¶

Set chunksize setting.

Parameters:size – chunk size
sum_data_length()¶

Return the summed length of all datasets processed so far.

class eskapade.analysis.links.RecordFactorizer(**kwargs)¶

Bases: escore.core.element.Link

Factorize data-frame columns.

Perform factorization of input column of an input dataframe. E.g. a column x with values ‘apple’, ‘tree’, ‘pear’, ‘apple’, ‘pear’ is transformed into a column x with values 0, 1, 2, 0, 2, etc. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to the original format.

__init__(**kwargs)¶

Initialize link instance.

Store and do basic check on the attributes of link RecordFactorizer

Parameters:
  • read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
  • columns (list) – list of columns that are to be factorized
  • inplace (bool) – replace original columns. Default is False. Overwrites store_key to read_key.
  • convert_all_categories (bool) – if true, convert all category observables. Default is false.
  • convert_all_booleans (bool) – if true, convert all boolean observables. Default is false.
  • map_to_original (dict) – dictionary or key to dictionary to map back factorized columns to original. map_to_original is a dict of dicts, one dict for each column.
  • store_key (str) – store key of output dataFrame. Default is read_key + ‘_fact’. (optional)
  • sk_map_to_original (str) – store key of dictionary to map factorized columns to original. Default is ‘key’ + ‘_’ + store_key + ‘_to_original’. (optional)
  • sk_map_to_factorized (str) – store key of dictionary to map original to factorized columns. Default is ‘key’ + ‘_’ + read_key + ‘_to_factorized’. (optional)
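
The factorization in the class description, and the two mapping dictionaries, can be reproduced with pandas.factorize. This is a minimal sketch of the idea, assuming one plain dict per column; the variable names are hypothetical and the link's internal bookkeeping may differ.

```python
import pandas as pd

df = pd.DataFrame({'x': ['apple', 'tree', 'pear', 'apple', 'pear']})

# codes holds one integer per record, uniques the distinct values in order
codes, uniques = pd.factorize(df['x'])

# One mapping per direction for column 'x'
to_original = dict(enumerate(uniques))
to_factorized = {value: code for code, value in to_original.items()}

df_fact = df.copy()
df_fact['x'] = codes

# Map the factorized column back to its original values
restored = df_fact['x'].map(to_original)
```
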
execute()¶

Execute the link.

Perform factorization of input columns ‘columns’ of input dataframe. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to the original format.

initialize()¶

Initialize the link.

Initialize and (further) check the assigned attributes of the RecordFactorizer

class eskapade.analysis.links.RecordVectorizer(**kwargs)¶

Bases: escore.core.element.Link

Vectorize data-frame columns.

Perform vectorization of input column of an input dataframe. E.g. a column x with values 1, 2 is transformed into columns x_1 and x_2, with values True or False assigned per record.

__init__(**kwargs)¶

Initialize link instance.

Store and do basic check on the attributes of link RecordVectorizer.

Parameters:
  • read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
  • columns (list) – list of columns that are to be vectorized
  • store_key (str) – store key of output dataFrame. Default is read_key + ‘_vectorized’. (optional)
  • column_compare_with (dict) – dict of unique items per column with which column values are compared. If not given, this is derived automatically from the column. (optional)
  • astype (type) – store answer of comparison of column with value as certain type. Default is bool. (optional)
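
A pandas-only sketch of the vectorization described above: each unique value is compared with the column, and the result of the comparison is stored as a new column. The `<column>_<value>` naming follows the x_1, x_2 example in the class description. Whether the original column is kept in the output is not specified here, so this sketch simply keeps it.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2, 1]})
column = 'x'

# Values to compare with; here derived from the column itself
compare_with = sorted(df[column].unique())
astype = bool  # type in which the comparison result is stored

df_vec = df.copy()
for value in compare_with:
    df_vec['{}_{}'.format(column, value)] = (df[column] == value).astype(astype)
```

Passing `astype = int` instead would give 0/1 indicator columns.
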
execute()¶

Execute the link.

Perform vectorization of input column ‘column’ of input dataframe. The resulting dataset is stored as a new dataset.

initialize()¶

Initialize the link.

Initialize and (further) check the assigned attributes of RecordVectorizer.

class eskapade.analysis.links.ValueCounter(**kwargs)¶

Bases: eskapade.analysis.histogram_filling.HistogramFillerBase

Count values in Pandas data frame.

ValueCounter does value_counts() on single columns of a pandas dataframe, or groupby().size() on multiple columns. Results of both are returned as same-style dictionaries.
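
The two pandas operations named above look as follows on a small, made-up dataframe. The sketch only shows the raw pandas results converted to dictionaries; the exact key format ValueCounter uses for its same-style dictionaries is not shown here.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2], 'y': ['apple', 'pear', 'apple']})

# Single column: value -> count
single = df['x'].value_counts().to_dict()

# Multiple columns: (value_x, value_y) -> count
multi = df.groupby(['x', 'y']).size().to_dict()
```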

Numeric and timestamp columns are converted to bin indices before the binning is applied. The binning can be provided as input.

It is possible to do cleaning of these dicts by rejecting certain keys or removing inconsistent data types. Results are stored as 1D Histograms or as ValueCounts objects.

Example is available in: tutorials/esk302_histogram_filling_plotting.py

__init__(**kwargs)¶

Initialize link instance.

Parameters:
  • name (str) – name of link
  • read_key (str) – key of input data to read from data store
  • store_key_counts (str) – key of output data to store ValueCounts objects in data store
  • store_key_hists (str) – key of output data to store histograms in data store
  • columns (list) – columns to pick up from input data (default is all columns)
  • bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns

Example bin_specs dictionary is:

>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
...              'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]},
...              'date': {'bin_width': np.timedelta64(30, 'D'),
...                       'bin_offset': np.datetime64('2010-01-04')}}
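
For fixed-width bins, a value can be converted to a bin index from bin_width and bin_offset. The sketch below assumes the common convention index = floor((value - bin_offset) / bin_width); the helper name to_bin_index is hypothetical, and the convention is an assumption rather than a statement about the link's code.

```python
import numpy as np

def to_bin_index(values, bin_width, bin_offset=0):
    """Map values to integer indices of fixed-width bins (assumed convention)."""
    return np.floor((np.asarray(values) - bin_offset) / bin_width).astype(int)

# Numeric column with bins of width 1 starting at 0
x_idx = to_bin_index([0.2, 1.7, 3.0], bin_width=1, bin_offset=0)

# Date column with 30-day bins starting on 2010-01-04
dates = np.array(['2010-01-04', '2010-02-10'], dtype='datetime64[D]')
date_idx = to_bin_index(dates, np.timedelta64(30, 'D'), np.datetime64('2010-01-04'))
```

Dividing one timedelta by another in numpy gives a plain float, which is why the same helper works for both numeric and date columns.
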
Parameters:
  • var_dtype (dict) – dict of datatypes of the columns to study from dataframe. If not provided, try to determine datatypes directly from dataframe.
  • store_at_finalize (bool) – Store histograms and/or ValueCount object in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
  • drop_inconsistent_key_types (bool) – clean up histograms and/or ValueCount objects by removing all bins/keys with inconsistent datatypes. By default compare with data types in var_dtype dictionary.
  • drop_keys (dict) – dictionary used for dropping specific keys from created value_counts dictionaries

Example drop_keys dictionary is:

>>> drop_keys = {'x': [1, 4, 8, 19],
...              'y': ['apple', 'pear', 'tomato'],
...              'x:y': [(1, 'apple'), (19, 'tomato')]}
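
The effect of such a drop_keys dictionary can be sketched in plain Python. The counts below are made up, and storing the value counts as one dict per column name is an assumption made for this example.

```python
# Hypothetical value counts per column (or column combination)
counts = {
    'x': {1: 5, 2: 3, 4: 1},
    'x:y': {(1, 'apple'): 2, (2, 'pear'): 3, (19, 'tomato'): 1},
}
drop_keys = {'x': [1, 4, 8, 19],
             'x:y': [(1, 'apple'), (19, 'tomato')]}

# Keep only the keys that are not listed for that column
cleaned = {
    name: {key: n for key, n in value_counts.items()
           if key not in drop_keys.get(name, [])}
    for name, value_counts in counts.items()
}
```
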
drop_inconsistent_keys(columns, obj)¶

Drop inconsistent keys.

Drop inconsistent keys from a ValueCounts or Histogram object.

Parameters:
  • columns (list) – columns key to retrieve desired datatypes
  • obj (object) – ValueCounts or Histogram object to drop inconsistent keys from
fill_histogram(idf, columns)¶

Fill input histogram with column(s) of input dataframe.

Parameters:
  • idf – input data frame used for filling histogram
  • columns (list) – histogram column(s)
finalize()¶

Finalize ValueCounter.

initialize()¶

Initialize the link.

process_and_store()¶

Make, clean, and store ValueCount objects.

process_columns(df)¶

Process columns before histogram filling.

Specifically, timestamp columns are converted to integers and numeric variables are converted to indices.

Parameters:df – input (pandas) data frame
Returns:output (pandas) data frame with converted timestamp columns
Return type:pandas DataFrame
class eskapade.analysis.links.WriteFromDf(**kwargs)¶

Bases: escore.core.element.Link

Write a DataFrame from the DataStore to disk.

__init__(**kwargs)¶

Store the configuration of the link.

Parameters:
  • name (str) – Name given to the link
  • key (str) – the DataStore key
  • path (str) – path where to save the DataFrame
  • writer – file extension that can be written by a pandas writer function from pd.DataFrame, or by the numpy or feather writers. For example, ‘csv’ will trigger DataFrame.to_csv. To use the numpy writer, specify one of the following:

{‘numpy’, ‘np’, ‘npy’, ‘npz’}

To use feather, specify one of: {‘feather’, ‘ft’}. If writer is not passed, the path must contain a known file extension. Valid numpy extensions are {‘npy’, ‘npz’}; the valid feather extension is {‘ft’}.

Note:

The numpy and feather writers preserve metadata such as the dtype of each column, and the index if it is non-numeric.

Parameters:
  • dictionary (dict) – dict of keys and paths (both as in the arguments above). All keys are written out to their associated paths.
  • add_counter_to_name (bool) – if true, add an index to the output file name. Useful when running in loops. Default is false.
  • store_index (bool) – whether the index should be stored as metadata. Default is False unless the index is non-numeric
  • kwargs – all other key word arguments are passed on to the pandas writers.
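
The mapping from a writer name such as ‘csv’ to DataFrame.to_csv can be sketched as a lookup of the pandas method `to_<writer>`. The helper write_df and its fallback on the file extension are hypothetical and only illustrate the idea. The numpy and feather writers are not covered.

```python
import os
import tempfile

import pandas as pd

def write_df(df, path, writer=None, **kwargs):
    """Write df with the pandas method DataFrame.to_<writer> (illustrative helper)."""
    if writer is None:
        # Fall back on the file extension, e.g. 'out.csv' -> 'csv'
        writer = os.path.splitext(path)[1].lstrip('.')
    # Remaining keyword arguments are passed on to the pandas writer
    getattr(df, 'to_' + writer)(path, **kwargs)

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
out_path = os.path.join(tempfile.mkdtemp(), 'out.csv')

write_df(df, out_path, index=False)
round_trip = pd.read_csv(out_path)
```
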
execute()¶

Execute the link.

Pick up the dataframe and write to disk.

initialize()¶

Initialize the link.

class eskapade.analysis.links.HistogrammarFiller(**kwargs)¶

Bases: eskapade.analysis.histogram_filling.HistogramFillerBase

Fill histogrammar sparse-bin histograms.

Algorithm to fill histogrammar style sparse-bin and category histograms.

It is possible to do after-filling cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied. Final histograms are stored in the datastore.
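
The conversion of timestamps to nanoseconds can be illustrated with numpy alone. In this sketch the integer values are nanoseconds since 1970-01-01, and the one-day bin width is an arbitrary choice for the example.

```python
import numpy as np

stamps = np.array(['2010-01-04', '2010-01-05'], dtype='datetime64[ns]')

# Nanoseconds since the Unix epoch, as plain 64-bit integers
as_ns = stamps.astype('int64')

# A bin width of one day, also expressed in nanoseconds
bin_width_ns = np.timedelta64(1, 'D').astype('timedelta64[ns]').astype('int64')

# After the conversion, binning is ordinary integer arithmetic
bin_idx = as_ns // bin_width_ns
```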

Example is available in: tutorials/esk303_hgr_filler_plotter.py

__init__(**kwargs)¶

Initialize link instance.

Store and do basic check on the attributes of link HistogrammarFiller.

Parameters:
  • name (str) – name of link
  • read_key (str) – key of input data to read from data store
  • store_key (str) – key of output data to store histograms in data store
  • columns (list) – columns to pick up from input data (default is all columns)
  • bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns

Example bin_specs dictionary is:

>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
                 'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}}
Parameters:
  • var_dtype (dict) – dict of datatypes of the columns to study from dataframe. If not provided, try to determine datatypes directly from dataframe.
  • quantity (dict) – dictionary of lambda functions that specify how to parse certain columns

Example quantity dictionary is:

>>> quantity = {'y': lambda x: x}
Parameters:
  • store_at_finalize (bool) – Store histograms in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
  • drop_keys (dict) – dictionary used for dropping specific keys from bins dictionaries of histograms

Example drop_keys dictionary is:

>>> drop_keys = {'x': [1, 4, 8, 19],
                 'y': ['apple', 'pear', 'tomato'],
                 'x:y': [(1, 'apple'), (19, 'tomato')]}
construct_empty_hist(columns)¶

Create an (empty) histogram of right type.

Create a multi-dim histogram by iterating through the columns in reverse order and passing a single-dim hist as input to the next column.

Parameters:columns (list) – histogram columns
Returns:created histogram
Return type:histogrammar.Count
fill_histogram(idf, columns)¶

Fill input histogram with column(s) of input dataframe.

Parameters:
  • idf – input data frame used for filling histogram
  • columns (list) – histogram column(s)
process_and_store()¶

Process and store histogrammar objects.

© Copyright 2018, KPMG Advisory N.V. Revision 8659c635.