eskapade.analysis.links package¶
Submodules¶
eskapade.analysis.links.apply_func_to_df module¶
Project: Eskapade - A python-based package for data analysis.
Class: ApplyFuncToDf
Created: 2016/11/08
- Description:
- Algorithm to apply one or more functions to a (grouped) dataframe column or to an entire dataframe.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.apply_func_to_df.ApplyFuncToDf(**kwargs)¶
Bases: escore.core.element.Link
Apply functions to a dataframe.
Applies one or more functions to a (grouped) dataframe column or to an entire dataframe. In the latter case, this can be done row-wise or column-wise. The input dataframe is overwritten.
__init__(**kwargs)¶ Initialize link instance.
Parameters: - read_key (str) – data-store input key
- store_key (str) – data-store output key
- apply_funcs (list) – functions to apply (list of dicts):
  - ‘func’: function to apply
  - ‘colout’ (str): output column
  - ‘colin’ (str, optional): input column
  - ‘entire’ (bool, optional): apply to the entire dataframe?
  - ‘args’ (tuple, optional): positional arguments for ‘func’
  - ‘kwargs’ (dict, optional): keyword arguments for ‘func’
  - ‘groupby’ (list, optional): column names to group by
  - ‘groupbyColout’ (str): output column after the split-apply-combine operation
- add_columns (dict) – columns to add to output (name, column)
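The apply_funcs configuration can be illustrated with a minimal pure-Python sketch: a dict of column lists stands in for the dataframe, and each spec maps an input column through its function into an output column. All names here are illustrative, not the link's exact internals.

```python
# Minimal sketch of the apply_funcs semantics: each dict maps an input
# column ('colin') through a function ('func') into an output column
# ('colout'). A dict of column lists stands in for a pandas dataframe.
df = {"x": [1, 2, 3]}

apply_funcs = [
    {"func": lambda v: v * 10, "colin": "x", "colout": "x10"},
]

for spec in apply_funcs:
    fn = spec["func"]
    col_in = spec.get("colin")
    # Apply the function element-wise to the input column and store
    # the result under the requested output column name.
    df[spec["colout"]] = [fn(v) for v in df[col_in]]

print(df["x10"])  # [10, 20, 30]
```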
add_apply_func(func, out_column, in_column='', *args, **kwargs)¶ Add function to be applied to dataframe.
execute()¶ Execute the link.
groupbyapply(df, groupby_columns, applyfunc, *args, **kwargs)¶ Apply groupby to dataframe.
initialize()¶ Initialize the link.
eskapade.analysis.links.apply_selection_to_df module¶
Project: Eskapade - A python-based package for data analysis.
Class: ApplySelectionToDf
Created: 2016/11/08
- Description:
- Algorithm to apply queries to input dataframe
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.apply_selection_to_df.ApplySelectionToDf(**kwargs)¶
Bases: escore.core.element.Link
Applies queries with sub-selections to a pandas dataframe.
__init__(**kwargs)¶ Initialize link instance.
The input dataframe is not overwritten, unless instructed to do so in kwargs.
Parameters: - name (str) – name of link
- read_key (str) – key of data to read from data store
- store_key (str) – key of data to store in data store. If not set, read_key is overwritten.
- query_set (list) – list of strings, query expressions to evaluate in the same order, see pandas documentation
- select_columns (list) – column names to select after querying
- continue_if_failure (bool) – if True continues with next query after failure (optional)
- kwargs – all other keyword arguments are passed on to the pandas queries.
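The ordered query_set semantics can be sketched in pure Python: each expression further filters the records that survived the previous one. The records, expressions, and use of eval() here stand in for the pandas query engine and are illustrative only.

```python
# Sketch of applying a query_set in order: each expression further
# filters the surviving records. Plain dicts stand in for dataframe
# rows; eval() stands in for the pandas query engine.
records = [
    {"age": 25, "city": "Amstelveen"},
    {"age": 41, "city": "Amsterdam"},
    {"age": 35, "city": "Amstelveen"},
]
query_set = ["r['age'] > 30", "r['city'] == 'Amstelveen'"]

for expr in query_set:
    kept = []
    for r in records:
        if eval(expr):  # keep only records satisfying this query
            kept.append(r)
    records = kept

print(len(records))  # 1
```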
execute()¶ Execute the link.
Applies queries or column selection to a pandas DataFrame. The input dataframe is not overwritten, unless instructed to do so in kwargs.
- Apply queries, in order of the provided query list.
- Select columns (if provided).
initialize()¶ Initialize the link.
Perform checks on provided attributes.
eskapade.analysis.links.basic_generator module¶
Project: Eskapade - A python-based package for data analysis.
Class: BasicGenerator
Created: 2017/02/26
- Description:
- Link to generate random data with basic distributions
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.basic_generator.BasicGenerator(**kwargs)¶
Bases: escore.core.element.Link
Generate data with basic distributions.
__init__(**kwargs)¶ Initialize link instance.
Parameters: - key (str) – key of output data in data store
- columns (list) – output column names
- size (int) – number of variable values
- gen_config (dict) – generator configuration for each variable
- gen_seed (int) – generator random seed
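The relation between gen_config, size, and gen_seed can be sketched with the standard-library random module. The per-column gen_config layout below is an assumption for illustration, not the link's exact schema.

```python
import random

# Sketch of seeded per-column generation: one distribution per output
# column, 'size' values each, reproducible via 'gen_seed'. The
# gen_config layout here is illustrative only.
gen_seed = 42
size = 5
gen_config = {
    "x": lambda rng: rng.gauss(0.0, 1.0),     # normal distribution
    "y": lambda rng: rng.uniform(0.0, 10.0),  # uniform distribution
}

rng = random.Random(gen_seed)
data = {col: [gen(rng) for _ in range(size)] for col, gen in gen_config.items()}

print(len(data["x"]), len(data["y"]))  # 5 5
```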
execute()¶ Execute the link.
initialize()¶ Initialize the link.
eskapade.analysis.links.df_concatenator module¶
Project: Eskapade - A python-based package for data analysis.
Class: DfConcatenator
Created: 2016/11/08
- Description:
- Algorithm to concatenate multiple pandas dataframes
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.df_concatenator.DfConcatenator(**kwargs)¶
Bases: escore.core.element.Link
Concatenates multiple pandas dataframes.
__init__(**kwargs)¶ Initialize link instance.
Parameters: - name (str) – name of link
- store_key (str) – key of data to store in data store
- read_keys (list) – keys of pandas dataframes in the data store
- ignore_missing_input (bool) – Skip missing input datasets. If all missing, store empty dataset. Default is false.
- kwargs – all other keyword arguments are passed on to the pandas concat function.
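The read_keys / ignore_missing_input logic can be sketched in pure Python, with record lists standing in for dataframes and a dict standing in for the datastore; key names are illustrative only.

```python
# Sketch of DfConcatenator: collect datasets for the given read_keys
# from a datastore and concatenate those that exist.
datastore = {
    "df1": [{"a": 1}, {"a": 2}],
    "df2": [{"a": 3}],
}
read_keys = ["df1", "missing", "df2"]
ignore_missing_input = True

frames = []
for key in read_keys:
    if key not in datastore:
        if ignore_missing_input:
            continue  # skip missing input datasets
        raise KeyError(key)
    frames.append(datastore[key])

# Stand-in for pd.concat: chain the record lists together.
result = [rec for frame in frames for rec in frame]
print(len(result))  # 3
```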
execute()¶ Execute the link.
Perform concatenation of multiple pandas dataframes.
initialize()¶ Initialize the link.
eskapade.analysis.links.df_merger module¶
Project: Eskapade - A python-based package for data analysis.
Class: DfMerger
Created: 2016/11/08
- Description:
- Algorithm to merge two pandas DataFrames
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.df_merger.DfMerger(**kwargs)¶
Bases: escore.core.element.Link
Merges two pandas dataframes.
__init__(**kwargs)¶ Initialize link instance.
Store the configuration of the link.
Parameters: - name (str) – name of link
- input_collection1 (str) – datastore key of the first pandas.DataFrame to merge
- input_collection2 (str) – datastore key of the second pandas.DataFrame to merge
- output_collection (str) – datastore key of the merged output pandas.DataFrame
- how (str) – merge mode. See pandas documentation.
- on (list) – column names. See pandas documentation.
- columns1 (list) – column names of the first pandas.DataFrame. Only these columns are included in the merge. If not set, use all columns.
- columns2 (list) – column names of the second pandas.DataFrame. Only these columns are included in the merge. If not set, use all columns.
- remove_duplicate_cols2 (bool) – if True duplicate columns will be taken out before the merge (default=True)
- kwargs – all other keyword arguments are passed on to the pandas merge function.
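The operation DfMerger delegates to pandas.merge can be sketched as an inner join in pure Python; record lists stand in for the two DataFrames, and the column and key names are illustrative only.

```python
# Sketch of an inner merge on a shared key (how='inner' semantics).
left = [{"id": 1, "name": "apple"}, {"id": 2, "name": "pear"}]
right = [{"id": 2, "price": 3.0}, {"id": 3, "price": 1.5}]
on = "id"

# Index the right-hand records by the join key, then combine rows
# whose keys appear on both sides.
right_by_key = {r[on]: r for r in right}
merged = [{**l, **right_by_key[l[on]]} for l in left if l[on] in right_by_key]

print(merged)  # [{'id': 2, 'name': 'pear', 'price': 3.0}]
```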
execute()¶ Perform merging of input dataframes.
initialize()¶ Perform basic checks on provided attributes.
eskapade.analysis.links.histogrammar_filler module¶
eskapade.analysis.links.random_sample_splitter module¶
Project: Eskapade - A python-based package for data analysis.
Class: RandomSampleSplitter
Created: 2016/11/08
- Description:
- Algorithm to randomly assign records to a number of classes
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.random_sample_splitter.RandomSampleSplitter(**kwargs)¶
Bases: escore.core.element.Link
Link that randomly assigns records of an input dataframe to a number of classes.
After assigning classes, it does one of the following:
- splits the input dataframe into sub-dataframes according to the classes and stores these in the datastore;
- adds a new column with the assigned classes to the dataframe.
Records are assigned randomly.
-
__init__
(**kwargs)¶ Initialize link instance.
Parameters: - name (str) – name of link
- read_key (str) – key of data to read from datastore
- store_key (list) – keys of datasets to store in datastore. Number of sub samples equals length of store_key list (optional instead of ‘column’ and ‘nclasses’).
- column (str) – name of new column that specifies the randomly assigned class. Default is randomclass (optional instead of ‘store_key’).
- nclasses (int) – number of random classes. Needs to be set (optional instead of ‘store_key’).
- fractions (list) – list of fractions (0<fraction<1) of records assigned to the sub samples. Can be one less than n classes. Sum can be less than 1. Needs to be set.
- nevents (list) – list of numbers of random records assigned to the sub-samples. Can be one less than n classes (optional instead of ‘fractions’).
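The fraction-based class assignment can be sketched in pure Python: the fractions need not sum to 1, and the remainder goes to a final class. The fractions, record count, and seed below are illustrative only.

```python
import random

# Sketch of RandomSampleSplitter's class assignment: cumulative
# fraction boundaries partition the unit interval, and each record's
# uniform draw selects a class index.
fractions = [0.5, 0.3]          # remainder (0.2) goes to the last class
nclasses = len(fractions) + 1
n_records = 1000

# Cumulative boundaries: [0.5, 0.8]; a draw past the last boundary
# falls into the final class.
bounds = []
total = 0.0
for f in fractions:
    total += f
    bounds.append(total)

rng = random.Random(0)
classes = []
for _ in range(n_records):
    u = rng.random()
    cls = next((i for i, b in enumerate(bounds) if u < b), nclasses - 1)
    classes.append(cls)
```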
execute()¶ Execute the link.
initialize()¶ Check and initialize attributes of the link.
eskapade.analysis.links.read_to_df module¶
Project: Eskapade - A python-based package for data analysis.
Class: ReadToDf
Created: 2016/11/08
- Description:
- Algorithm to read input file(s) into a pandas dataframe and put it in the datastore.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.read_to_df.ReadToDf(**kwargs)¶
Bases: escore.core.element.Link
Reads input file(s) into a pandas dataframe.
You give the link a path where your file is located and some kwargs that go into a pandas DataFrame. The kwargs are passed into the file reader.
__init__(**kwargs)¶ Initialize link instance.
Store the configuration of link ReadToDf.
Parameters: - name (str) – Name given to the link
- path (str) – path of the file to read into a pandas DataFrame.
- key (str) – storage key for the DataStore.
- reader – the reader is determined automatically, but can be set by hand, e.g. csv, xlsx. To use the numpy reader, one of the following should be true:
  - reader is {‘numpy’, ‘np’, ‘npy’, ‘npz’}
  - path contains extension ‘npy’ or ‘npz’
  - param file_type is {‘npy’, ‘npz’}
  To use the feather reader, one of the following should be true:
  - reader is {‘feather’, ‘ft’}
  - path contains extension ‘ft’
  For when to use feather, or which numpy type, see the esk210_dataframe_restoration tutorial.
- restore_index (bool) – whether to store the index in the metadata. Default is False when the index is numeric, True otherwise.
- file_type (str) – {‘npy’, ‘npz’} when using the numpy reader. Optional, see reader for details.
- itr_over_files (bool) – iterate over individual files, default is False. If False, all files are collected in one dataframe. NB: chunksize takes priority!
- chunksize (int) – default is None. If a positive integer, the link will always iterate. chunksize requires pd.read_csv or pd.read_table.
- n_files_in_fork (int) – number of files to process if forked. Default is 1.
- kwargs – all other keyword arguments are passed on to the pandas reader.
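The fallback rules above (explicit reader setting wins, otherwise the file extension decides) can be sketched as a small dispatch table. The mapping below is illustrative and covers only a few of the extensions the link understands.

```python
import os

# Sketch of extension-based reader selection: an explicit reader
# setting wins, otherwise dispatch on the file extension.
READERS = {
    ".csv": "csv",
    ".xlsx": "xlsx",
    ".npy": "numpy",
    ".npz": "numpy",
    ".ft": "feather",
}

def pick_reader(path, reader=None):
    """Return the reader name: explicit setting wins, else extension."""
    if reader:
        return reader
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"cannot determine reader for {path!r}")
    return READERS[ext]

print(pick_reader("data/input.npz"))          # numpy
print(pick_reader("data/input.csv", "xlsx"))  # xlsx (explicit wins)
```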
config_lock¶ Get lock status of configuration.
Default lock status is False.
Returns: lock status of configuration. Return type: bool
configure_paths(lock: bool = False) → None¶ Configure paths used during execute.
This is the final part of initialization, and needs to be redone in case of forked processing. Hence this function is split off into a separate function. The function can be locked once the configuration is final.
Parameters: lock (bool) – if True, lock this part of the configuration
execute()¶ Execute the link.
Reads the input file(s) and puts the dataframe in the datastore.
initialize()¶ Initialize the link.
is_finished() → bool¶ Try to assess if the looper is done iterating over files.
Assess if the looper is done or if a next dataset is still coming up.
latest_data_length()¶ Return length of the current dataset.
set_chunk_size(size)¶ Set chunksize setting.
Parameters: size – chunk size
sum_data_length()¶ Return summed length of all datasets processed so far.
eskapade.analysis.links.read_to_df.feather_reader(path, restore_index)¶ Read a feather file from disk into a DataFrame, restoring the metadata.
Parameters: - path (str) – target file location
- restore_index (bool) – restore the index in the DataFrame. Default is True.
Returns: the DataFrame read from disk. Return type: pd.DataFrame
eskapade.analysis.links.read_to_df.numpy_reader(path, restore_index, file_type)¶ Read a numpy file from disk into a DataFrame, restoring the metadata.
Parameters: - path (str) – target file location
- restore_index (bool) – restore the index in the DataFrame. Default is True.
- file_type (str) – the file type used, {‘npy’, ‘npz’}
Raises: - AmbiguousFileType – when we can’t determine whether the file type is npy or npz
- UnhandledFileType – generic catch for when the type logic fails to exclude a case
Returns: the DataFrame read from disk. Return type: pd.DataFrame
eskapade.analysis.links.read_to_df.set_reader(path, reader, *args, **kwargs)¶ Pick the correct reader.
Based on the provided reader setting, or on the file extension.
eskapade.analysis.links.record_factorizer module¶
Project: Eskapade - A python-based package for data analysis.
Class: RecordFactorizer
Created: 2016/11/08
- Description:
- Algorithm to perform factorization of an input column of an input dataframe. E.g. a column x with values ‘apple’, ‘tree’, ‘pear’, ‘apple’, ‘pear’ is transformed into a column x with values 0, 1, 2, 0, 2, etc.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.record_factorizer.RecordFactorizer(**kwargs)¶
Bases: escore.core.element.Link
Factorize dataframe columns.
Perform factorization of input columns of an input dataframe. E.g. a column x with values ‘apple’, ‘tree’, ‘pear’, ‘apple’, ‘pear’ is transformed into a column x with values 0, 1, 2, 0, 2, etc. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to their original format.
__init__(**kwargs)¶ Initialize link instance.
Store and do basic checks on the attributes of link RecordFactorizer.
Parameters: - read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
- columns (list) – list of columns that are to be factorized
- inplace (bool) – replace original columns. Default is False. Overwrites store_key to read_key.
- convert_all_categories (bool) – if True, convert all category observables. Default is False.
- convert_all_booleans (bool) – if true, convert all boolean observables. Default is false.
- map_to_original (dict) – dictionary or key to dictionary to map factorized columns back to their originals. map_to_original is a dict of dicts, one dict for each column.
- store_key (str) – store key of output dataFrame. Default is read_key + ‘_fact’. (optional)
- sk_map_to_original (str) – store key of the dictionary mapping factorized columns to the originals. Default is ‘key’ + ‘_’ + store_key + ‘_to_original’. (optional)
- sk_map_to_factorized (str) – store key of the dictionary mapping originals to factorized columns. Default is ‘key’ + ‘_’ + read_key + ‘_to_factorized’. (optional)
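The factorization itself, together with the two mapping dictionaries the link stores, can be sketched in pure Python; the column values are illustrative only.

```python
# Sketch of factorization: each distinct value gets an integer code in
# order of first appearance, and forward/reverse mapping dicts allow
# restoring the original format.
values = ["apple", "tree", "pear", "apple", "pear"]

to_factorized = {}
factorized = []
for v in values:
    # First occurrence of each value gets the next integer code.
    if v not in to_factorized:
        to_factorized[v] = len(to_factorized)
    factorized.append(to_factorized[v])

# Reverse mapping, used to map the factorized column back.
to_original = {code: v for v, code in to_factorized.items()}

print(factorized)  # [0, 1, 2, 0, 2]
print([to_original[c] for c in factorized])  # original values restored
```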
execute()¶ Execute the link.
Perform factorization of the input columns ‘columns’ of the input dataframe. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to their original format.
initialize()¶ Initialize the link.
Initialize and (further) check the assigned attributes of the RecordFactorizer.
eskapade.analysis.links.record_vectorizer module¶
Project: Eskapade - A python-based package for data analysis.
Class: RecordVectorizer
Created: 2016/11/08
- Description:
- Algorithm to perform the vectorization of an input column of an input dataframe.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.record_vectorizer.RecordVectorizer(**kwargs)¶
Bases: escore.core.element.Link
Vectorize dataframe columns.
Perform vectorization of an input column of an input dataframe. E.g. a column x with values 1, 2 is transformed into columns x_1 and x_2, with values True or False assigned per record.
__init__(**kwargs)¶ Initialize link instance.
Store and do basic checks on the attributes of link RecordVectorizer.
Parameters: - read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
- columns (list) – list of columns that are to be vectorized
- store_key (str) – store key of output dataFrame. Default is read_key + ‘_vectorized’. (optional)
- column_compare_with (dict) – dict of unique items per column with which column values are compared. If not given, this is derived automatically from the column. (optional)
- astype (type) – store answer of comparison of column with value as certain type. Default is bool. (optional)
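The vectorization amounts to one-hot encoding: one boolean column per distinct value. A pure-Python sketch, with the column name and values illustrative only:

```python
# Sketch of RecordVectorizer: a column with a small set of values
# becomes one boolean column per value (one-hot encoding). The
# compare set is derived from the column when not provided.
column = "x"
values = [1, 2, 1, 2, 2]
column_compare_set = sorted(set(values))  # derived if not provided

vectorized = {
    f"{column}_{v}": [val == v for val in values]
    for v in column_compare_set
}

print(vectorized["x_1"])  # [True, False, True, False, False]
```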
execute()¶ Execute the link.
Perform vectorization of the input column ‘column’ of the input dataframe. The resulting dataset is stored as a new dataset.
initialize()¶ Initialize the link.
Initialize and (further) check the assigned attributes of RecordVectorizer.
eskapade.analysis.links.record_vectorizer.record_vectorizer(df, column_to_vectorize, column_compare_set, astype=<class 'bool'>)¶ Vectorize a dataframe column.
Takes the new record that is already transformed and vectorizes the given columns.
Parameters: - df – dataframe of the new record to vectorize
- column_to_vectorize (str) – column in the new record to vectorize
- column_compare_set (list) – list of values to compare the column with
Returns: dataframe of the new records.
eskapade.analysis.links.value_counter module¶
Project: Eskapade - A python-based package for data analysis.
Class: ValueCounter
Created: 2017/03/02
- Description:
- Algorithm to do value_counts() on single columns of a pandas dataframe, or groupby().size() on multiple columns, both returned as dictionaries. It is possible to do cleaning of these dicts by rejecting certain keys or removing inconsistent data types. Numeric and timestamp columns are converted to bin indices before the binning is applied. Results are stored as 1D Histograms or as ValueCounts objects.
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.value_counter.ValueCounter(**kwargs)¶
Bases: eskapade.analysis.histogram_filling.HistogramFillerBase
Count values in a pandas dataframe.
ValueCounter does value_counts() on single columns of a pandas dataframe, or groupby().size() on multiple columns. Results of both are returned as same-style dictionaries.
Numeric and timestamp columns are converted to bin indices before the binning is applied. The binning can be provided as input.
It is possible to clean these dicts by rejecting certain keys or removing inconsistent data types. Results are stored as 1D Histograms or as ValueCounts objects.
An example is available in: tutorials/esk302_histogram_filling_plotting.py
__init__(**kwargs)¶ Initialize link instance.
Parameters: - name (str) – name of link
- read_key (str) – key of input data to read from data store
- store_key_counts (str) – key of output data to store ValueCounts objects in data store
- store_key_hists (str) – key of output data to store histograms in data store
- columns (list) – columns to pick up from input data (default is all columns)
- bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns
Example bin_specs dictionary is:
>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
>>>              'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]},
>>>              'date': {'bin_width': np.timedelta64(30, 'D'),
>>>                       'bin_offset': np.datetime64('2010-01-04')}}
Parameters: - var_dtype (dict) – dict of datatypes of the columns to study from the dataframe. If not provided, try to determine datatypes directly from the dataframe.
- store_at_finalize (bool) – Store histograms and/or ValueCount object in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
- drop_inconsistent_key_types (bool) – clean up histograms and/or ValueCount objects by removing all bins/keys with inconsistent datatypes. By default, compare with the data types in the var_dtype dictionary.
- drop_keys (dict) – dictionary used for dropping specific keys from the created value_counts dictionaries
Example drop_keys dictionary is:
>>> drop_keys = {'x': [1, 4, 8, 19],
>>>              'y': ['apple', 'pear', 'tomato'],
>>>              'x:y': [(1, 'apple'), (19, 'tomato')]}
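The numeric-to-bin-index conversion driven by bin_width and bin_offset can be sketched in pure Python before the counting step; the values below are illustrative only.

```python
# Sketch of the bin-index conversion applied before value counting:
# with a bin_width and bin_offset, each numeric value maps to the
# integer index of the bin it falls in.
bin_width = 1
bin_offset = 0

def bin_index(x):
    # Floor division places x in its bin relative to the offset.
    return int((x - bin_offset) // bin_width)

values = [0.2, 0.7, 1.5, 2.1, 1.9]
counts = {}
for x in values:
    idx = bin_index(x)
    counts[idx] = counts.get(idx, 0) + 1

print(counts)  # {0: 2, 1: 2, 2: 1}
```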
drop_inconsistent_keys(columns, obj)¶ Drop inconsistent keys.
Drop inconsistent keys from a ValueCounts or Histogram object.
Parameters: - columns (list) – columns key to retrieve desired datatypes
- obj (object) – ValueCounts or Histogram object to drop inconsistent keys from
fill_histogram(idf, columns)¶ Fill input histogram with column(s) of input dataframe.
Parameters: - idf – input dataframe used for filling histogram
- columns (list) – histogram column(s)
finalize()¶ Finalize ValueCounter.
initialize()¶ Initialize the link.
process_and_store()¶ Make, clean, and store ValueCount objects.
process_columns(df)¶ Process columns before histogram filling.
Specifically, timestamp columns are converted to integers and numeric variables are converted to indices.
Parameters: df – input (pandas) dataframe. Returns: output (pandas) dataframe with converted timestamp columns. Return type: pandas DataFrame
eskapade.analysis.links.write_from_df module¶
Project: Eskapade - A python-based package for data analysis.
Class: WriteFromDf
Created: 2016/11/08
- Description:
- Algorithm to write a DataFrame from the DataStore to disk
- Authors:
- KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands
Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.
class eskapade.analysis.links.write_from_df.WriteFromDf(**kwargs)¶
Bases: escore.core.element.Link
Write a DataFrame from the DataStore to disk.
__init__(**kwargs)¶ Store the configuration of the link.
Parameters: - name (str) – Name given to the link
- key (str) – the DataStore key
- path (str) – path where to save the DataFrame
- writer – file extension that can be written by a pandas writer function from pd.DataFrame, or by the numpy or feather writers. For example, ‘csv’ will trigger DataFrame.to_csv. To use the numpy writer, specify one of: {‘numpy’, ‘np’, ‘npy’, ‘npz’}. To use feather, specify one of: {‘feather’, ‘ft’}. If writer is not passed, the path must contain a known file extension: valid numpy extensions are {‘npy’, ‘npz’}, feather {‘ft’}.
Note: the numpy and feather writers will preserve metadata such as the dtypes of each column, and the index if it is non-numeric.
- dictionary (dict) – keys (as in the key arg above) and paths (as in the path arg above); the link will write out all the keys to the associated paths.
- add_counter_to_name (bool) – if True, add an index to the output file name. Useful when running in loops. Default is False.
- store_index (bool) – whether the index should be stored as metadata. Default is False, unless the index is non-numeric.
- kwargs – all other keyword arguments are passed on to the pandas writers.
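The ‘dictionary’ and ‘add_counter_to_name’ options can be sketched together: each datastore key is written to its associated path, with an optional counter appended to the file name. The keys, paths, and counter value below are illustrative only.

```python
import os

# Sketch of writing multiple datastore keys to their associated paths,
# appending a counter to each file name when running in a loop.
dictionary = {"hits": "out/hits.csv", "runs": "out/runs.csv"}
add_counter_to_name = True
counter = 7

output_paths = {}
for key, path in dictionary.items():
    if add_counter_to_name:
        base, ext = os.path.splitext(path)
        path = f"{base}_{counter}{ext}"
    # Here the link would call the pandas writer, e.g. df.to_csv(path).
    output_paths[key] = path

print(output_paths["hits"])  # out/hits_7.csv
```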
execute()¶ Execute the link.
Pick up the dataframe and write it to disk.
initialize()¶ Initialize the link.
eskapade.analysis.links.write_from_df.feather_writer(df, path, store_index)¶ Write df to disk in feather format, preserving the metadata.
Parameters: - df (DataFrame) – pandas DataFrame to write out
- path (str) – target file location
- store_index (bool) – store index in DataFrame, default is True
eskapade.analysis.links.write_from_df.get_writer(path, writer, *args, **kwargs)¶ Pick the correct writer.
Based on the provided writer setting, or on the file extension.
eskapade.analysis.links.write_from_df.numpy_writer(df, path, store_index)¶ Write df to disk in numpy format, preserving the metadata.
Parameters: - df (DataFrame) – pandas DataFrame to write out
- path (str) – target file location
- store_index (bool) – store index in DataFrame
Module contents¶
-
class
eskapade.analysis.links.
ApplyFuncToDf
(**kwargs)¶ Bases:
escore.core.element.Link
Apply functions to data-frame.
Applies one or more functions to a (grouped) dataframe column or an entire dataframe. In the latter case, this can be done row wise or column wise. The input dataframe will be overwritten.
-
__init__
(**kwargs)¶ Initialize link instance.
Parameters: - read_key (str) – data-store input key
- store_key (str) – data-store output key
- apply_funcs (list) – functions to apply (list of dicts) - ‘func’: function to apply - ‘colout’ (string): output column - ‘colin’ (string, optional): input column - ‘entire’ (boolean, optional): apply to the entire dataframe? - ‘args’ (tuple, optional): args for ‘func’ - ‘kwargs’ (dict, optional): kwargs for ‘func’ - ‘groupby’ (list, optional): column names to group by - ‘groupbyColout’ (string) output column after the split-apply-combine combination
- add_columns (dict) – columns to add to output (name, column)
-
add_apply_func
(func, out_column, in_column='', *args, **kwargs)¶ Add function to be applied to dataframe.
-
execute
()¶ Execute the link.
-
groupbyapply
(df, groupby_columns, applyfunc, *args, **kwargs)¶ Apply groupby to dataframe.
-
initialize
()¶ Initialize the link.
-
-
class
eskapade.analysis.links.
ApplySelectionToDf
(**kwargs)¶ Bases:
escore.core.element.Link
Applies queries with sub-selections to a pandas dataframe.
-
__init__
(**kwargs)¶ Initialize link instance.
Input dataframe is not overwritten, unless instructed to do so in kwargs.
Parameters: - name (str) – name of link
- read_key (str) – key of data to read from data store
- store_key (str) – key of data to store in data store. If not set read_key is overwritten.
- query_set (list) – list of strings, query expressions to evaluate in the same order, see pandas documentation
- select_columns (list) – column names to select after querying
- continue_if_failure (bool) – if True continues with next query after failure (optional)
- kwargs – all other key word arguments are passed on to the pandas queries.
-
execute
()¶ Execute the link.
Applies queries or column selection to a pandas DataFrame. Input dataframe is not overwritten, unless told to do so in kwargs.
- Apply queries, in order of provided query list.
- Select columns (if provided).
-
initialize
()¶ Initialize the link.
Perform checks on provided attributes.
-
-
class
eskapade.analysis.links.
BasicGenerator
(**kwargs)¶ Bases:
escore.core.element.Link
Generate data with basic distributions.
-
__init__
(**kwargs)¶ Initialize link instance.
Parameters: - key (str) – key of output data in data store
- columns (list) – output column names
- size (int) – number of variable values
- gen_config (dict) – generator configuration for each variable
- gen_seed (int) – generator random seed
-
execute
()¶ Execute the link.
-
initialize
()¶ Initialize the link.
-
-
class
eskapade.analysis.links.
DfConcatenator
(**kwargs)¶ Bases:
escore.core.element.Link
Concatenates multiple pandas datadrames.
-
__init__
(**kwargs)¶ Initialize link instance.
Parameters: - name (str) – name of link
- store_key (str) – key of data to store in data store
- read_keys (list) – keys of pandas dataframes in the data store
- ignore_missing_input (bool) – Skip missing input datasets. If all missing, store empty dataset. Default is false.
- kwargs – all other key word arguments are passed on to pandas concat function.
-
execute
()¶ Execute the link.
Perform concatenation of multiple pandas datadrames.
-
initialize
()¶ Initialize the link.
-
-
class
eskapade.analysis.links.
DfMerger
(**kwargs)¶ Bases:
escore.core.element.Link
Merges two pandas dataframes.
-
__init__
(**kwargs)¶ Initialize link instance.
Store the configuration of the link.
Parameters: - name (str) – name of link
- input_collection1 (str) – datastore key of the first pandas.DataFrame to merge
- input_collection2 (str) – datastore key of the second pandas.DataFrame to merge
- output_collection (str) – datastore key of the merged output pandas.DataFrame
- how (str) – merge modus. See pandas documentation.
- on (list) – column names. See pandas documentation.
- columns1 (list) – column names of the first pandas.DataFrame. Only these columns are included in the merge. If not set, use all columns.
- columns2 (list) – column names of the second pandas.DataFrame. Only these columns are included in the merge. If not set, use all columns.
- remove_duplicate_cols2 (bool) – if True duplicate columns will be taken out before the merge (default=True)
- kwargs – all other key word arguments are passed on to the pandas merge function.
-
execute
()¶ Perform merging of input dataframes.
-
initialize
()¶ Perform basic checks on provided attributes.
-
-
class
eskapade.analysis.links.
RandomSampleSplitter
(**kwargs)¶ Bases:
escore.core.element.Link
Link that randomly assigns records of an input dataframe to a number of classes.
After assigning classes does one of the following:
- splits the input dataframe into sub dataframes according classes and stores the sub dataframes into the datastore;
- add a new column with assigned classes to the dataframe.
Records are assigned randomly.
-
__init__
(**kwargs)¶ Initialize link instance.
Parameters: - name (str) – name of link
- read_key (str) – key of data to read from datastore
- store_key (list) – keys of datasets to store in datastore. Number of sub samples equals length of store_key list (optional instead of ‘column’ and ‘nclasses’).
- column (str) – name of new column that specifies the randomly assigned class. Default is randomclass (optional instead of ‘store_key’).
- nclasses (int) – number of random classes. Needs to be set (optional instead of ‘store_key’).
- fractions (list) – list of fractions (0<fraction<1) of records assigned to the sub samples. Can be one less than n classes. Sum can be less than 1. Needs to be set.
- nevents (list) – list of number of random records assigned to the sub samples Can be one less than n classes (optional instead of ‘fractions’).
-
execute
()¶ Execute the link.
-
initialize
()¶ Check and initialize attributes of the link.
-
class
eskapade.analysis.links.
ReadToDf
(**kwargs)¶ Bases:
escore.core.element.Link
Reads input file(s) to a pandas dataframe.
You give the link a path where your file is located and some kwargs that go into a pandas DataFrame. The kwargs are passed into the file reader.
-
__init__
(**kwargs)¶ Initialize link instance.
Store the configuration of link ReadToDf.
Parameters: - name (str) – Name given to the link
- path (str) – path of your file to read into pandas DataFrame .
- key (str) – storage key for the DataStore.
- reader – reader is determined automatically. But can be set by hand, e.g. csv, xlsx. To use the numpy reader one of the following should be true:
- reader is {‘numpy’, ‘np’, ‘npy’, ‘npz’}
- path contains extensions {‘npy’, ‘npz’}
- param file_type is {‘npy’, ‘npz’}
To use the feather reader one of the following should be true:
- reader is {‘feather’, ‘ft’}
- path contains extensions ‘ft’
When to use feather or which numpy type see the esk210_dataframe_restoration tutorial :param bool restore_index: whether to store the index in the metadata. Default is False when the index is numeric, True otherwise. :param str file_type: {‘npy’, ‘npz’} when using the numpy reader Optional, see reader for details. :param bool itr_over_files: Iterate over individual files, default is false. If false, are files are collected in one dataframe. NB chunksize takes priority! :param int chunksize: Default is none. If positive integer then will always iterate. chunksize requires pd.read_csv or pd.read_table. :param int n_files_in_fork: number of files to process if forked. Default is 1. :param kwargs: all other key word arguments are passed on to the pandas reader.
-
config_lock
¶ Get lock status of configuration
Default lock status is False.
Returns: lock status of configuration Return type: bool
-
configure_paths
(lock: bool = False) → None¶ Configure paths used during execute.
This is the final part of initialization, and needs to be redone in case of forked processing. Hence this part is split off into a separate function, which can be locked once the configuration is final.
Parameters: lock (bool) – if True, lock this part of the configuration
-
execute
()¶ Execute the link.
Reads the input file(s) and puts the dataframe in the datastore.
-
initialize
()¶ Initialize the link.
-
is_finished
() → bool¶ Try to assess if looper is done iterating over files.
Assess if looper is done or if a next dataset is still coming up.
-
latest_data_length
()¶ Return length of current dataset.
-
set_chunk_size
(size)¶ Set chunksize setting.
Parameters: size – chunk size
-
sum_data_length
()¶ Return the summed length of all datasets processed so far.
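The chunked iteration that chunksize enables boils down to the pandas behaviour below (a minimal sketch using an in-memory buffer in place of a file path, not the link's actual code):

```python
import io

import pandas as pd

# In-memory CSV standing in for the file at `path`.
csv_data = io.StringIO("x,y\n1,a\n2,b\n3,c\n4,d\n")

# With a positive chunksize, pd.read_csv returns an iterator of dataframes;
# ReadToDf exposes this looping via execute() and is_finished().
chunks = list(pd.read_csv(csv_data, chunksize=2))
total = sum(len(chunk) for chunk in chunks)
print(len(chunks), total)  # 2 4
```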
-
-
class
eskapade.analysis.links.
RecordFactorizer
(**kwargs)¶ Bases:
escore.core.element.Link
Factorize data-frame columns.
Perform factorization of an input column of an input dataframe. E.g. a column x with values ‘apple’, ‘tree’, ‘pear’, ‘apple’, ‘pear’ is transformed into a column x with values 0, 1, 2, 0, 2. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to the original format.
-
__init__
(**kwargs)¶ Initialize link instance.
Store and do basic check on the attributes of link RecordFactorizer
Parameters: - read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
- columns (list) – list of columns that are to be factorized
- inplace (bool) – replace the original columns. Default is False. If True, store_key is overwritten with read_key.
- convert_all_categories (bool) – if True, convert all category observables. Default is False.
- convert_all_booleans (bool) – if True, convert all boolean observables. Default is False.
- map_to_original (dict) – dictionary, or key to a dictionary, to map factorized columns back to the original. map_to_original is a dict of dicts, one dict per column.
- store_key (str) – store key of output dataFrame. Default is read_key + ‘_fact’. (optional)
- sk_map_to_original (str) – store key of the dictionary to map factorized columns to the original. Default is ‘key’ + ‘_’ + store_key + ‘_to_original’. (optional)
- sk_map_to_factorized (str) – store key of the dictionary to map original to factorized columns. Default is ‘key’ + ‘_’ + read_key + ‘_to_factorized’. (optional)
-
execute
()¶ Execute the link.
Perform factorization of the input columns ‘columns’ of the input dataframe. The resulting dataset is stored as a new dataset. Alternatively, map transformed columns back to the original format.
-
initialize
()¶ Initialize the link.
Initialize and (further) check the assigned attributes of the RecordFactorizer
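The transformation this link performs corresponds to pandas factorization; a minimal sketch of the idea (not the link's actual implementation; column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'x': ['apple', 'tree', 'pear', 'apple', 'pear']})

# Factorize: integer codes per record, plus the uniques for mapping back.
codes, uniques = pd.factorize(df['x'])
df['x_fact'] = codes
map_to_original = dict(enumerate(uniques))

print(df['x_fact'].tolist())  # [0, 1, 2, 0, 2]
print(map_to_original)        # {0: 'apple', 1: 'tree', 2: 'pear'}
```

Inverting `map_to_original` gives the original-to-factorized mapping that the link stores under sk_map_to_factorized.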
-
-
class
eskapade.analysis.links.
RecordVectorizer
(**kwargs)¶ Bases:
escore.core.element.Link
Vectorize data-frame columns.
Perform vectorization of an input column of an input dataframe. E.g. a column x with values 1, 2 is transformed into columns x_1 and x_2, with values True or False assigned per record.
-
__init__
(**kwargs)¶ Initialize link instance.
Store and do basic check on the attributes of link RecordVectorizer.
Parameters: - read_key (str) – key to read dataframe from the data store. Dataframe of records that is to be transformed.
- columns (list) – list of columns that are to be vectorized
- store_key (str) – store key of output dataFrame. Default is read_key + ‘_vectorized’. (optional)
- column_compare_with (dict) – dict of unique items per column with which column values are compared. If not given, this is derived automatically from the column. (optional)
- astype (type) – store answer of comparison of column with value as certain type. Default is bool. (optional)
-
execute
()¶ Execute the link.
Perform vectorization of the input column ‘column’ of the input dataframe. The resulting dataset is stored as a new dataset.
-
initialize
()¶ Initialize the link.
Initialize and (further) check the assigned attributes of RecordVectorizer.
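The vectorization described above amounts to comparing the column against each of its unique values; a minimal pandas sketch (not the link's actual code), with `astype` left at its default of bool:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 2]})

# One boolean output column per unique value in x.
for value in sorted(df['x'].unique()):
    df['x_{}'.format(value)] = (df['x'] == value).astype(bool)

print(df['x_1'].tolist())  # [True, False, True, False]
print(df['x_2'].tolist())  # [False, True, False, True]
```

Passing column_compare_with would replace the derived uniques with an explicit list of values per column.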
-
-
class
eskapade.analysis.links.
ValueCounter
(**kwargs)¶ Bases:
eskapade.analysis.histogram_filling.HistogramFillerBase
Count values in Pandas data frame.
ValueCounter does value_counts() on single columns of a pandas dataframe, or groupby().size() on multiple columns. Results of both are returned as same-style dictionaries.
Numeric and timestamp columns are converted to bin indices before the binning is applied. The binning can be provided as input.
It is possible to do cleaning of these dicts by rejecting certain keys or removing inconsistent data types. Results are stored as 1D Histograms or as ValueCounts objects.
Example is available in: tutorials/esk302_histogram_filling_plotting.py
-
__init__
(**kwargs)¶ Initialize link instance.
Parameters: - name (str) – name of link
- read_key (str) – key of input data to read from data store
- store_key_counts (str) – key of output data to store ValueCounts objects in data store
- store_key_hists (str) – key of output data to store histograms in data store
- columns (list) – columns to pick up from input data (default is all columns)
- bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns
Example bin_specs dictionary is:
>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
...              'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]},
...              'date': {'bin_width': np.timedelta64(30, 'D'),
...                       'bin_offset': np.datetime64('2010-01-04')}}
Parameters: - var_dtype (dict) – dict of datatypes of the columns to study from the dataframe. If not provided, try to determine datatypes directly from the dataframe.
- store_at_finalize (bool) – Store histograms and/or ValueCount object in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
- drop_inconsistent_key_types (bool) – cleanup histograms and/or ValueCount objects by removing all bins/keys with inconsistent datatypes. By default, compare with the data types in the var_dtype dictionary.
- drop_keys (dict) – dictionary used for dropping specific keys from the created value_counts dictionaries
Example drop_keys dictionary is:
>>> drop_keys = {'x': [1, 4, 8, 19],
...              'y': ['apple', 'pear', 'tomato'],
...              'x:y': [(1, 'apple'), (19, 'tomato')]}
-
drop_inconsistent_keys
(columns, obj)¶ Drop inconsistent keys.
Drop inconsistent keys from a ValueCounts or Histogram object.
Parameters: - columns (list) – columns key to retrieve desired datatypes
- obj (object) – ValueCounts or Histogram object to drop inconsistent keys from
-
fill_histogram
(idf, columns)¶ Fill input histogram with column(s) of input dataframe.
Parameters: - idf – input data frame used for filling histogram
- columns (list) – histogram column(s)
-
finalize
()¶ Finalize ValueCounter.
-
initialize
()¶ Initialize the link.
-
process_and_store
()¶ Make, clean, and store ValueCount objects.
-
process_columns
(df)¶ Process columns before histogram filling.
Specifically, timestamp columns are converted to integers and numeric variables are converted to bin indices.
Parameters: df – input (pandas) data frame Returns: output (pandas) data frame with converted timestamp columns Return type: pandas DataFrame
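The counting behaviour described above maps onto plain pandas calls; a minimal sketch (not the link's implementation, which also handles binning and cleaning):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2], 'y': ['a', 'b', 'a']})

# Single column: value_counts(); multiple columns: groupby().size().
# Both results are exposed as same-style dictionaries.
counts_x = df['x'].value_counts().to_dict()
counts_xy = df.groupby(['x', 'y']).size().to_dict()

print(counts_x)
print(counts_xy)
```

Here `counts_x` is `{1: 2, 2: 1}` and `counts_xy` maps tuples such as `(1, 'a')` to their counts.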
-
-
class
eskapade.analysis.links.
WriteFromDf
(**kwargs)¶ Bases:
escore.core.element.Link
Write a DataFrame from the DataStore to disk.
-
__init__
(**kwargs)¶ Store the configuration of the link.
Parameters: - name (str) – Name given to the link
- key (str) – the DataStore key
- path (str) – path where to save the DataFrame
- writer – file extension that can be written by a pandas writer function of pd.DataFrame, or by the numpy or feather writers. For example: ‘csv’ triggers DataFrame.to_csv. To use the numpy writer, specify one of the following:
{‘numpy’, ‘np’, ‘npy’, ‘npz’}
To use feather, specify one of: {‘feather’, ‘ft’}. If writer is not passed, the path must contain a known file extension: valid numpy extensions are {‘npy’, ‘npz’}, feather {‘ft’}.
Note: the numpy and feather writers preserve metadata such as the dtype of each column, and the index if non-numeric.
Parameters: - dictionary (dict) – dict of keys and paths (as in the args above); all keys are written out to the associated paths.
- add_counter_to_name (bool) – if true, add an index to the output file name. Useful when running in loops. Default is false.
- store_index (bool) – whether the index should be stored as metadata. Default is False unless the index is non-numeric
- kwargs – all other key word arguments are passed on to the pandas writers.
-
execute
()¶ Execute the link.
Pick up the dataframe and write to disk.
-
initialize
()¶ Initialize the link.
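For the plain ‘csv’ writer, the round trip through DataFrame.to_csv can be sketched with an in-memory buffer (a minimal sketch, not the link's code):

```python
import io

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})

# The 'csv' writer triggers DataFrame.to_csv; round-trip through a buffer.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)

print(restored.equals(df))  # True
```

Note that csv, unlike the numpy and feather writers, does not preserve dtype metadata in general; here the round trip happens to restore the original dtypes.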
-
-
class
eskapade.analysis.links.
HistogrammarFiller
(**kwargs)¶ Bases:
eskapade.analysis.histogram_filling.HistogramFillerBase
Fill histogrammar sparse-bin histograms.
Algorithm to fill histogrammar style sparse-bin and category histograms.
It is possible to do after-filling cleaning of these histograms by rejecting certain keys or removing inconsistent data types. Timestamp columns are converted to nanoseconds before the binning is applied. Final histograms are stored in the datastore.
Example is available in: tutorials/esk303_hgr_filler_plotter.py
-
__init__
(**kwargs)¶ Initialize link instance.
Store and do basic check on the attributes of link HistogrammarFiller.
Parameters: - name (str) – name of link
- read_key (str) – key of input data to read from data store
- store_key (str) – key of output data to store histograms in data store
- columns (list) – columns to pick up from input data (default is all columns)
- bin_specs (dict) – dictionaries used for rebinning numeric or timestamp columns
Example bin_specs dictionary is:
>>> bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
...              'y': {'bin_edges': [0, 2, 3, 4, 5, 7, 8]}}
Parameters: - var_dtype (dict) – dict of datatypes of the columns to study from the dataframe. If not provided, try to determine datatypes directly from the dataframe.
- quantity (dict) – dictionary of lambda functions of how to parse certain columns
Example quantity dictionary is:
>>> quantity = {'y': lambda x: x}
Parameters: - store_at_finalize (bool) – Store histograms in datastore at finalize(), not at execute(). Useful when looping over datasets. Default is False.
- drop_keys (dict) – dictionary used for dropping specific keys from the bins dictionaries of the histograms
Example drop_keys dictionary is:
>>> drop_keys = {'x': [1, 4, 8, 19],
...              'y': ['apple', 'pear', 'tomato'],
...              'x:y': [(1, 'apple'), (19, 'tomato')]}
-
construct_empty_hist
(columns)¶ Create an (empty) histogram of right type.
Create a multi-dim histogram by iterating through the columns in reverse order and passing a single-dim hist as input to the next column.
Parameters: columns (list) – histogram columns Returns: created histogram Return type: histogrammar.Count
-
fill_histogram
(idf, columns)¶ Fill input histogram with column(s) of input dataframe.
Parameters: - idf – input data frame used for filling histogram
- columns (list) – histogram column(s)
-
process_and_store
()¶ Process and store histogrammar objects.
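The timestamp handling mentioned above (conversion to nanoseconds before binning) can be sketched with plain numpy/pandas, using a 30-day bin width and an offset as in the bin_specs examples earlier (a hypothetical sketch, not the filler's actual code):

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime(['2010-01-04', '2010-02-03']))

# Timestamps become integer nanoseconds since the epoch before binning.
ns = dates.astype('int64')
bin_width = np.timedelta64(30, 'D').astype('timedelta64[ns]').astype('int64')
bin_offset = np.datetime64('2010-01-04', 'ns').astype('int64')

bin_indices = (ns - bin_offset) // bin_width
print(bin_indices.tolist())  # [0, 1]
```

The two dates are exactly 30 days apart, so they land in adjacent bins.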
-