datto package

datto.CleanDataframe module

class datto.CleanDataframe.CleanDataframe[source]

Bases: object

Clean data using NLP, regex, calculations, etc.

batch_merge_operation(df_1, df_2, num_splits, identifier_col, merge_col)[source]

Merge two Pandas DataFrames in chunks for faster processing.

Parameters
  • df_1 (DataFrame) –

  • df_2 (DataFrame) –

  • num_splits (int) –

  • identifier_col (str) –

  • merge_col (str) –

Returns

new_df

Return type

DataFrame
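The chunked-merge idea can be sketched as follows. This is a hypothetical re-implementation (function and variable names are my own), not datto's actual code:

```python
import numpy as np
import pandas as pd

def batch_merge(df_1, df_2, num_splits, identifier_col, merge_col):
    """Merge df_2 onto df_1 in pieces, splitting df_1 by its identifier column.

    Sketch of the chunked approach; the real datto method may differ in details.
    """
    # Split the unique identifiers into num_splits roughly equal groups
    chunks = np.array_split(df_1[identifier_col].unique(), num_splits)
    merged_parts = []
    for chunk_ids in chunks:
        part = df_1[df_1[identifier_col].isin(chunk_ids)]
        merged_parts.append(part.merge(df_2, on=merge_col, how="left"))
    return pd.concat(merged_parts, ignore_index=True)

# Small demonstration with hypothetical data
orders = pd.DataFrame({"user_id": [1, 2, 3, 4], "plan": ["a", "b", "a", "b"]})
plans = pd.DataFrame({"plan": ["a", "b"], "price": [10, 20]})
merged = batch_merge(orders, plans, num_splits=2,
                     identifier_col="user_id", merge_col="plan")
```

Merging chunk by chunk keeps peak memory lower than one large merge, at the cost of a concat at the end.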

batch_pandas_operation(df, num_splits, identifier_col, func)[source]

Use a function on a Pandas DataFrame in chunks for faster processing.

Parameters
  • df (DataFrame) –

  • num_splits (int) –

  • identifier_col (str) –

  • func (Function) –

Returns

new_df

Return type

DataFrame

clean_column_names(df)[source]

Rename all columns to use underscores so they can be referenced without bracket formatting.

Parameters

df (DataFrame) –

Returns

df

Return type

DataFrame
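The renaming rule can be illustrated with a small sketch. The exact set of characters datto replaces is an assumption here:

```python
import re

def clean_column_name(name):
    """Normalize one column name: lowercase, strip whitespace,
    and collapse runs of non-alphanumeric characters into underscores.

    Illustrative sketch only; datto's actual rule may differ.
    """
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    return cleaned.strip("_").lower()
```

After cleaning, `df["First Name"]` becomes accessible as `df.first_name`.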

compress_df(df)[source]

Compresses each dataframe column as much as possible depending on type and values.

Parameters

df (DataFrame) –

Returns

df

Return type

DataFrame

fill_nulls_using_regression_model(X_train, X_test)[source]

Trains a regression model on non-null data and predicts values to fill in nulls.

Parameters
  • X_train (pd.DataFrame) –

  • X_test (pd.DataFrame) –

Returns

  • X_train (pd.DataFrame)

  • X_test (pd.DataFrame)

fix_col_data_type(df, col, desired_dt)[source]

Change column datatype using the best method for each type.

Parameters
  • df (DataFrame) –

  • col (str) – Column to change the dtype for

  • desired_dt (str) – {‘float’, ‘int’, ‘datetime’, ‘str’}

Returns

df

Return type

DataFrame
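A minimal sketch of per-dtype conversion using pandas' `to_datetime`/`to_numeric` helpers; datto's actual handling may cover more edge cases:

```python
import pandas as pd

def fix_col_data_type(df, col, desired_dt):
    """Convert a column using the appropriate pandas helper per target dtype.

    Sketch of the idea, not datto's exact implementation.
    """
    if desired_dt == "datetime":
        df[col] = pd.to_datetime(df[col], errors="coerce")
    elif desired_dt in ("float", "int"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
        if desired_dt == "int":
            # Nullable integer dtype keeps NaNs from coercion failures
            df[col] = df[col].astype("Int64")
    elif desired_dt == "str":
        df[col] = df[col].astype(str)
    return df

# Hypothetical example: one value cannot be parsed and becomes null
df = fix_col_data_type(pd.DataFrame({"amount": ["1", "2", "oops"]}),
                       "amount", "int")
```

Using `errors="coerce"` trades silent nulls for not raising on bad values, which suits bulk cleaning.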

lematize(text)[source]

Lemmatize text using spaCy.

Parameters

text (str) –

Return type

list of spacy tokens

make_uuid(id_num)[source]

Make a UUID from a text string.

Parameters

id_num (str) –

Returns

uuid

Return type

str
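A deterministic UUID can be derived from a text string with the standard library's name-based `uuid5`; whether datto uses `uuid5` specifically is an assumption:

```python
import uuid

def make_uuid(id_num):
    """Derive a deterministic UUID string from any text id.

    Sketch using uuid5 (name-based, SHA-1); the same input always
    yields the same UUID, which is the property the docstring implies.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_OID, str(id_num)))
```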

remove_duplicate_columns(df)[source]

Remove columns with the same name

Parameters

df (DataFrame) –

Returns

df

Return type

DataFrame

remove_email_greetings_signatures(text)[source]

In order to obtain only the main text of an email, this method removes greetings, signoffs, and signatures by dropping sentences in which fewer than 5% of the words are verbs. Does not replace links.

Inspiration from: https://github.com/mynameisvinn/EmailParser

Parameters

text (str) –

Returns

text

Return type

str

remove_links(text)[source]

Parameters

text (str) –

Returns

cleaned_text

Return type

str

remove_names(text)[source]
Parameters

text (str) –

Returns

cleaned_text

Return type

str

remove_pii(text)[source]

Remove common patterns of personally identifiable information (PII).

Parameters

text (str) –

Returns

cleaned_text

Return type

str
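The general approach can be sketched with a few illustrative regexes. These patterns are examples, not datto's actual ones:

```python
import re

# Illustrative PII patterns -- NOT datto's actual regexes
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),            # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US phone numbers
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),              # SSN-like numbers
]

def remove_pii(text):
    """Replace each matched PII pattern with a redaction marker."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

redacted = remove_pii("Email jane@example.com or call 555-123-4567.")
```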

datto.DataConnections module

class datto.DataConnections.NotebookConnections[source]

Bases: object

Convert between Jupyter notebooks & Python scripts

save_as_notebook(file_path)[source]

Opens a Python script and saves as a Jupyter notebook.

Parameters

file_path (str) –

save_as_script(file_path)[source]

Opens a Jupyter notebook file, cleans it, and saves as a Python script.

Parameters

file_path (str) –

class datto.DataConnections.S3Connections[source]

Bases: object

Interact with S3

load_from_s3(directory_path, object_name)[source]

Load a pickled object from s3. Note: The pickle module is not secure. Only unpickle data you trust/saved yourself.

Parameters
  • directory_path (str) – Starts with bucket name, slash any subdirectories

  • object_name (str) –

Return type

saved_object

save_to_s3(directory_path, object_to_save, object_name)[source]

Pickle and save an object to s3. Creates the folder specified if it does not yet exist.

Parameters
  • directory_path (str) – Starts with bucket name, slash any subdirectories

  • object_to_save (any object with a type that can be pickled) –

  • object_name (str) –

Return type

None

class datto.DataConnections.SQLConnections(dbname=None, host=None, port=None, user=None, password=None)[source]

Bases: object

Connect with a SQL database

run_sql_redshift(query)[source]

Pandas doesn’t integrate with Redshift directly. Instead use psycopg2 to connect and transform results into a DataFrame manually.

Parameters

query (str) –

Returns

df

Return type

DataFrame
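The cursor-to-DataFrame pattern described above can be demonstrated with the stdlib's sqlite3 standing in for psycopg2 (the DB-API cursor interface is the same); `query_to_df` is a hypothetical helper:

```python
import sqlite3
import pandas as pd

def query_to_df(conn, query):
    """Run a query through a DB-API cursor and build a DataFrame manually,
    mirroring the pattern the Redshift method describes."""
    cursor = conn.cursor()
    cursor.execute(query)
    # cursor.description holds one tuple per column; name is the first field
    columns = [desc[0] for desc in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)

# In-memory database for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "grace")])
df = query_to_df(conn, "SELECT id, name FROM users ORDER BY id")
```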

class datto.DataConnections.SlackConnections(slack_api_token=None)[source]

Bases: object

Retrieve Slack messages

get_messages(channels, remove_bot_messages=True, excluded_user_ids=[], messages_limit=inf)[source]

Get messages from one or more Slack channels.

Parameters
  • channels (list) –

  • remove_bot_messages (bool) –

  • excluded_user_ids (list) –

  • messages_limit (int) – Default is to fetch all messages

Returns

df

Return type

DataFrame

datto.Eda module

class datto.Eda.Eda[source]

Bases: object

Exploratory data analysis (EDA)

bar_graphs_by_col(df, path='../images/', group_by_var=None)[source]

Makes a bar graph for each categorical column.

Parameters
  • df (DataFrame) –

  • path (str) –

  • group_by_var (str) – Variable to group bar graphs by

check_for_mistyped_cols(numerical_vals, categorical_vals)[source]

Check for columns coded incorrectly

Parameters
  • numerical_vals (list) –

  • categorical_vals (list) –

Returns

mistyped_cols

Return type

list

check_unique_by_identifier_col(df, identifier_col)[source]

Check if there are duplicates by entity (e.g. user, item).

Parameters
  • df (DataFrame) –

  • identifier_col (str) –

Returns

dup_rows

Return type

DataFrame

find_cols_to_exclude(df)[source]

Returns columns that may not be helpful for model building.

Exclusion criteria:

  • Possible PII (address, name, username, date, etc. in col name)

  • Large proportion of nulls

  • Only 1 value in entire col

  • Dates

  • Low variance in col values

  • Large number of categorical values

Parameters

df (DataFrame) –

Returns

lst

Return type

list

find_correlated_features(df)[source]

Find & sort correlated features

Parameters

df (DataFrame) –

Returns

s

Return type

Series

sample_unique_vals(df)[source]

Examine a few unique vals in each column

Parameters

df (DataFrame) –

separate_cols_by_type(df)[source]

Split the DataFrame into two groups by type

Parameters

df (DataFrame) –

Returns

  • numerical_vals (DataFrame)

  • categorical_vals (DataFrame)
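A minimal sketch of the split using pandas' `select_dtypes`; datto's classification logic may differ:

```python
import pandas as pd

def separate_cols_by_type(df):
    """Split a DataFrame into numerical and categorical column groups.

    Sketch: anything numeric goes in one group, everything else in the other.
    """
    numerical_vals = df.select_dtypes(include="number")
    categorical_vals = df.select_dtypes(exclude="number")
    return numerical_vals, categorical_vals

# Hypothetical example frame
profile = pd.DataFrame({"age": [30, 40], "city": ["NYC", "LA"]})
numerical_vals, categorical_vals = separate_cols_by_type(profile)
```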

violin_plots_by_col(df, path='../images/', group_by_var=None)[source]

Makes a violin plot for each numerical column.

Parameters
  • df (DataFrame) –

  • path (str) –

  • group_by_var (str) – Variable to group violin plots by

datto.Experiments module

class datto.Experiments.Experiments[source]

Bases: object

Design & run experiments

assign_condition_by_id(user_id, conditions, proportions_by_conditions, random_state)[source]

Assign a given id to the same experimental condition every time for a consistent user experience, e.g. customer #15 will always be in the treatment condition.

Parameters
  • user_id (int) –

  • conditions (numpy array) – E.g. [‘treatment’, ‘control’]

  • proportions_by_conditions (numpy array) – Should add up to 1, e.g. [0.8, 0.2]

  • random_state (int) – Divisor used for consistent assignment

Returns

chosen_condition – Chooses one of the conditions you provided

Return type

str
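Deterministic assignment can be sketched with hashing. Note datto's actual method uses `random_state` as a divisor rather than a hash, and the `salt` parameter here is hypothetical:

```python
import hashlib

def assign_condition_by_id(user_id, conditions, proportions, salt="experiment-1"):
    """Deterministically map an id to a condition in fixed proportions.

    Hash-based sketch: the same (salt, user_id) pair always produces the
    same bucket value, so assignment is stable across calls.
    """
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 10000  # roughly uniform in [0, 1)
    cumulative = 0.0
    for condition, proportion in zip(conditions, proportions):
        cumulative += proportion
        if bucket < cumulative:
            return condition
    return conditions[-1]  # guard against floating-point shortfall
```

Because assignment depends only on the id (and salt), users keep the same experience on every visit.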

datto.FrontEnd module

class datto.FrontEnd.FrontEnd[source]

Bases: object

Automatically generate HTML

dataframe_to_html(df, title='')[source]

Write an entire dataframe to an HTML file with nice formatting.

Parameters
  • df (DataFrame) –

  • title (str (optional)) –

Returns

html

Return type

str

dropdown_from_dataframe(name, df, chosen_col, width=None, class_name=None)[source]

Create text to use for rendering an HTML dropdown from a DataFrame.

Render by using {{ df|safe }} in your HTML file.

Parameters
  • name (str) – Name you’d like for the dropdown

  • df (DataFrame) –

  • chosen_col (str) – Which column’s values will populate the dropdown

  • width (str) – Width in pixels for the generated dropdown

  • class_name (str) – Name for class; used in order to create custom CSS

Returns

html_choices – String you can use to render HTML

Return type

str

fig_to_html(fig)[source]

Create HTML file from a matplotlib fig with workarounds for using inside a Flask app.

Parameters

fig (matplotlib figure) –

Returns

html

Return type

str

datto.ModelResults module

class datto.ModelResults.ModelResults[source]

Bases: object

Evaluate model performance & explore output

coefficients_graph(X_train, X_test, model, model_type, filename='shap_graph', path='../images/', multiclass=False, y_test=None)[source]

Displays graph of feature importances.

  • Number on the horizontal axis indicates magnitude of effect on the target variable (e.g. affected by 0.25)

  • Red/blue indicates feature value (whether increasing or decreasing the feature has an effect)

  • Blue & red mixed together indicate there isn’t a clear effect on the target variable

  • For classification - interpreting the magnitude number / x axis - changes the predicted probability of y on average by _ percentage points (axis value * 100)

Parameters
  • X_train (pd.DataFrame) –

  • X_test (pd.DataFrame) –

  • model (fit model object) –

  • model_type (str) – ‘classification’ or ‘regression’

  • filename (str) –

  • multiclass (bool) –

  • y_test (pd.DataFrame) – Only needed for multiclass models

coefficients_individual_predictions(model, df, X_train, X_test, id_col, num_id_examples, num_feature_examples, model_type, class_names=['False', 'True'], path='../images/')[source]

Uses LIME to inspect an individual prediction and the features that influenced that prediction.

Parameters
  • model (sklearn model) –

  • df (pd.DataFrame) – Used for getting ids since they aren’t typically in training data

  • X_train (pd.DataFrame) –

  • X_test (pd.DataFrame) –

  • id_col (str) –

  • num_id_examples (int) –

  • num_feature_examples (int) –

  • model_type (str) – ‘classification’ or ‘regression’

  • class_names (str) –

  • path (str) –

Returns

features

Return type

list

coefficients_summary(X, y, num_repetitions, num_coefficients, model_type, multiclass=False)[source]

Prints average coefficient values using a regression model.

Parameters
  • X (DataFrame) –

  • y (DataFrame) –

  • num_repetitions (int) – Number of times to create models

  • num_coefficients (int) – Number of top coefficients to display

  • model_type (str) – ‘classification’ or ‘regression’

  • multiclass (bool) –

Returns

simplified_df – Has mean, median, and standard deviation for coefficients after several runs

Return type

DataFrame

get_tree_diagram(model, X_train, path='../images/')[source]

Save a diagram of a trained DecisionTree model

Parameters
  • model (sklearn model (trained)) –

  • X_train (pd.DataFrame) –

  • path (str) –

most_common_words_by_group(X, text_col_name, group_col_name, num_examples, num_times_min, min_ngram)[source]

Get the most commons phrases for defined groups.

Parameters
  • X (DataFrame) –

  • text_col_name (str) –

  • group_col_name (str) –

  • num_examples (int) – Number of text examples to include per group

  • num_times_min (int) – Minimum number of times word/phrase must appear in texts

  • min_ngram (int) –

Returns

overall_counts_df – Has groups, top words, and counts

Return type

DataFrame
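The per-group counting idea can be sketched with the stdlib's `Counter`; the real method operates on DataFrames and supports n-grams and a minimum-count filter:

```python
from collections import Counter

def most_common_words_by_group(rows, text_col, group_col,
                               num_words=3, num_times_min=1):
    """Count top words per group from a list of dict rows.

    Stdlib sketch of the idea only; names and signature are hypothetical.
    """
    counts_by_group = {}
    for row in rows:
        counter = counts_by_group.setdefault(row[group_col], Counter())
        counter.update(row[text_col].lower().split())
    return {
        group: [(w, c) for w, c in counter.most_common(num_words)
                if c >= num_times_min]
        for group, counter in counts_by_group.items()
    }

# Hypothetical messages grouped by team
messages = [
    {"team": "a", "msg": "ship it ship it ship"},
    {"team": "a", "msg": "ship faster"},
    {"team": "b", "msg": "review the review"},
]
top_words = most_common_words_by_group(messages, "msg", "team")
```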

most_similar_texts(X, text_column_name, chosen_num_topics=None, chosen_stopwords={}, exclude_numbers=False, exclude_times=False, exclude_months=False, exclude_weekdays=False, exclude_greetings_goodbyes=False, exclude_adverbs=False, num_examples=15, min_df=3, max_df=0.1, min_ngrams=1, max_ngrams=3)[source]

Uses NMF clustering to create n topics based on adjusted word frequencies

Parameters
  • X (DataFrame) –

  • num_examples (int) –

  • text_column_name (str) –

  • chosen_num_topics (int) – Optional - if None, the algorithm will determine the best number

  • chosen_stopwords (set) – Option to add in your own unique stopwords

  • exclude_numbers (bool) – Adding numbers 0-3000 (with & without commas) as additional stopwords

  • exclude_times (bool) – Adding times as additional stopwords (e.g. 8:00)

  • exclude_months (bool) – Adding month names as additional stopwords

  • exclude_weekdays (bool) – Adding weekday names as additional stopwords

  • exclude_greetings_goodbyes (bool) – Adding common greetings & goodbyes as additional stopwords (e.g. hello)

  • exclude_adverbs (bool) – Adding common adverbs as additional stopwords (e.g. especially)

  • min_df (float) – Minimum number/proportion of docs that need to have the words

  • max_df (float) – Maximum number/proportion of docs that can have the words

  • min_ngrams (int) – Minimum number of words needed in phrases found

  • max_ngrams (int) – Maximum number of words in phrases found

Returns

  • all_topics (DataFrame) – Top n words/phrases per topic

  • original_with_keywords (DataFrame) – Original text with topic number assigned to each

  • model (NMF model)

score_final_model(model_type, X_test, y_test, trained_model, csv_file_name='final_model_results', multiclass=False)[source]

Score your model on the test dataset. Only run this once to get an idea of how your model will perform in real time. Run it after you have chosen your model & parameters to avoid problems with overfitting.

Parameters
  • model_type (str) –

  • X_test (DataFrame) –

  • y_test (DataFrame) –

  • trained_model (sklearn model) –

  • multiclass (bool) –

  • csv_file_name (str) –

Returns

  • model (model) – Fit model

  • y_predicted (array)

datto.SetupMethods module

class datto.SetupMethods.SetupMethods[source]

Bases: object

Set up coding environment

display_more_data(num_to_display)[source]

Overrides Pandas and NumPy settings to display a larger amount of data instead of only a subset.

Parameters

num_to_display (int) – How many rows/columns to display

setup_logger()[source]
Returns

logger

Return type

jsonlogger

datto.TrainModel module

class datto.TrainModel.TrainModel[source]

Bases: object

Select & train models

model_testing(X_train, y_train, model_type, tie_breaker_scoring_method, save_to_csv=True, file_name='gridsearching_results', multiclass=False)[source]

Grid searches a list of commonly used models to find the best model and parameters.

Parameters
  • X_train (DataFrame) –

  • y_train (DataFrame) –

  • model_type (str) – ‘classification’ or ‘regression’

  • tie_breaker_scoring_method (str) – For classification: “precision”, “recall”, or “f1” For regression: “neg_root_mean_squared_error”, “neg_median_absolute_error”, or “r2”

  • save_to_csv (bool) –

  • file_name (str) –

  • multiclass (bool) –

Returns

best_params

Return type

dict

run_feature_selection(X_train, y_train, k, is_multiclass)[source]

Run SelectKBest feature selection for given datasets. Implements a custom method of feature selection for multiclass targets.

Parameters
  • X_train (DataFrame) –

  • y_train (DataFrame) –

  • k (int) –

  • is_multiclass (bool) –

Returns

cols_to_keep

Return type

list

train_in_chunks(X_train, y_train, model_type, is_multiclass, chunk_sizes=500000)[source]

For large datasets, train the model in manageable chunk sizes.

Parameters
  • X_train (DataFrame) –

  • y_train (DataFrame) –

  • model_type (str) – ‘classification’ or ‘regression’

  • is_multiclass (bool) –

  • chunk_sizes (int) –

Returns

model

Return type

sklearn model

train_test_split_by_ids(df, id_col, target_col, prop_train)[source]

Split into train & test sets by unique ids, so that rows for a given id never appear in both sets.

Parameters
  • df (DataFrame) –

  • id_col (str) –

  • target_col (str) –

  • prop_train (float) –

Returns

  • X_train (DataFrame)

  • y_train (DataFrame)

  • X_test (DataFrame)

  • y_test (DataFrame)
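Splitting by ids (so each id lands entirely in train or test) can be sketched as follows; the `random_state` default is an added assumption for reproducibility:

```python
import pandas as pd

def train_test_split_by_ids(df, id_col, target_col, prop_train, random_state=42):
    """Split so each id's rows land entirely in train or entirely in test.

    Minimal sketch; datto's actual implementation may differ.
    """
    unique_ids = pd.Series(df[id_col].unique())
    train_ids = unique_ids.sample(frac=prop_train, random_state=random_state)
    train_mask = df[id_col].isin(train_ids)
    train, test = df[train_mask], df[~train_mask]
    X_train = train.drop(columns=[target_col])
    y_train = train[target_col]
    X_test = test.drop(columns=[target_col])
    y_test = test[target_col]
    return X_train, y_train, X_test, y_test

# Hypothetical event data: two rows per user
events = pd.DataFrame({"user_id": [1, 1, 2, 2, 3, 3, 4, 4],
                       "x": range(8), "y": [0, 1] * 4})
X_train, y_train, X_test, y_test = train_test_split_by_ids(
    events, "user_id", "y", prop_train=0.5)
```

Splitting at the id level prevents leakage when the same user appears in multiple rows.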