datto package

datto.CleanDataframe module

class datto.CleanDataframe.CleanDataframe[source]

Bases: object

Clean data using NLP, regex, calculations, etc.

batch_merge_operation(df_1, df_2, num_splits, identifier_col, merge_col)[source]

Merge two Pandas DataFrames in chunks for faster processing.

Parameters
  • df_1 (DataFrame) –

  • df_2 (DataFrame) –

  • num_splits (int) –

  • identifier_col (str) –

  • merge_col (str) –

Returns

new_df

Return type

DataFrame
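The chunked-merge idea can be sketched as follows. This is a hypothetical re-implementation (function and variable names are my own), not datto's actual code:

```python
import numpy as np
import pandas as pd

def batch_merge(df_1, df_2, num_splits, identifier_col, merge_col):
    """Merge df_2 onto df_1 in pieces, splitting df_1 by its identifier column.

    Sketch of the chunked approach; the real datto method may differ in details.
    """
    # Split the unique identifiers into num_splits roughly equal groups
    chunks = np.array_split(df_1[identifier_col].unique(), num_splits)
    merged_parts = []
    for chunk_ids in chunks:
        part = df_1[df_1[identifier_col].isin(chunk_ids)]
        merged_parts.append(part.merge(df_2, on=merge_col, how="left"))
    return pd.concat(merged_parts, ignore_index=True)

# Small demonstration with hypothetical data
orders = pd.DataFrame({"user_id": [1, 2, 3, 4], "plan": ["a", "b", "a", "b"]})
plans = pd.DataFrame({"plan": ["a", "b"], "price": [10, 20]})
merged = batch_merge(orders, plans, num_splits=2,
                     identifier_col="user_id", merge_col="plan")
```

Merging chunk by chunk keeps peak memory lower than one large merge, at the cost of a concat at the end.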

batch_pandas_operation(df, num_splits, identifier_col, func)[source]

Use a function on a Pandas DataFrame in chunks for faster processing.

Parameters
  • df (DataFrame) –

  • num_splits (int) –

  • identifier_col (str) –

  • func (Function) –

Returns

new_df

Return type

DataFrame

clean_column_names(df)[source]

Rename all columns to use underscores so they can be referenced without bracket formatting.

Parameters

df (DataFrame) –

Returns

df

Return type

DataFrame
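The renaming rule can be illustrated with a small sketch. The exact set of characters datto replaces is an assumption here:

```python
import re

def clean_column_name(name):
    """Normalize one column name: lowercase, strip whitespace,
    and collapse runs of non-alphanumeric characters into underscores.

    Illustrative sketch only; datto's actual rule may differ.
    """
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    return cleaned.strip("_").lower()
```

After cleaning, `df["First Name"]` becomes accessible as `df.first_name`.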

compress_df(df)[source]

Compresses each dataframe column as much as possible depending on type and values.

Parameters

df (DataFrame) –

Returns

df

Return type

DataFrame

fill_nulls_using_regression_model(X_train, X_test)[source]

Trains a regression model on non-null data and predicts values to fill in nulls.

Parameters
  • X_train (pd.DataFrame) –

  • X_test (pd.DataFrame) –

Returns

  • X_train (pd.DataFrame)

  • X_test (pd.DataFrame)

fix_col_data_type(df, col, desired_dt)[source]

Change column datatype using the best method for each type.

Parameters
  • df (DataFrame) –

  • col (str) – Column to change the dtype for

  • desired_dt (str) – {‘float’, ‘int’, ‘datetime’, ‘str’}

Returns

df

Return type

DataFrame
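A minimal sketch of per-dtype conversion using pandas' `to_datetime`/`to_numeric` helpers; datto's actual handling may cover more edge cases:

```python
import pandas as pd

def fix_col_data_type(df, col, desired_dt):
    """Convert a column using the appropriate pandas helper per target dtype.

    Sketch of the idea, not datto's exact implementation.
    """
    if desired_dt == "datetime":
        df[col] = pd.to_datetime(df[col], errors="coerce")
    elif desired_dt in ("float", "int"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
        if desired_dt == "int":
            # Nullable integer dtype keeps NaNs from coercion failures
            df[col] = df[col].astype("Int64")
    elif desired_dt == "str":
        df[col] = df[col].astype(str)
    return df

# Hypothetical example: one value cannot be parsed and becomes null
df = fix_col_data_type(pd.DataFrame({"amount": ["1", "2", "oops"]}),
                       "amount", "int")
```

Using `errors="coerce"` trades silent nulls for not raising on bad values, which suits bulk cleaning.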

lematize(text)[source]

Lemmatize text using spaCy.

Parameters

text (str) –

Return type

list of spacy tokens

make_uuid(id_num)[source]

Make a UUID from a text string.

Parameters

id_num (str) –

Returns

uuid

Return type

str
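A deterministic UUID can be derived from a text string with the standard library's name-based `uuid5`; whether datto uses `uuid5` specifically is an assumption:

```python
import uuid

def make_uuid(id_num):
    """Derive a deterministic UUID string from any text id.

    Sketch using uuid5 (name-based, SHA-1); the same input always
    yields the same UUID, which is the property the docstring implies.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_OID, str(id_num)))
```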

remove_duplicate_columns(df)[source]

Remove columns with the same name

Parameters

df (DataFrame) –

Returns

df

Return type

DataFrame

remove_email_greetings_signatures(text)[source]

In order to obtain only the main text of an email, this method removes greetings, signoffs, and signatures by dropping sentences in which fewer than 5% of the words are verbs. Does not replace links.

Inspiration from: https://github.com/mynameisvinn/EmailParser

Parameters

text (str) –

Returns

text

Return type

str

remove_links(text)[source]

Parameters

text (str) –

Returns

cleaned_text

Return type

str

remove_names(text)[source]
Parameters

text (str) –

Returns

cleaned_text

Return type

str

remove_pii(text)[source]

Remove common patterns of personally identifiable information (PII).

Parameters

text (str) –

Returns

cleaned_text

Return type

str
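The general approach can be sketched with a few illustrative regexes. These patterns are examples, not datto's actual ones:

```python
import re

# Illustrative PII patterns -- NOT datto's actual regexes
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),            # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US phone numbers
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),              # SSN-like numbers
]

def remove_pii(text):
    """Replace each matched PII pattern with a redaction marker."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

redacted = remove_pii("Email jane@example.com or call 555-123-4567.")
```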

datto.DataConnections module

class datto.DataConnections.NotebookConnections[source]

Bases: object

Convert between Jupyter notebooks & Python scripts

save_as_notebook(file_path)[source]

Opens a Python script and saves as a Jupyter notebook.

Parameters

file_path (str) –

save_as_script(file_path)[source]

Opens a Jupyter notebook file, cleans it, and saves as a Python script.

Parameters

file_path (str) –

class datto.DataConnections.S3Connections[source]

Bases: object

Interact with S3

load_from_s3(directory_path, object_name)[source]

Load a pickled object from s3. Note: The pickle module is not secure. Only unpickle data you trust/saved yourself.

Parameters
  • directory_path (str) – Starts with bucket name, slash any subdirectories

  • object_name (str) –

Return type

saved_object

save_to_s3(directory_path, object_to_save, object_name)[source]

Pickle and save an object to s3. Creates the folder specified if it does not yet exist.

Parameters
  • directory_path (str) – Starts with bucket name, slash any subdirectories

  • object_to_save (any object with a type that can be pickled) –

  • object_name (str) –

Return type

None

class datto.DataConnections.SQLConnections(dbname=None, host=None, port=None, user=None, password=None)[source]

Bases: object

Connect with a SQL database

run_sql_redshift(query)[source]

Pandas doesn’t integrate with Redshift directly. Instead use psycopg2 to connect and transform results into a DataFrame manually.

Parameters

query (str) –

Returns

df

Return type

DataFrame
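The cursor-to-DataFrame pattern described above can be demonstrated with the stdlib's sqlite3 standing in for psycopg2 (the DB-API cursor interface is the same); `query_to_df` is a hypothetical helper:

```python
import sqlite3
import pandas as pd

def query_to_df(conn, query):
    """Run a query through a DB-API cursor and build a DataFrame manually,
    mirroring the pattern the Redshift method describes."""
    cursor = conn.cursor()
    cursor.execute(query)
    # cursor.description holds one tuple per column; name is the first field
    columns = [desc[0] for desc in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)

# In-memory database for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "grace")])
df = query_to_df(conn, "SELECT id, name FROM users ORDER BY id")
```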

class datto.DataConnections.SlackConnections(slack_api_token=None)[source]

Bases: object

Retrieve Slack messages

get_messages(channels, remove_bot_messages=True, excluded_user_ids=[], messages_limit=inf)[source]

Get messages from one or more Slack channels.

Parameters
  • channels (list) –

  • remove_bot_messages (bool) –

  • excluded_user_ids (list) –

  • messages_limit (int) – Default is to fetch all messages

Returns

df

Return type

DataFrame

datto.Eda module

class datto.Eda.Eda[source]

Bases: object

Exploratory data analysis (EDA)

bar_graphs_by_col(df, path='../images/', group_by_var=None)[source]

Makes a bar graph for each categorical column.

Parameters
  • df (DataFrame) –

  • path (str) –

  • group_by_var (str) – Variable to group bar graphs by

check_for_mistyped_cols(numerical_vals, categorical_vals)[source]

Check for columns coded incorrectly

Parameters
  • numerical_vals (list) –

  • categorical_vals (list) –

Returns

mistyped_cols

Return type

list

check_unique_by_identifier_col(df, identifier_col)[source]

Check if there are duplicates by entity (e.g. user, item).

Parameters
  • df (DataFrame) –

  • identifier_col (str) –

Returns

dup_rows

Return type

DataFrame

find_cols_to_exclude(df)[source]

Returns columns that may not be helpful for model building.

Exclusion criteria:

  • Possible PII (address, name, username, date, etc. in col name)

  • Large proportion of nulls

  • Only 1 value in entire col

  • Dates

  • Low variance in col values

  • Large number of categorical values

Parameters

df (DataFrame) –

Returns

lst

Return type

list

find_correlated_features(df)[source]

Find & sort correlated features

Parameters

df (DataFrame) –

Returns

s

Return type

Series

sample_unique_vals(df)[source]

Examine a few unique vals in each column

Parameters

df (DataFrame) –

separate_cols_by_type(df)[source]

Split the DataFrame into two groups by type

Parameters

df (DataFrame) –

Returns

  • numerical_vals (DataFrame)

  • categorical_vals (DataFrame)
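A minimal sketch of the split using pandas' `select_dtypes`; datto's classification logic may differ:

```python
import pandas as pd

def separate_cols_by_type(df):
    """Split a DataFrame into numerical and categorical column groups.

    Sketch: anything numeric goes in one group, everything else in the other.
    """
    numerical_vals = df.select_dtypes(include="number")
    categorical_vals = df.select_dtypes(exclude="number")
    return numerical_vals, categorical_vals

# Hypothetical example frame
profile = pd.DataFrame({"age": [30, 40], "city": ["NYC", "LA"]})
numerical_vals, categorical_vals = separate_cols_by_type(profile)
```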

violin_plots_by_col(df, path='../images/', group_by_var=None)[source]

Makes a violin plot for each numerical column.

Parameters
  • df (DataFrame) –

  • path (str) –

  • group_by_var (str) – Variable to group violin plots by

datto.Experiments module

class datto.Experiments.Experiments[source]

Bases: object

Design & run experiments

assign_condition_by_id(user_id, conditions, proportions_by_conditions, random_state)[source]

Assign a given id to the same experimental condition every time for a consistent user experience, e.g. customer #15 will always be in the treatment condition.

Parameters
  • user_id (int) –

  • conditions (numpy array) – E.g. [‘treatment’, ‘control’]

  • proportions_by_conditions (numpy array) – Should add up to 1, e.g. [0.8, 0.2]

  • random_state (int) – Divisor used for consistent assignment

Returns

chosen_condition – Chooses one of the conditions you provided

Return type

str
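Deterministic assignment can be sketched with hashing. Note datto's actual method uses `random_state` as a divisor rather than a hash, and the `salt` parameter here is hypothetical:

```python
import hashlib

def assign_condition_by_id(user_id, conditions, proportions, salt="experiment-1"):
    """Deterministically map an id to a condition in fixed proportions.

    Hash-based sketch: the same (salt, user_id) pair always produces the
    same bucket value, so assignment is stable across calls.
    """
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 10000  # roughly uniform in [0, 1)
    cumulative = 0.0
    for condition, proportion in zip(conditions, proportions):
        cumulative += proportion
        if bucket < cumulative:
            return condition
    return conditions[-1]  # guard against floating-point shortfall
```

Because assignment depends only on the id (and salt), users keep the same experience on every visit.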

datto.FrontEnd module

class datto.FrontEnd.FrontEnd[source]

Bases: object

Automatically generate HTML

dataframe_to_html(df, title='')[source]

Write an entire dataframe to an HTML file with nice formatting.

Parameters
  • df (DataFrame) –

  • title (str (optional)) –

Returns

html

Return type

str

dropdown_from_dataframe(name, df, chosen_col, width=None, class_name=None)[source]

Create text to use for rendering an HTML dropdown from a DataFrame.

Render by using {{ df|safe }} in your HTML file.

Parameters
  • name (str) – Name you’d like for the dropdown

  • df (DataFrame) –

  • chosen_col (str) – Which column’s values will populate the dropdown

  • width (str) – Width in pixels for the generated dropdown

  • class_name (str) – Name for class; used in order to create custom CSS

Returns

html_choices – String you can use to render HTML

Return type

str

fig_to_html(fig)[source]

Create HTML file from a matplotlib fig with workarounds for using inside a Flask app.

Parameters

fig (matplotlib figure) –

Returns

html

Return type

str

datto.ModelResults module

class datto.ModelResults.ModelResults[source]

Bases: object

Evaluate model performance & explore output

coefficients_graph(X_train, X_test, model, model_type, filename='shap_graph', path='../images/', multiclass=False, y_test=None)[source]

Displays graph of feature importances.

  • Number on the horizontal axis indicates magnitude of effect on the target variable (e.g. affected by 0.25)

  • Red/blue indicates feature value (whether increasing or decreasing the feature has an effect)

  • Blue & red mixed together indicate there isn’t a clear effect on the target variable

  • For classification - interpreting the magnitude number / x axis - changes the predicted probability of y on average by _ percentage points (axis value * 100)

Parameters
  • X_train (pd.DataFrame) –

  • X_test (pd.DataFrame) –

  • model (fit model object) –

  • model_type (str) – ‘classification’ or ‘regression’

  • filename (str) –

  • multiclass (bool) –

  • y_test (pd.DataFrame) – Only needed for multiclass models

coefficients_individual_predictions(model, df, X_train, X_test, id_col, num_id_examples, num_feature_examples, model_type, class_names=['False', 'True'], path='../images/')[source]

Uses LIME to inspect an individual prediction and the features that influenced that prediction.

Parameters
  • model (sklearn model) –

  • df (pd.DataFrame) – Used for getting ids since they aren’t typically in training data

  • X_train (pd.DataFrame) –

  • X_test (pd.DataFrame) –

  • id_col (str) –

  • num_id_examples (int) –

  • num_feature_examples (int) –

  • model_type (str) – ‘classification’ or ‘regression’

  • class_names (str) –

  • path (str) –

Returns

features

Return type

list

coefficients_summary(X, y, num_repetitions, num_coefficients, model_type, multiclass=False)[source]

Prints average coefficient values using a regression model.

Parameters
  • X (DataFrame) –

  • y (DataFrame) –

  • num_repetitions (int) – Number of times to create models

  • num_coefficients (int) – Number of top coefficients to display

  • model_type (str) – ‘classification’ or ‘regression’

  • multiclass (bool) –

Returns

simplified_df – Has mean, median, and standard deviation for coefficients after several runs

Return type

DataFrame

get_tree_diagram(model, X_train, path='../images/')[source]

Save a diagram of a trained DecisionTree model

Parameters
  • model (sklearn model (trained)) –

  • X_train (pd.DataFrame) –

  • path (str) –

most_common_words_by_group(X, text_col_name, group_col_name, num_examples, num_times_min, min_ngram)[source]

Get the most commons phrases for defined groups.

Parameters
  • X (DataFrame) –

  • text_col_name (str) –

  • group_col_name (str) –

  • num_examples (int) – Number of text examples to include per group

  • num_times_min (int) – Minimum number of times word/phrase must appear in texts

  • min_ngram (int) –

Returns

overall_counts_df – Has groups, top words, and counts

Return type

DataFrame
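The per-group counting idea can be sketched with the stdlib's `Counter`; the real method operates on DataFrames and supports n-grams and a minimum-count filter:

```python
from collections import Counter

def most_common_words_by_group(rows, text_col, group_col,
                               num_words=3, num_times_min=1):
    """Count top words per group from a list of dict rows.

    Stdlib sketch of the idea only; names and signature are hypothetical.
    """
    counts_by_group = {}
    for row in rows:
        counter = counts_by_group.setdefault(row[group_col], Counter())
        counter.update(row[text_col].lower().split())
    return {
        group: [(w, c) for w, c in counter.most_common(num_words)
                if c >= num_times_min]
        for group, counter in counts_by_group.items()
    }

# Hypothetical messages grouped by team
messages = [
    {"team": "a", "msg": "ship it ship it ship"},
    {"team": "a", "msg": "ship faster"},
    {"team": "b", "msg": "review the review"},
]
top_words = most_common_words_by_group(messages, "msg", "team")
```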

most_similar_texts(X, text_column_name, chosen_num_topics=None, chosen_stopwords={}, exclude_numbers=False, exclude_times=False, exclude_months=False, exclude_weekdays=False, exclude_greetings_goodbyes=False, exclude_adverbs=False, num_examples=15, min_df=3, max_df=0.1, min_ngrams=1, max_ngrams=3)[source]

Uses NMF clustering to create n topics based on adjusted word frequencies

Parameters
  • X (DataFrame) –

  • num_examples (int) –

  • text_column_name (str) –

  • chosen_num_topics (int) – Optional - if None, the algorithm will determine the best number

  • chosen_stopwords (set) – Option to add in your own unique stopwords

  • exclude_numbers (bool) – Adding numbers 0-3000 (with & without commas) as additional stopwords

  • exclude_times (bool) – Adding times as additional stopwords (e.g. 8:00)

  • exclude_months (bool) – Adding month names as additional stopwords

  • exclude_weekdays (bool) – Adding weekday names as additional stopwords

  • exclude_greetings_goodbyes (bool) – Adding common greetings & goodbyes as additional stopwords (e.g. hello)

  • exclude_adverbs (bool) – Adding common adverbs as additional stopwords (e.g. especially)

  • min_df (float) – Minimum number/proportion of docs that need to have the words

  • max_df (float) – Maximum number/proportion of docs that can have the words

  • min_ngrams (int) – Minimum number of words needed in phrases found

  • max_ngrams (int) – Maximum number of words in phrases found

Returns

  • all_topics (DataFrame) – Top n words/phrases per topic

  • original_with_keywords (DataFrame) – Original text with topic number assigned to each

  • model (NMF model)

score_final_model(model_type, X_test, y_test, trained_model, csv_file_name='final_model_results', multiclass=False)[source]

Score your model on the test dataset. Only run this once to get an idea of how your model will perform in real time. Run it after you have chosen your model & parameters to avoid problems with overfitting.

Parameters
  • model_type (str) –

  • X_test (DataFrame) –

  • y_test (DataFrame) –

  • trained_model (sklearn model) –

  • multiclass (bool) –

  • csv_file_name (str) –

Returns

  • model (model) – Fit model

  • y_predicted (array)

datto.SetupMethods module

class datto.SetupMethods.SetupMethods[source]

Bases: object

Set up coding environment

display_more_data(num_to_display)[source]

Overrides Pandas and NumPy settings to display a larger amount of data instead of only a subset.

Parameters

num_to_display (int) – How many rows/columns to display

setup_logger()[source]
Returns

logger

Return type

jsonlogger

datto.TrainModel module

class datto.TrainModel.TrainModel[source]

Bases: object

Select & train models

model_testing(X_train, y_train, model_type, tie_breaker_scoring_method, save_to_csv=True, file_name='gridsearching_results', multiclass=False)[source]

Grid searches a list of commonly used models to find the best model and parameters.

Parameters
  • X_train (DataFrame) –

  • y_train (DataFrame) –

  • model_type (str) – ‘classification’ or ‘regression’

  • tie_breaker_scoring_method (str) – For classification: “precision”, “recall”, or “f1” For regression: “neg_root_mean_squared_error”, “neg_median_absolute_error”, or “r2”

  • save_to_csv (bool) –

  • file_name (str) –

  • multiclass (bool) –

Returns

best_params

Return type

dict

run_feature_selection(X_train, y_train, k, is_multiclass)[source]

Run SelectKBest feature selection for given datasets. Implements a custom method of feature selection for multiclass targets.

Parameters
  • X_train (DataFrame) –

  • y_train (DataFrame) –

  • k (int) –

  • is_multiclass (bool) –

Returns

cols_to_keep

Return type

list

train_in_chunks(X_train, y_train, model_type, is_multiclass, chunk_sizes=500000)[source]

For large datasets, train the model in manageable chunk sizes.

Parameters
  • X_train (DataFrame) –

  • y_train (DataFrame) –

  • model_type (str) – ‘classification’ or ‘regression’

  • is_multiclass (bool) –

  • chunk_sizes (int) –

Returns

model

Return type

sklearn model

train_test_split_by_ids(df, id_col, target_col, prop_train)[source]

Split into train & test sets by unique ids, so that rows for a given id never appear in both sets.

Parameters
  • df (DataFrame) –

  • id_col (str) –

  • target_col (str) –

  • prop_train (float) –

Returns

  • X_train (DataFrame)

  • y_train (DataFrame)

  • X_test (DataFrame)

  • y_test (DataFrame)
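Splitting by ids (so each id lands entirely in train or test) can be sketched as follows; the `random_state` default is an added assumption for reproducibility:

```python
import pandas as pd

def train_test_split_by_ids(df, id_col, target_col, prop_train, random_state=42):
    """Split so each id's rows land entirely in train or entirely in test.

    Minimal sketch; datto's actual implementation may differ.
    """
    unique_ids = pd.Series(df[id_col].unique())
    train_ids = unique_ids.sample(frac=prop_train, random_state=random_state)
    train_mask = df[id_col].isin(train_ids)
    train, test = df[train_mask], df[~train_mask]
    X_train = train.drop(columns=[target_col])
    y_train = train[target_col]
    X_test = test.drop(columns=[target_col])
    y_test = test[target_col]
    return X_train, y_train, X_test, y_test

# Hypothetical event data: two rows per user
events = pd.DataFrame({"user_id": [1, 1, 2, 2, 3, 3, 4, 4],
                       "x": range(8), "y": [0, 1] * 4})
X_train, y_train, X_test, y_test = train_test_split_by_ids(
    events, "user_id", "y", prop_train=0.5)
```

Splitting at the id level prevents leakage when the same user appears in multiple rows.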