datto package
datto.CleanDataframe module
- class datto.CleanDataframe.CleanDataframe[source]
Bases:
object
Clean data using NLP, regex, calculations, etc.
- batch_merge_operation(df_1, df_2, num_splits, identifier_col, merge_col)[source]
Merge two pandas DataFrames in chunks for faster processing.
- Parameters
df_1 (DataFrame) –
df_2 (DataFrame) –
num_splits (int) –
identifier_col (str) –
merge_col (str) –
- Returns
new_df
- Return type
DataFrame
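A minimal sketch of how such a chunked merge could work. The helper name and the splitting strategy (numpy.array_split over unique identifier values) are illustrative assumptions, not datto's actual implementation:

```python
import numpy as np
import pandas as pd

def batch_merge(df_1, df_2, num_splits, identifier_col, merge_col):
    # Split the identifier values into roughly equal chunks, merge each
    # chunk separately, then stitch the results back together.
    chunks = np.array_split(df_1[identifier_col].unique(), num_splits)
    merged = [
        df_1[df_1[identifier_col].isin(chunk)].merge(df_2, on=merge_col)
        for chunk in chunks
    ]
    return pd.concat(merged, ignore_index=True)

left = pd.DataFrame({"user_id": [1, 2, 3, 4], "key": ["a", "b", "a", "b"]})
right = pd.DataFrame({"key": ["a", "b"], "value": [10, 20]})
result = batch_merge(left, right, num_splits=2, identifier_col="user_id", merge_col="key")
```

Merging smaller pieces keeps peak memory lower than a single large merge.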
- batch_pandas_operation(df, num_splits, identifier_col, func)[source]
Apply a function to a pandas DataFrame in chunks for faster processing.
- Parameters
df (DataFrame) –
num_splits (int) –
identifier_col (str) –
func (Function) –
- Returns
new_df
- Return type
DataFrame
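The same chunking idea applies to an arbitrary transformation function. This is a hedged sketch (the function name and splitting approach are assumptions):

```python
import numpy as np
import pandas as pd

def batch_apply(df, num_splits, identifier_col, func):
    # Apply `func` to each chunk of rows (grouped by identifier values)
    # and concatenate the transformed chunks back into one DataFrame.
    chunks = np.array_split(df[identifier_col].unique(), num_splits)
    pieces = [func(df[df[identifier_col].isin(chunk)]) for chunk in chunks]
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({"user_id": [1, 2, 3, 4], "x": [1.0, 2.0, 3.0, 4.0]})
doubled = batch_apply(df, 2, "user_id", lambda chunk: chunk.assign(x=chunk["x"] * 2))
```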
- clean_column_names(df)[source]
Rename all columns to use underscores so they can be referenced without bracket formatting
- Parameters
df (DataFrame) –
- Returns
df
- Return type
DataFrame
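One plausible way to implement this renaming; the exact regex datto uses is an assumption:

```python
import re
import pandas as pd

def clean_column_names(df):
    # Lowercase each name and replace runs of non-alphanumeric characters
    # with a single underscore, so columns work with df.col_name access.
    df = df.copy()
    df.columns = [
        re.sub(r"[^0-9a-zA-Z]+", "_", str(col)).strip("_").lower()
        for col in df.columns
    ]
    return df

df = pd.DataFrame({"First Name": ["a"], "Total $ (USD)": [1]})
cleaned = clean_column_names(df)
```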
- compress_df(df)[source]
Compress each DataFrame column as much as possible depending on its type and values.
- Parameters
df (DataFrame) –
- Returns
df
- Return type
DataFrame
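A sketch of type-dependent compression using pandas downcasting; the exact heuristics (especially the cardinality threshold for categories) are assumptions:

```python
import pandas as pd

def compress_df(df):
    # Downcast numeric columns to the smallest dtype that fits the data,
    # and convert low-cardinality object columns to "category" dtype.
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif df[col].dtype == object and df[col].nunique() < len(df):
            df[col] = df[col].astype("category")
    return df

df = pd.DataFrame({"n": [1, 2, 3], "f": [0.5, 1.5, 2.5], "c": ["a", "a", "b"]})
small = compress_df(df)
```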
- fill_nulls_using_regression_model(X_train, X_test)[source]
Train a regression model on non-null data and predict values to fill in nulls
- Parameters
X_train (pd.DataFrame) –
X_test (pd.DataFrame) –
- Returns
X_train (pd.DataFrame)
X_test (pd.DataFrame)
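For illustration, a single-column version of this idea. datto's method takes only the two frames, so the `target_col` parameter and the use of LinearRegression are assumptions for the sketch:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fill_nulls_with_regression(X_train, X_test, target_col):
    # Fit a regression on rows where `target_col` is present, using the
    # other columns as predictors, then predict the missing values.
    features = [c for c in X_train.columns if c != target_col]
    known = X_train[X_train[target_col].notna()]
    model = LinearRegression().fit(known[features], known[target_col])
    for df in (X_train, X_test):
        mask = df[target_col].isna()
        if mask.any():
            df.loc[mask, target_col] = model.predict(df.loc[mask, features])
    return X_train, X_test

X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, np.nan, 8.0]})
X_test = pd.DataFrame({"a": [5.0], "b": [np.nan]})
X_train, X_test = fill_nulls_with_regression(X_train, X_test, "b")
```

Fitting on the training frame only avoids leaking test information into the imputation model.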
- fix_col_data_type(df, col, desired_dt)[source]
Change column datatype using the best method for each type.
- Parameters
df (DataFrame) –
col (str) – Column to change the dtype for
desired_dt (str) – {‘float’, ‘int’, ‘datetime’, ‘str’}
- Returns
df
- Return type
DataFrame
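A sketch of per-dtype conversion; whether datto uses these exact routines is an assumption, but `pd.to_numeric`/`pd.to_datetime` with `errors="coerce"` are the usual choices because they turn unparseable values into NaN/NaT instead of raising:

```python
import pandas as pd

def fix_col_data_type(df, col, desired_dt):
    # Pick the conversion routine best suited to the target dtype.
    df = df.copy()
    if desired_dt in ("float", "int"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
        if desired_dt == "int":
            df[col] = df[col].astype("Int64")  # nullable integer dtype
    elif desired_dt == "datetime":
        df[col] = pd.to_datetime(df[col], errors="coerce")
    elif desired_dt == "str":
        df[col] = df[col].astype(str)
    return df

df = pd.DataFrame({"when": ["2023-01-15", "2023-02-01"], "amount": ["3", "oops"]})
df = fix_col_data_type(df, "when", "datetime")
df = fix_col_data_type(df, "amount", "int")
```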
- make_uuid(id_num)[source]
Make a UUID from a text string
- Parameters
id_num (str) –
- Returns
uuid
- Return type
str
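A deterministic UUID can be derived from a string with the standard library; whether datto uses `uuid5` under the hood is an assumption:

```python
import uuid

def make_uuid(id_num):
    # uuid5 hashes the input with SHA-1, so the same string always
    # yields the same UUID (useful for stable anonymized identifiers).
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(id_num)))

first = make_uuid("customer-15")
second = make_uuid("customer-15")
```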
- remove_duplicate_columns(df)[source]
Remove columns with the same name
- Parameters
df (DataFrame) –
- Returns
df
- Return type
DataFrame
- remove_email_greetings_signatures(text)[source]
In order to obtain only the main text of an email, this method removes greetings, signoffs, and signatures by dropping sentences in which less than 5% of the words are verbs. Does not replace links.
Inspiration from: https://github.com/mynameisvinn/EmailParser
- Parameters
text (str) –
- Returns
text
- Return type
str
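As a rough stand-in for the verb-proportion heuristic described above, a pattern-based filter illustrates the idea. The real method uses part-of-speech tagging; the word list here is purely illustrative:

```python
import re

# Drop lines that start with common greeting or signoff words.
PATTERNS = re.compile(
    r"^\s*(hi|hello|hey|dear|best|regards|thanks|cheers|sincerely)\b",
    re.IGNORECASE,
)

def strip_greetings_signatures(text):
    lines = [ln for ln in text.splitlines() if ln.strip() and not PATTERNS.match(ln)]
    return "\n".join(lines)

email = "Hi Sam,\nThe deploy finished at noon without errors.\nBest,\nAlex"
body = strip_greetings_signatures(email)
```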
datto.DataConnections module
- class datto.DataConnections.NotebookConnections[source]
Bases:
object
Convert between Jupyter notebooks & Python scripts
- class datto.DataConnections.S3Connections[source]
Bases:
object
Interact with S3
- load_from_s3(directory_path, object_name)[source]
Load a pickled object from s3. Note: The pickle module is not secure. Only unpickle data you trust/saved yourself.
- Parameters
directory_path (str) – Starts with bucket name, slash any subdirectories
object_name (str) –
- Returns
saved_object
- save_to_s3(directory_path, object_to_save, object_name)[source]
Pickle and save an object to s3. Creates the folder specified if it does not yet exist.
- Parameters
directory_path (str) – Starts with bucket name, slash any subdirectories
object_to_save (any object with a type that can be pickled) –
object_name (str) –
- Return type
None
- class datto.DataConnections.SQLConnections(dbname=None, host=None, port=None, user=None, password=None)[source]
Bases:
object
Connect with a SQL database
- class datto.DataConnections.SlackConnections(slack_api_token=None)[source]
Bases:
object
Retrieve Slack messages
- get_messages(channels, remove_bot_messages=True, excluded_user_ids=[], messages_limit=inf)[source]
Get messages from one or more Slack channels
- Parameters
channels (list) –
remove_bot_messages (bool) –
excluded_user_ids (list) –
messages_limit (int) – Default is to fetch all messages
- Returns
df
- Return type
DataFrame
datto.Eda module
- class datto.Eda.Eda[source]
Bases:
object
Exploratory data analysis (EDA)
- bar_graphs_by_col(df, path='../images/', group_by_var=None)[source]
Makes a bar graph for each categorical column.
- Parameters
df (DataFrame) –
path (str) –
group_by_var (str) – Variable to group bar graphs by
- check_for_mistyped_cols(numerical_vals, categorical_vals)[source]
Check for columns coded incorrectly
- Parameters
numerical_vals (list) –
categorical_vals (list) –
- Returns
mistyped_cols
- Return type
list
- check_unique_by_identifier_col(df, identifier_col)[source]
Check if there are duplicates by entity (e.g. user, item).
- Parameters
df (DataFrame) –
identifier_col (str) –
- Returns
dup_rows
- Return type
DataFrame
- find_cols_to_exclude(df)[source]
Returns columns that may not be helpful for model building.
Exclusion criteria:
- Possible PII (address, name, username, date, etc. in col name)
- Large proportion of nulls
- Only 1 value in entire col
- Dates
- Low variance in col values
- Large number of categorical values
- Parameters
df (DataFrame) –
- Returns
lst
- Return type
list
Find & sort correlated features
- Parameters
df (DataFrame) –
- Returns
s
- Return type
Series
- sample_unique_vals(df)[source]
Examine a few unique vals in each column
- Parameters
df (DataFrame) –
datto.Experiments module
- class datto.Experiments.Experiments[source]
Bases:
object
Design & run experiments
- assign_condition_by_id(user_id, conditions, proportions_by_conditions, random_state)[source]
Assign a given id to the same experimental condition every time for a consistent user experience, e.g. customer #15 will always be in the treatment condition.
- Parameters
user_id (int) –
conditions (numpy array) – E.g. [‘treatment’, ‘control’]
proportions_by_conditions (numpy array) – Should add up to 1, e.g. [0.8, 0.2]
random_state (int) – Divisor used for consistent assignment
- Returns
chosen_condition – Chooses one of the conditions you provided
- Return type
str
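One way to get deterministic assignment is to hash the id into [0, 1) and map it through the cumulative condition proportions. The hashing scheme below is an assumption (datto documents its `random_state` as a divisor, which suggests a modulo-based approach):

```python
import hashlib
import numpy as np

def assign_condition_by_id(user_id, conditions, proportions, random_state):
    # Hash the id together with random_state so assignment is deterministic:
    # the same id always lands in the same condition bucket.
    digest = hashlib.md5(f"{user_id}-{random_state}".encode()).hexdigest()
    position = int(digest, 16) % 10_000 / 10_000  # uniform-ish in [0, 1)
    cutoffs = np.cumsum(proportions)
    return conditions[int(np.searchsorted(cutoffs, position, side="right"))]

conditions = np.array(["treatment", "control"])
proportions = np.array([0.8, 0.2])
first = assign_condition_by_id(15, conditions, proportions, random_state=42)
again = assign_condition_by_id(15, conditions, proportions, random_state=42)
```

Hashing rather than random sampling means no assignment table needs to be stored.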
datto.FrontEnd module
- class datto.FrontEnd.FrontEnd[source]
Bases:
object
Automatically generate HTML
- dataframe_to_html(df, title='')[source]
Write an entire dataframe to an HTML file with nice formatting.
- Parameters
df (DataFrame) –
title (str (optional)) –
- Returns
html
- Return type
str
- dropdown_from_dataframe(name, df, chosen_col, width=None, class_name=None)[source]
Create text to use for rendering an HTML dropdown from a DataFrame.
Render by using {{ df|safe }} in your HTML file.
- Parameters
name (str) – Name you’d like for the dropdown
df (DataFrame) –
chosen_col (str) – Which column’s values will populate the dropdown
width (str) – Width in pixels for the generated dropdown
class_name (str) – Name for class; used in order to create custom CSS
- Returns
html_choices – String you can use to render HTML
- Return type
str
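A minimal sketch of the dropdown generation (omitting the `width` and `class_name` styling options; the HTML layout is an assumption):

```python
import pandas as pd

def dropdown_from_dataframe(name, df, chosen_col):
    # Build an HTML <select> whose options are the unique values of the
    # chosen column; render the returned string with {{ html|safe }}.
    options = "".join(
        f'<option value="{val}">{val}</option>'
        for val in df[chosen_col].dropna().unique()
    )
    return f'<select name="{name}">{options}</select>'

df = pd.DataFrame({"region": ["east", "west", "east"]})
html = dropdown_from_dataframe("region_picker", df, "region")
```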
datto.ModelResults module
- class datto.ModelResults.ModelResults[source]
Bases:
object
Evaluate model performance & explore output
- coefficients_graph(X_train, X_test, model, model_type, filename='shap_graph', path='../images/', multiclass=False, y_test=None)[source]
Displays a graph of feature importances.
- Position on the horizontal axis indicates the magnitude of the effect on the target variable (e.g. affected by 0.25)
- Red/blue indicates the feature value (whether increasing or decreasing the feature has an effect)
- Blue & red mixed together indicate there isn’t a clear effect on the target variable
- For classification, the magnitude on the x axis can be read as: this feature changes the predicted probability of y on average by that many percentage points (axis value * 100)
- Parameters
X_train (pd.DataFrame) –
X_test (pd.DataFrame) –
model (fit model object) –
model_type (str) – ‘classification’ or ‘regression’
filename (str) –
multiclass (bool) –
y_test (pd.DataFrame) – Only needed for multiclass models
- coefficients_individual_predictions(model, df, X_train, X_test, id_col, num_id_examples, num_feature_examples, model_type, class_names=['False', 'True'], path='../images/')[source]
Uses LIME to inspect an individual prediction and the features that influenced that prediction.
- Parameters
model (sklearn model) –
df (pd.DataFrame) – Used for getting ids since they aren’t typically in training data
X_train (pd.DataFrame) –
X_test (pd.DataFrame) –
id_col (str) –
num_id_examples (int) –
num_feature_examples (int) –
model_type (str) – ‘classification’ or ‘regression’
class_names (str) –
path (str) –
- Returns
features
- Return type
list
- coefficients_summary(X, y, num_repetitions, num_coefficients, model_type, multiclass=False)[source]
Prints average coefficient values using a regression model.
- Parameters
X (DataFrame) –
y (DataFrame) –
num_repetitions (int) – Number of times to create models
num_coefficients (int) – Number of top coefficients to display
model_type (str) – ‘classification’ or ‘regression’
multiclass (bool) –
- Returns
simplified_df – Has mean, median, and standard deviation for coefficients after several runs
- Return type
DataFrame
- get_tree_diagram(model, X_train, path='../images/')[source]
Save a diagram of a trained DecisionTree model
- Parameters
model (sklearn model (trained)) –
X_train (pd.DataFrame) –
path (str) –
- most_common_words_by_group(X, text_col_name, group_col_name, num_examples, num_times_min, min_ngram)[source]
Get the most commons phrases for defined groups.
- Parameters
X (DataFrame) –
text_col_name (str) –
group_col_name (str) –
num_examples (int) – Number of text examples to include per group
num_times_min (int) – Minimum number of times word/phrase must appear in texts
min_ngram (int) –
- Returns
overall_counts_df – Has groups, top words, and counts
- Return type
DataFrame
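The grouping logic can be sketched with a plain word counter. datto's version also supports phrases via `min_ngram`; this sketch counts single words only, and the tokenization is an assumption:

```python
from collections import Counter

import pandas as pd

def most_common_words_by_group(df, text_col, group_col, num_examples, num_times_min):
    # For each group, tokenize the combined text, keep the words that
    # appear at least `num_times_min` times, and return the top results.
    rows = []
    for group, texts in df.groupby(group_col)[text_col]:
        counts = Counter(" ".join(texts).lower().split())
        top = [(w, c) for w, c in counts.most_common(num_examples) if c >= num_times_min]
        rows.append({"group": group, "top_words": top})
    return pd.DataFrame(rows)

df = pd.DataFrame({
    "team": ["a", "a", "b"],
    "text": ["ship the release", "ship it today", "review the docs"],
})
summary = most_common_words_by_group(df, "text", "team", num_examples=3, num_times_min=2)
```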
- most_similar_texts(X, text_column_name, chosen_num_topics=None, chosen_stopwords={}, exclude_numbers=False, exclude_times=False, exclude_months=False, exclude_weekdays=False, exclude_greetings_goodbyes=False, exclude_adverbs=False, num_examples=15, min_df=3, max_df=0.1, min_ngrams=1, max_ngrams=3)[source]
Uses NMF clustering to create n topics based on adjusted word frequencies
- Parameters
X (DataFrame) –
num_examples (int) –
text_column_name (str) –
chosen_num_topics (int) – Optional; if None, the algorithm will determine the best number
chosen_stopwords (set) – Option to add in your own unique stopwords
exclude_numbers (bool) – Adding numbers 0-3000 (with & without commas) as additional stopwords
exclude_times (bool) – Adding times as additional stopwords (e.g. 8:00)
exclude_months (bool) – Adding month names as additional stopwords
exclude_weekdays (bool) – Adding weekday names as additional stopwords
exclude_greetings_goodbyes (bool) – Adding common greetings & goodbyes as additional stopwords (e.g. hello)
exclude_adverbs (bool) – Adding common adverbs as additional stopwords (e.g. especially)
min_df (float) – Minimum number/proportion of docs that need to have the words
max_df (float) – Maximum number/proportion of docs that can have the words
min_ngrams (int) – Minimum number of words needed in phrases found
max_ngrams (int) – Maximum number of words in phrases found
- Returns
all_topics (DataFrame) – Top n words/phrases per topic
original_with_keywords (DataFrame) – Original text with topic number assigned to each
model (NMF model)
- score_final_model(model_type, X_test, y_test, trained_model, csv_file_name='final_model_results', multiclass=False)[source]
Score your model on the test dataset. Only run this once to get an idea of how your model will perform in real time. Run it after you have chosen your model & parameters to avoid problems with overfitting.
- Parameters
model_type (str) –
X_test (DataFrame) –
y_test (DataFrame) –
trained_model (sklearn model) –
multiclass (bool) –
csv_file_name (str) –
- Returns
model (model) – Fit model
y_predicted (array)
datto.SetupMethods module
datto.TrainModel module
- class datto.TrainModel.TrainModel[source]
Bases:
object
Select & train models
- model_testing(X_train, y_train, model_type, tie_breaker_scoring_method, save_to_csv=True, file_name='gridsearching_results', multiclass=False)[source]
Grid searches over a list of commonly used models to find the best model/parameters
- Parameters
X_train (DataFrame) –
y_train (DataFrame) –
model_type (str) – ‘classification’ or ‘regression’
tie_breaker_scoring_method (str) – For classification: “precision”, “recall”, or “f1” For regression: “neg_root_mean_squared_error”, “neg_median_absolute_error”, or “r2”
save_to_csv (bool) –
file_name (str) –
multiclass (bool) –
- Returns
best_params
- Return type
dict
- run_feature_selection(X_train, y_train, k, is_multiclass)[source]
Run SelectKBest feature selection for given datasets. Implements a custom method of feature selection for multiclass targets.
- Parameters
X_train (DataFrame) –
y_train (DataFrame) –
k (int) –
is_multiclass (bool) –
- Returns
cols_to_keep
- Return type
list
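For the non-multiclass case, SelectKBest-based selection might look like this (the scoring function and dataset below are illustrative, not datto's exact configuration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Score every feature against the target and keep the k strongest;
# column names are recovered from the boolean support mask.
X, y = make_classification(n_samples=100, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
cols_to_keep = list(X.columns[selector.get_support()])
```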
- train_in_chunks(X_train, y_train, model_type, is_multiclass, chunk_sizes=500000)[source]
For large datasets, train the model in manageable chunk sizes
- Parameters
X_train (DataFrame) –
y_train (DataFrame) –
model_type (str) – ‘classification’ or ‘regression’
is_multiclass (bool) –
chunk_sizes (int) –
- Returns
model
- Return type
sklearn model
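Chunked training works with any scikit-learn estimator that supports `partial_fit`; whether datto uses SGD specifically is an assumption:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stream a large dataset through the model in fixed-size chunks
# instead of loading and fitting it all at once.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SGDClassifier(random_state=0)
chunk_size = 250
for start in range(0, len(X), chunk_size):
    model.partial_fit(
        X[start:start + chunk_size],
        y[start:start + chunk_size],
        classes=np.array([0, 1]),  # required on the first partial_fit call
    )
accuracy = model.score(X, y)
```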