datto.CleanDataframe

class datto.CleanDataframe

Clean data using NLP, regex, calculations, etc.

__init__()

Methods

__init__()

batch_merge_operation(df_1, df_2, ...)

Merge two Pandas DataFrames in chunks for faster processing.

batch_pandas_operation(df, num_splits, ...)

Apply a function to a Pandas DataFrame in chunks for faster processing.

clean_column_names(df)

Rename all columns to use underscores so that they can be referenced without bracket notation.
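As an illustration of the renaming idea only, here is a minimal sketch that works on plain column labels. It is not datto's implementation: the helper name `to_snake_case` is hypothetical, and lowercasing and stripping punctuation are assumptions that the summary above does not confirm.

```python
import re


def to_snake_case(name):
    """Hypothetical helper: turn one column label into an underscore-separated name."""
    # Collapse every run of non-alphanumeric characters into a single underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    # Drop leading/trailing underscores; lowercasing is an assumption of this sketch.
    return cleaned.strip("_").lower()


columns = ["First Name", "total-cost ($)", "zip.code"]
print([to_snake_case(c) for c in columns])
# With pandas, the same mapping could be applied via df.rename(columns=to_snake_case).
```

Once labels look like `first_name`, a column can be reached with attribute access (`df.first_name`) instead of `df["First Name"]`.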

compress_df(df)

Compresses each dataframe column as much as possible depending on type and values.

fill_nulls_using_regression_model(X_train, ...)

Trains a regression model on non-null data and predicts values to fill in nulls.

fix_col_data_type(df, col, desired_dt)

Change column datatype using the best method for each type.

lematize(text)

Parameters: text

make_uuid(id_num)

Make a UUID from a text string.
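One common way to derive a UUID from text is a name-based (version 5) UUID from the standard library, sketched below. Whether `make_uuid` uses this scheme, and which namespace it would use, is not stated here; the helper name `text_to_uuid` and the choice of `uuid.NAMESPACE_DNS` are assumptions.

```python
import uuid


def text_to_uuid(text):
    """Hypothetical helper: derive a deterministic UUID string from arbitrary text."""
    # uuid5 hashes the namespace and the name with SHA-1, so equal inputs
    # always produce the same UUID. The namespace choice is an assumption.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(text)))


first = text_to_uuid("customer-42")
second = text_to_uuid("customer-42")
print(first == second)  # True: the mapping is repeatable
```

A deterministic mapping like this lets the same source ID be anonymized consistently across runs, which a random `uuid.uuid4()` would not.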

remove_duplicate_columns(df)

Remove columns with the same name

remove_email_greetings_signatures(text)

To keep only the main text of an email, this method removes greetings, signoffs, and signatures by dropping sentences in which fewer than 5% of the words are verbs.

remove_links(text)

Parameters: text
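The summary for `remove_links` gives no detail, so the following is only a sketch of what regex-based link removal can look like. The pattern, the helper name `strip_links`, and the whitespace cleanup are assumptions, not the library's code.

```python
import re

# Assumed pattern: anything starting with http://, https://, or www. up to the next space.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_links(text):
    """Hypothetical helper: delete URLs and tidy the leftover whitespace."""
    without_links = URL_PATTERN.sub("", text)
    # Removing a URL leaves a double space behind; collapse it.
    return re.sub(r"\s{2,}", " ", without_links).strip()


print(strip_links("See https://example.com/docs for details or www.example.org today"))
```

Note that punctuation glued to the end of a URL (for example a trailing period) is removed along with it under this pattern.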

remove_names(text)

Parameters: text

remove_pii(text)

Remove common patterns of personally identifiable information (PII).

Parameters: text (str)
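To make "common patterns" concrete, here is a small regex-based sketch covering email addresses, US-style Social Security numbers, and US-style phone numbers. Which patterns `remove_pii` actually targets is not listed here, so the patterns, the helper name `mask_pii`, and the use of placeholders instead of outright deletion are all assumptions.

```python
import re

# Each entry pairs an assumed PII pattern with the placeholder that replaces it.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
]


def mask_pii(text):
    """Hypothetical helper: replace each matched PII pattern with a placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(mask_pii("Mail jane.doe@example.com or call 555-123-4567."))
```

Placeholders keep the sentence readable and show what was removed; swapping each placeholder for an empty string would delete the match instead. Regexes of this kind miss many real-world formats, so they are a first pass rather than a guarantee.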