Example usage

py_predpurchase can be used to:

Apply preprocessing transformations to the data, including scaling, encoding, and passing through features as specified.
Calculate cross-validation results for four common off-the-shelf models (Dummy, KNN, SVM, and RandomForest).
Take a fitted model, extract its feature importances, sort them in descending order, and return them as a DataFrame.
Calculate classification metrics for model predictions, including precision, recall, accuracy, and F1 score.

Here, we will demonstrate each of those functionalities:

import py_predpurchase

print(py_predpurchase.__version__)

0.2.0

Imports

# Importing packages needed for the functions:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Importing functions
from py_predpurchase.function_preprocessing import numerical_categorical_preprocess
from py_predpurchase.function_model_cross_val import model_cross_validation
from py_predpurchase.function_feature_importance import get_feature_importances 
from py_predpurchase.function_classification_metrics import calculate_classification_metrics

Creating dummy objects to give to the functions

Note: this is a demonstration. When using the package, you will pass in your own objects (DataFrames, models, hyperparameter values). Each function below gets its own dummy data because the functions do not cover the entire analysis workflow, so the output of one function is not necessarily the direct input of the next.

Preprocessing

Given a dataset with both categorical and numerical data, you can use the numerical_categorical_preprocess function to preprocess all features, putting them in a format that is compatible with most machine learning models.

# Creating a dummy dataset with both categorical and numerical data
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'Boolean': [True, False, True, False, True, False, True, False],
    'Numerical_1': np.random.rand(8),  # 8 random float numbers
    'Numerical_2': np.random.randint(1, 100, 8)  # 8 random integers from 1 to 99 (the upper bound is exclusive)
}

df = pd.DataFrame(data)

# separating the explanatory features (X) from the target (y)
X = df.drop('Numerical_2', axis=1)
y = df['Numerical_2']

# performing a train/test split: 25% of the rows for testing, 75% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# defining numerical and categorical features
numeric_features = ["Numerical_1"]
categorical_features = ["Category", "Boolean"]

# applying the numerical_categorical_preprocess function
preprocessed_data = numerical_categorical_preprocess(
    X_train, 
    X_test, 
    y_train,
    y_test,
    numeric_features, 
    categorical_features
)

Cross Validation

The model_cross_validation function calculates cross-validation results for four common off-the-shelf models (Dummy, KNN, SVM, and RandomForest) using preprocessed and cleaned training and testing datasets. The RandomForest and Dummy hyperparameters are fixed for simplicity's sake.

# creating a dummy dataset

train_data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
    'target': [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
})

test_data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
    'target': [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
})

# defining the target column name and dummy hyperparameter values
target = "target"
k = 5
gamma = 10

cross_val_results = model_cross_validation(train_data, 
                                           test_data, 
                                           target, 
                                           k, 
                                           gamma)

pd.DataFrame(cross_val_results)

             dummy              knn                SVM                random_forest
fit_time     0.001 (+/- 0.000)  0.001 (+/- 0.000)  0.002 (+/- 0.000)  0.056 (+/- 0.001)
score_time   0.001 (+/- 0.000)  0.003 (+/- 0.000)  0.001 (+/- 0.000)  0.003 (+/- 0.000)
test_score   0.533 (+/- 0.075)  0.467 (+/- 0.075)  0.367 (+/- 0.217)  0.367 (+/- 0.217)
train_score  0.544 (+/- 0.025)  0.569 (+/- 0.084)  0.750 (+/- 0.048)  0.750 (+/- 0.048)
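A table of this shape can be assembled from scikit-learn's cross_validate, as in the rough sketch below. This is an illustration only, not the body of model_cross_validation: the number of folds, the model settings, the use of the sample standard deviation, and the names X_cv, y_cv, and summary_table are all assumptions.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical demo data with the same values as the dummy dataset above
X_cv = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
})
y_cv = pd.Series([0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0])

# Four off-the-shelf classifiers; these hyperparameter values are assumptions
models = {
    "dummy": DummyClassifier(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(gamma=10),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

summary = {}
for name, model in models.items():
    # One row per fold, with fit_time, score_time, test_score, train_score
    fold_scores = pd.DataFrame(
        cross_validate(model, X_cv, y_cv, cv=3, return_train_score=True)
    )
    # Collapse the folds into "mean (+/- std)" strings
    summary[name] = {
        metric: f"{fold_scores[metric].mean():.3f} (+/- {fold_scores[metric].std():.3f})"
        for metric in fold_scores.columns
    }

summary_table = pd.DataFrame(summary)
print(summary_table)
```

Timings and scores will differ from the table above, since the fold count and model settings here are illustrative.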

Feature Importance

Given a fitted tree-based model and the names of the explanatory features (X_columns), the get_feature_importances function extracts the feature importances, sorts them, and returns them as a DataFrame.

# creating a dummy model
model = RandomForestClassifier(max_depth=2, random_state=0)

# dummy data
X_data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
})

y_data = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# fitting the dummy model

model.fit(X_data, y_data)

# get_feature_importances checks X_columns.empty, so pass a pandas Index here;
# a plain list such as ["feature1", "feature2"] raises
# AttributeError: 'list' object has no attribute 'empty'
X_columns = X_data.columns

get_feature_importances(model, X_columns)
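The idea behind this step can be sketched with scikit-learn and pandas alone. The snippet below is a hedged illustration rather than the package's code: it reads the feature_importances_ attribute of a fitted RandomForestClassifier and sorts the result. The column labels "Feature" and "Importance" and the names X_fi, y_fi, and importance_table are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical demo data with the same values as the dummy data above
X_fi = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
})
y_fi = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# feature_importances_ only exists on a fitted tree-based model
forest = RandomForestClassifier(max_depth=2, random_state=0).fit(X_fi, y_fi)

# Pair each feature name with its importance and sort in descending order
importance_table = (
    pd.DataFrame({
        "Feature": X_fi.columns,
        "Importance": forest.feature_importances_,
    })
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)

print(importance_table)
```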

Classification Metrics

Given the true target values and the predicted target values (the output of a chosen model's predictions), the calculate_classification_metrics function calculates classification metrics for the predictions, including precision, recall, accuracy, and F1 score.

# dummy data

y_true = [1,0,1,1,1,0,0,1,0,1]
y_pred = [1,1,1,0,1,0,0,1,0,0]

calculate_classification_metrics(y_true, y_pred)

{'Precision': 0.7200000000000001,
 'Recall': 0.7,
 'Accuracy': 0.7,
 'F1 Score': 0.7030303030303029}
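The figures above are consistent with averaging each per-class metric weighted by class support (the number of true samples in each class). That the package averages this way is an inference from the output, not something confirmed from its source. The pure-Python sketch below recomputes the four numbers by hand under that assumption; the helper names per_class_scores and weighted_metrics are hypothetical.

```python
y_true = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]


def per_class_scores(y_true, y_pred, label):
    """Precision, recall, F1 and support when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        p, r, f, support = per_class_scores(y_true, y_pred, label)
        weight = support / total  # share of true samples belonging to this class
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"Precision": precision, "Recall": recall, "Accuracy": accuracy, "F1 Score": f1}


metrics = weighted_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'Precision': 0.72, 'Recall': 0.7, 'Accuracy': 0.7, 'F1 Score': 0.703}
```

For comparison, treating class 1 alone as the positive class would give a precision of 0.8 and a recall of about 0.667, which is why the averaging scheme matters when reading these scores.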