2.3 Theoretical - Energy Prediction

#missingvalues #logisticregression #predict
You work for an energy company. From the smart energy meters in every customer's home you can collect features, measured at 15-minute intervals. Your goal is to predict whether a customer's energy usage is over or under their prepayment (i.e., too much or too little used). You have a hand-coded label for each row in the data. Your dataset has 5000 columns and 2000 rows.
a) You want to perform logistic regression, but it does not work when you have more columns than rows. What would your strategy be? Be specific about the steps you would take!
b) From your logistic regression model, you obtain the confusion matrix below. What is the accuracy? Is this high compared to the baseline accuracy?

  1. Data Cleaning:

    • Check for missing values and decide on a strategy to handle them, such as imputation, removing columns with too many missing values, or using algorithms that tolerate missing data (see the imputation sketch after this list).
  2. Feature Engineering:

    • Aggregate or combine similar features, e.g. summarize the 96 15-minute readings per day into a few daily statistics.
    • Create interaction terms if they make sense in the context of the problem (a sketch follows the list).
  3. Dimensionality Reduction:

    • Use techniques such as Principal Component Analysis (PCA) to reduce the dimensionality of the data, keeping in mind that interpretability is compromised because components are linear mixtures of the original features (see the PCA sketch below).
    • Another approach is Linear Discriminant Analysis (LDA), which maximizes the separability between the known classes.
  4. Feature Selection:

    • Use techniques like Recursive Feature Elimination (RFE) or feature importances from tree-based models (like Random Forests) to rank and select the most relevant features.
    • Apply regularization techniques such as L1 regularization (Lasso), which shrinks some feature coefficients exactly to zero, effectively selecting a subset of the features (see the L1 sketch below).
  5. Cross-Validation:

    • Especially with a small dataset, use cross-validation techniques like k-fold or stratified k-fold to partition the data into training and validation folds; this detects overfitting and yields a more reliable estimate of generalization performance (see the pipeline sketch below).
  6. Modeling:

    • After reducing the feature space and selecting the most relevant features, fit logistic regression on the transformed dataset; the final sketch below combines these steps in a single cross-validated pipeline.
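
The sketches below use Python with pandas and scikit-learn; all data, shapes, column names, and thresholds are illustrative stand-ins rather than the real meter data. First, step 1, handling missing values: drop columns that are mostly empty, then impute the rest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
# Stand-in for the real meter data: 2000 customers x 5000 readings (assumed shape).
X = pd.DataFrame(rng.random((2000, 5000)))
X.iloc[::7, ::11] = np.nan    # scattered gaps
X.iloc[:1500, :3] = np.nan    # three columns that are mostly missing

# Drop columns with more than 40% missing values (the threshold is a judgment call) ...
X = X.loc[:, X.isna().mean() <= 0.40]

# ... then fill the remaining gaps with each column's median.
X_imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(X),
                         columns=X.columns)
```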
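For step 2, a sketch of aggregating interval readings into daily summaries, assuming (this is an assumption) the columns are consecutive 15-minute readings, 96 per day:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
n_rows, per_day = 2000, 96
readings = rng.random((n_rows, 4992))            # 52 whole days of stand-in data
daily = readings.reshape(n_rows, -1, per_day)    # (customers, days, 96 intervals)

# Collapse each day's 96 readings into two summary statistics per day.
features = pd.DataFrame({
    **{f"day{d}_mean": daily[:, d, :].mean(axis=1) for d in range(daily.shape[1])},
    **{f"day{d}_max": daily[:, d, :].max(axis=1) for d in range(daily.shape[1])},
})
# 4992 raw columns shrink to 104 interpretable aggregates.

# Interaction terms only make sense on a small, curated feature subset:
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(features.iloc[:, :10])
```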
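For step 3, a minimal PCA sketch; keeping 95% of the variance is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.random((2000, 5000))                   # stand-in for the imputed matrix

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=0.95)                   # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                         # far fewer than 5000 columns
```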
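For step 4, a sketch of L1-based feature selection; the regularization strength C=0.05 is illustrative, and an RFE selector would slot into the same pipeline position:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.random((2000, 5000))
y = rng.integers(0, 2, 2000)          # stand-in labels: over (1) vs. under (0)

# The L1 penalty drives many coefficients exactly to zero; the non-zero
# coefficients mark the selected features. Smaller C means fewer survivors.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
)
pipe = make_pipeline(StandardScaler(), selector)
X_selected = pipe.fit_transform(X, y)
print(X_selected.shape)
```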
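Finally, steps 5 and 6 combined: a cross-validated pipeline in which scaling and reduction are fit inside each fold, so no information leaks from the validation split (component count and fold count are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.random((2000, 5000))
y = rng.integers(0, 2, 2000)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),             # or the L1 selector from the previous sketch
    LogisticRegression(max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```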

b) Given the confusion matrix:


| predicted \ true | over | under |
| ---------------- | ---- | ----- |
| over             | 1752 | 26    |
| under            | 148  | 74    |

Accuracy = (True Positives + True Negatives) / Total Predictions
Accuracy = (1752 + 74) / (1752 + 26 + 148 + 74)
Accuracy = 1826 / 2000 = 0.913, i.e. 91.3%

To determine if this accuracy is high compared to the baseline accuracy, we need to calculate the baseline accuracy.

Baseline Accuracy: the accuracy a naive model would achieve by always predicting the most frequent class. In this dataset the most frequent true class is "over", so:

Baseline Accuracy = Total 'over' observations / Total Observations
Baseline Accuracy = (1752 + 148) / 2000
Baseline Accuracy = 1900 / 2000 = 0.95, i.e. 95%
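
Both numbers can be checked in a few lines of NumPy, reading the matrix with rows as predictions and columns as true classes, as in the table above:

```python
import numpy as np

# Rows = predicted class, columns = true class; order: over, under.
cm = np.array([[1752, 26],
               [148,  74]])

accuracy = np.trace(cm) / cm.sum()    # (1752 + 74) / 2000 = 0.913
baseline = cm[:, 0].sum() / cm.sum()  # true-'over' share: 1900 / 2000 = 0.95
print(accuracy, baseline)
```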

Comparing the two, the logistic regression model's accuracy (91.3%) is lower than the baseline accuracy (95%), so on accuracy alone it does not beat a model that always predicts the majority class. It does, however, recover 74 of the 100 true "under" cases (74% recall on the minority class), which the baseline would miss entirely. It is therefore crucial to evaluate other metrics (precision, recall, F1-score) and the costs of false positives and false negatives before judging the model's utility.
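
As a sketch of that evaluation, the minority-class metrics fall straight out of the same matrix:

```python
import numpy as np

cm = np.array([[1752, 26],     # rows = predicted (over, under)
               [148,  74]])    # columns = true (over, under)

tp_under = cm[1, 1]                              # predicted under, truly under
precision_under = tp_under / cm[1].sum()         # 74 / 222 ≈ 0.333
recall_under = tp_under / cm[:, 1].sum()         # 74 / 100 = 0.740
f1_under = (2 * precision_under * recall_under
            / (precision_under + recall_under))  # ≈ 0.460
print(precision_under, recall_under, f1_under)
```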