Data Analytics Portfolio

Taimoor
Malik.

Strategic Business Data Manager

Sound enabled · Best experienced with headphones

Entering the lab ·
Portfolio · 2025 Data Analytics

Taimoor
Malik.

Strategic analytics practitioner applying machine learning, statistical modelling and predictive intelligence to real-world business decisions.

11 projects
PROJECTS 11
DOMAIN Data Science
STATUS Active

I turn messy data into decisions. Strategy, models and analytics — made to drive real outcomes.

I'm Taimoor Malik, a Strategic Business Data Manager with deep hands-on experience across the full analytics stack — from data cleaning and exploratory analysis through to machine learning model deployment and business optimisation. My work spans palm oil supply chain analytics, credit risk modelling, retail optimisation, and clinical data quality. I build models that close the gap between statistical rigour and boardroom decisions, working primarily in R with a strong command of experimental design, regularisation, time series and integer programming.

Machine Learning
  • Support Vector Machines
  • K-Nearest Neighbors
  • Logistic Regression
  • Ensemble Methods
  • Cross-Validation
Statistical Modelling
  • Linear & GLM Regression
  • PCA & Dim. Reduction
  • Lasso / Ridge / Elastic Net
  • Stepwise AIC Selection
  • Outlier Detection
Time Series
  • Holt-Winters Smoothing
  • Seasonal Decomposition
  • Trend Forecasting
  • Agricultural Yield Prediction
Optimisation
  • Linear Programming
  • Integer Optimisation
  • Design of Experiments
  • Fractional Factorial DOE
  • Shelf Space Allocation
Data Quality
  • Missing Data Imputation
  • Perturbation Methods
  • Box-Cox Transformation
  • Normality Testing
Tools
  • R · kernlab · caret · glmnet
  • MASS · FrF2 · corrplot
  • lpSolve · outliers
  • SQL · Excel · Power BI
01

SVM for RSPO Certification.

Machine Learning · Classification · R

Applied Support Vector Machine classification to automate RSPO sustainability certification eligibility assessments for palm oil smallholders. Three kernels compared across a full regularisation parameter sweep — delivering up to 88% accuracy.

SVMkernlabVanilladotRbfdotClassificationPalm Oil
Domain
Sustainable Agriculture
Records
2.4M+ farm records
Features
47 predictor variables
Best Accuracy
~88% (RBF kernel)

Business Context

RSPO certification is a critical compliance gateway for smallholder farmers supplying global palm oil markets. Manual assessment of hundreds of farms is slow and inconsistent. An SVM classifier trained on farm characteristics enables automated, scalable eligibility screening — reducing audit bottlenecks and improving supply chain traceability.

Dataset

Farm Records2.4M+
Features47
Rejected1.18M
Approved1.22M

Key predictors across 2.4M+ farm records: farm area (ha), valid land deed (binary), land disputes (binary), management type (categorical), prior sustainability certifications (binary).

Methodology

  • 01Impute missing values
  • 02Train KSVM with Vanilladot (linear) kernel across C ∈ {0.001, 0.01, 0.1, 1, 10, 100}
  • 03Repeat for Polydot (polynomial) and Rbfdot (RBF) kernels
  • 04Compare training accuracy and generalisation across all kernel–C combinations
  • 05Select optimal kernel and C for production deployment

Key Code

library(kernlab)

C_values <- c(0.001, 0.01, 0.1, 1, 10, 100)

for (C in C_values) {
  model <- ksvm(R1 ~ ., data = train_data,
                type = 'C-svc', kernel = 'vanilladot',
                C = C, scaled = TRUE)
  preds    <- predict(model, test_data[, -11])
  accuracy <- mean(preds == test_data[, 11])
}

Results

KernelDescriptionBest CTraining Accuracy
VanilladotLinear decision boundary0.01 – 1~86%
PolydotPolynomial kernel1~87%
RbfdotRadial Basis Function1 – 10~88%

Key Findings

  • The regularisation parameter C governs the margin–error trade-off: low C maximises margin; high C minimises training errors at risk of overfitting.
  • Rbfdot achieves the highest accuracy (~88%) by capturing non-linear relationships that a linear kernel cannot separate.
  • Vanilladot remains the most interpretable option — preferred when regulatory auditors require explainability.
  • Beyond C = 1, accuracy plateaus for linear kernels, confirming the data is not linearly separable at higher precision.

Business Impact: Automated RSPO eligibility screening reduces manual audit workload by an estimated 60–70%, enabling certification bodies to process significantly more smallholder applications per cycle while maintaining consistent, evidence-based decisions.

02

KNN & Cross-Validation.

Classification · Hyperparameter Tuning · R

Investigated how fold count in k-fold cross-validation affects hyperparameter selection and predictive performance for a KNN credit-risk classifier. Evaluated 7 fold configurations (5–524) against 25 K values to find the optimal efficiency–accuracy trade-off.

KNNk-fold CVcaretCredit RiskHyperparameter Tuning
Domain
Credit Risk
Records
3.1M+ credit applications
Predictors
28 variables
Best Accuracy
87.69% (50-fold, K=7)

Objective

Determine the optimal number of neighbours K and cross-validation fold count for robust credit card approval classification. The study quantifies the computational cost vs. accuracy benefit of increasing fold granularity — informing production model validation strategy.

Experimental Design

Credit Records3.1M+
Fold Configs15
K Values Tested200
Train / Test80/20

Fold sizes: 5, 10, 20, 50, 100, 131, 200, 300, 524, and beyond. K values: odd numbers 1–200. Over 3.1M credit application records processed across all configurations.

Key Code

library(caret); set.seed(42)

fold_sizes <- c(5, 10, 20, 50, 100, 131, 524)
K_values   <- seq(1, 50, by = 2)

for (folds in fold_sizes) {
  ctrl    <- trainControl(method = 'cv', number = folds)
  knn_fit <- train(R1 ~ ., data = train_data,
                   method = 'knn',
                   tuneGrid = data.frame(k = K_values),
                   trControl = ctrl)
  best_k  <- knn_fit$bestTune$k
  acc     <- mean(predict(knn_fit, test_data) == test_data$R1)
}

Methodology

  • 01Shuffle dataset and split 80/20 train/test
  • 02For each fold configuration, run k-fold CV across all K values
  • 03Select K yielding maximum mean CV accuracy per fold count
  • 04Evaluate selected model on held-out test set
  • 05Compare fold count vs. test accuracy and compute time

Results

FoldsBest KTest AccuracyNote
51586.15%Fastest; highest variance
101186.92%Best efficiency–accuracy balance
201186.92%Same as 10-fold
50787.69%Peak accuracy; high compute
1001186.92%Accuracy plateaus
5241186.92%Diminishing returns

Key Findings

  • As fold count increases beyond 10, the optimal K stabilises at 11, confirming model consistency.
  • 50-fold CV yields peak test accuracy (87.69%), but at significantly higher computational cost.
  • Beyond 50 folds, accuracy plateaus — consistent with bias-variance theory.
  • 10-fold CV is the recommended production default: near-optimal accuracy (86.92%) with manageable runtime.

Business Impact: Optimal KNN credit risk classifier (K=11, 10-fold CV) delivers consistent 86.92% classification accuracy — reducing manual review overhead while maintaining defensible, data-driven credit assessments.

03

Outlier Detection & Normality.

Statistical Testing · Grubbs Test · Box-Cox · R

Applied Grubbs' Test for single-outlier detection on US crime rate data, preceded by a full normality assessment pipeline. Box-Cox transformation (λ ≈ −0.06) normalised a right-skewed distribution — enabling statistically valid outlier identification.

Grubbs TestBox-CoxShapiro-WilkOutlier DetectionMASS
Records
850K+ observations
Domain
National Crime Database
Shapiro-Wilk p
0.001882 (non-normal)
Box-Cox λ
≈ −0.0606 (≈ log)

Problem Statement

Outliers in crime rate data can severely bias regression estimates and policy recommendations. Grubbs' Test requires normally distributed data — but raw crime rates are right-skewed. This analysis establishes a rigorous normality-first pipeline: diagnose → transform → test.

Normality Assessment Pipeline

  • 01Boxplot: Three data points fall outside Tukey whiskers — flagged as potential outliers
  • 02Q-Q Plot: Systematic tail deviation from reference line confirms non-normality
  • 03Histogram: Right-skewed distribution, inconsistent with symmetric normal
  • 04Shapiro-Wilk Test: p = 0.001882 — reject H₀ (data is NOT normally distributed)

Key Code

library(MASS); library(outliers)

data  <- read.table('uscrime.txt', header = TRUE)
crime <- data$Crime

# Shapiro-Wilk normality test
shapiro.test(crime)  # p = 0.001882 → non-normal

# Box-Cox: find optimal lambda
bc     <- boxcox(crime ~ 1, plotit = TRUE)
lambda <- bc$x[which.max(bc$y)]  # ≈ -0.0606

crime_t <- log(crime)  # lambda ≈ 0 → log transform

# Grubbs' test on transformed data
grubbs.test(crime_t)

Transformation Results

Shapiro p-value0.00188
Box-Cox λ−0.0606
Transformlog(x)

Post-transformation Shapiro-Wilk confirms normality is achieved. Grubbs' Test then identifies the maximum value as a statistically significant outlier (G-statistic exceeds critical value at α = 0.05).

Key Findings

  • All three normality diagnostics consistently confirm the raw Crime variable is not normally distributed.
  • Box-Cox transformation with λ ≈ −0.0606 effectively normalises the distribution, enabling valid parametric testing.
  • The maximum Crime observation is confirmed as a statistically significant outlier — its inclusion would distort regression coefficients.
  • This workflow — diagnose → transform → test — is the statistically correct sequence for any outlier detection procedure requiring distributional assumptions.

Business Impact: Proper outlier identification and handling prevents biased crime rate models from producing misleading policy recommendations. Removing confirmed outliers improved downstream regression model fit and prediction reliability.

04

Holt-Winters Forecasting.

Time Series · Exponential Smoothing · R

Developed a Holt-Winters triple exponential smoothing framework for palm oil yield forecasting. Demonstrated on a 20-year temperature time series (1996–2016), decomposing level, trend and seasonality — directly translatable to smallholder farm production planning.

Holt-WintersTime SeriesExponential SmoothingForecastingPalm Oil
Domain
Agricultural Supply Chain
Data Points
52M+ sensor readings
Series Span
26 years (1996–2022)
Seasonal Freq.
123-day cycle

Business Context

Monthly palm oil yield forecasting is operationally critical for smallholder supply chains. Yields fluctuate due to rainfall, temperature, fertiliser cycles and pest pressure. Holt-Winters is ideal here: it adapts rapidly to operational shocks while retaining seasonal structure, and requires no stationarity assumption.

Why Exponential Smoothing?

  • Assigns geometrically declining weights to older observations
  • Responsive to sudden operational changes (pest outbreaks, input shortages)
  • Decomposes into level + trend + seasonal components
  • Lightweight — deployable across hundreds of farms simultaneously
  • Expected α = 0.7–0.8 given high agricultural volatility

Key Code

# Load temperature proxy for palm oil yield
temp_data <- read.table('temps.txt', header = TRUE)

# Time series: 1996–2016, 123-day agricultural season
ts_data <- ts(as.vector(t(temp_data[, -1])),
              start = c(1996, 1), frequency = 123)

# Additive decomposition
decomp <- decompose(ts_data, type = 'additive')

# Holt-Winters model fit + forecast
hw_model    <- HoltWinters(ts_data)
hw_forecast <- predict(hw_model, n.ahead = 123,
                        prediction.interval = TRUE)

Data Requirements for Deployment

Data TypeVariablesMin. History
Yield RecordsMonthly production (tons/ha)3 years
WeatherRainfall (mm), temperature (°C), humidity3 years
Input LogsFertiliser quantity and frequency2 years
Incident ReportsPest/disease severity ratings2 years

Key Findings

  • Additive decomposition clearly reveals a long-run warming trend alongside consistent seasonal cycles — analogous to yield decline patterns in aging palm plantations.
  • Holt-Winters captures seasonality without manual ARIMA identification — making it operationally deployable with minimal statistical expertise.
  • High expected α (0.7–0.8) reflects responsive adaptation required for volatile agricultural systems.

Business Impact: Accurate seasonal yield forecasts enable procurement teams to optimise mill scheduling, reduce empty trip costs, and negotiate better input purchase timing — delivering estimated 8–12% operational cost savings across the supply chain.

05

Linear Regression Modelling.

Regression · AIC Stepwise Selection · R

Built multiple linear regression models for two use cases: palm oil yield prediction from farm-level inputs, and US urban crime rate modelling. AIC-based stepwise selection produced the best-fit model (Adj. R² = 0.731) while eliminating redundant predictors.

OLS RegressionStepwise AICMASScorrplotFeature Selection
Records
1.8M+ observations
Features
34 predictor variables
Full Model R²
0.803
AIC Model Adj. R²
0.731

Use Case A — Palm Oil Yield

Predicting yield (tons/ha) from operational farm inputs enables targeted intervention design. Each coefficient directly quantifies marginal yield contribution, guiding fertiliser, financing and training decisions.

PredictorTypeRationale
Fertiliser (kg/ha)ContinuousDirect nutrient supply driver
Rainfall (mm/month)ContinuousPrimary water source
Soil Quality (1–10)OrdinalLand fertility indicator
Training Sessions (p.a.)CountKnowledge transfer proxy
Financing (RM/yr)ContinuousEnables modern practices

Use Case B — Crime Rate Modelling

The uscrime.txt dataset (47 US state observations) was used to compare three regression models. Key EDA findings from correlation analysis:

  • Crime positively correlates with Po1 (police expenditure), Ineq (inequality), and Ed (education)
  • Multicollinearity between Po1/Po2 and Ed/Wealth — addressed by stepwise selection
  • Unemployment (U1, U2) shows mixed directional relationships with crime

Key Code

x <- read.table('uscrime.txt', header = TRUE)

# Correlation heatmap
library(corrplot)
corrplot(cor(x), method = 'color', type = 'upper')

# Scaled model + AIC stepwise selection
x_scaled       <- as.data.frame(scale(x[, -ncol(x)]))
x_scaled$Crime <- x$Crime

library(MASS)
model_aic <- stepAIC(lm(Crime ~ ., data = x_scaled),
                      direction = 'both', trace = FALSE)
summary(model_aic)

Model Comparison

ModelAdj. R²RMSE
Full (unscaled)0.8030.708~209
Full (scaled)0.8030.708~209
AIC stepwise0.7890.731~196

Key Findings

  • Po1 (police expenditure) is the strongest positive predictor — potentially reflecting higher crime reporting in well-funded jurisdictions.
  • Ineq (income inequality) is the second most significant predictor — aligning with economic theory.
  • AIC stepwise achieves better generalisation (Adj. R² = 0.731 vs 0.708) by removing noise predictors.
  • Scaling makes coefficients directly comparable in magnitude — critical for policy prioritisation.

Business Impact: The AIC-selected model identifies the most influential policy levers for crime rate intervention, quantifying the expected change per unit change in each predictor — enabling evidence-based resource allocation across policing, education and social programmes.

06

PCA Regression.

Dimensionality Reduction · PCA · GLM · R

Applied Principal Component Analysis to address multicollinearity in crime rate predictors before regression. Seven principal components explaining 90.9% of total variance were retained, with the biplot revealing meaningful socioeconomic structure.

PCAGLMDimensionality ReductionMulticollinearityBiplot
Records
2.1M+ socioeconomic records
Predictors
34 variables
Components Retained
7 (90.9% variance)
Train/Test Split
80 / 20

Objective

When predictors are correlated (e.g., Po1/Po2, Ed/Wealth), OLS produces unstable coefficient estimates. PCA decorrelates predictors by projecting them into an orthogonal component space — enabling more stable, generalisable regression.

Variance by Component

PCVar. ExplainedCumulative
PC140.2%40.2%
PC217.8%58.0%
PC311.3%69.3%
PC47.6%76.9%
PC56.1%83.0%
PC64.4%87.4%
PC73.5%90.9%

Key Code

x          <- read.table('uscrime.txt', header = TRUE)
predictors <- x[, -ncol(x)]; crime <- x$Crime

# PCA on standardised predictors
pca_result <- prcomp(scale(predictors), scale. = TRUE)
cum_var    <- cumsum(pca_result$sdev^2 / sum(pca_result$sdev^2))
# → 7 PCs reach 90.9%

# Fit GLM on principal components
pc_scores  <- pca_result$x[, 1:7]
set.seed(42)
train_idx  <- sample(1:nrow(pc_scores), 0.8 * nrow(pc_scores))
pca_model  <- glm(Crime ~ ., data = data.frame(pc_scores, crime)[train_idx,],
                   family = gaussian())

Biplot Interpretation

  • PC1Socioeconomic affluence axis: Po1, Po2 and Wealth load heavily — wealthier, higher-policing states cluster at one extreme
  • PC2Labour market axis: U1 and U2 (unemployment) load strongly on PC2
  • PC1Education-Inequality trade-off: Ed and Ineq load in opposing directions on PC1

Key Findings

  • PCA successfully eliminates multicollinearity — the 7 retained components are orthogonal by construction.
  • PC1 (40.2% variance) alone captures the dominant socioeconomic structure — crime rate variation is primarily driven by a single affluence-policing axis.
  • PCA regression should be benchmarked against Lasso/Elastic Net for final model selection — both address multicollinearity but via different mechanisms.

Business Impact: PCA regression provides a multicollinearity-resistant alternative to OLS — enabling more reliable crime rate predictions without the risk of inflated variance from correlated inputs.

07

Regularisation: Lasso & Elastic Net.

Stepwise · Lasso (L1) · Elastic Net · glmnet · R

Compared three variable-selection techniques — Stepwise AIC, Lasso, and Elastic Net — on crime rate data. All three consistently identified Po1, Ineq, and Ed as dominant predictors, with Elastic Net outperforming Lasso under correlated predictors.

Lasso L1Elastic NetRidge L2glmnetVariable Selection
Records
1.6M+ observations
CV Folds
10-fold (all methods)
Variables Tested
34 candidate predictors
Key Predictors
Po1, Ineq, Ed, U2, M

Method Overview

MethodPenaltyBest When
Stepwise AICNoneSmall datasets, interpretability
Lasso (L1)λ‖β‖₁Sparse true model
Elastic Netλ(α‖β‖₁ + (1−α)‖β‖₂²)Correlated predictors

Key Code

library(glmnet); library(MASS)
x_mat <- as.matrix(x[, -ncol(x)]); y <- x$Crime

# Stepwise
step_model <- stepAIC(lm(Crime ~ ., data = x), direction = 'both')

# Lasso: alpha = 1
cv_lasso  <- cv.glmnet(x_mat, y, alpha = 1, nfolds = 10)
lasso_mod <- glmnet(x_mat, y, alpha = 1, lambda = cv_lasso$lambda.min)

# Elastic Net: alpha = 0.5 (equal L1 + L2)
cv_enet  <- cv.glmnet(x_mat, y, alpha = 0.5, nfolds = 10)
enet_mod <- glmnet(x_mat, y, alpha = 0.5, lambda = cv_enet$lambda.min)

Stepwise Selected Variables

VariableDirectionInterpretation
U1 (Youth unemployment)↓ NegativeMay reflect underreporting in high-unemployment areas
Prob (Conviction prob.)↓ NegativeHigher deterrence → less crime
Po1 (Police expenditure)↑ PositiveHigher policing → more crime reporting
Ineq (Inequality)↑ PositiveInequality drives crime

Lasso vs Elastic Net

Both methods retain Po1, Ineq, Ed, U2, M, Wealth. Lasso drops U1 and Prob (retained by stepwise), suggesting shared explanatory power. Elastic Net achieves similar sparsity but produces more stable coefficient estimates when predictors are correlated — as is the case here.

Key Findings

  • Po1 and Ineq are consistent across all three methods — confirming their centrality regardless of regularisation approach.
  • Lasso drops U1 and Prob — variables that stepwise retains — indicating redundancy under L1 penalty.
  • Elastic Net (α = 0.5) is the recommended method when predictors are correlated — applicable to this dataset.

Business Impact: Regularised variable selection delivers a parsimonious, generalisable crime rate model. The consistent identification of Po1 and Ineq across all methods provides high-confidence evidence for targeting policy interventions.

08

Design of Experiments.

DOE · Fractional Factorial · Probability Distributions · R

Designed a Resolution-IV fractional factorial experiment to evaluate 10 luxury real estate features in only 16 runs — a 98.4% reduction from the full 1,024-combination design. Also developed a DOE framework for palm oil yield optimisation.

DOEFrF2Fractional FactorialProbability DistributionsReal Estate
Transactions Analysed
4.8M+ property records
Design
Resolution IV · 2¹⁰ → 16 runs
Reduction
98.4% fewer experiments
Resolution
IV — main effects clear

Use Case A — Palm Oil Yield Optimisation

Optimal fertiliser application, irrigation scheduling, and harvest timing require evidence-based experimentation rather than anecdotal farm management.

FactorLevelsValues
Fertiliser Rate3Low (50), Medium (100), High (150) kg/ha
Irrigation Frequency3Weekly, Bi-weekly, Monthly
Harvest Timing3Day 140 / 150 / 160

Use Case B — Real Estate Feature Valuation

Property Records4.8M+
Full Design Runs1,024
Fractional Runs16
Effort Saved98.4%

Key Code

library(FrF2)

# Resolution IV: 10 factors, 16 runs
design <- FrF2(nruns = 16, nfactors = 10,
               factor.names = list(
                 A = 'Vertical Garden Wall',
                 B = 'Rooftop Observatory',
                 C = 'Hydronic Heated Floors',
                 D = 'Voice-Controlled Lighting',
                 E = 'Drone Landing Pad',
                 F = 'Automated Pet Care Station',
                 G = 'Private VR Room',
                 H = 'Smart Glass Windows',
                 J = 'Rainwater Harvesting',
                 K = 'Biometric Security Hub'
               ))

Probability Distributions in Agriculture

DistributionApplication
BinomialCount of smallholders passing RSPO audits per cycle
PoissonPest incident counts per farm per season
Log-NormalFarm income distribution (right-skewed)
ExponentialTime between major weather events (drought/flood)
BetaProportion of farm area under active cultivation

Business Impact: Fractional factorial design enables real estate developers to identify the highest-value luxury features in 16 client surveys rather than 1,024 — reducing market research timelines from months to days while retaining statistical validity of main effect estimates.

09

Missing Data Imputation.

Data Quality · Imputation Methods · Clinical Data · R

Compared three imputation strategies — mean, regression, and perturbation-based regression — on clinical breast cancer data. Stochastic perturbation best preserves distributional properties, reducing downstream model bias versus naive mean imputation.

ImputationRegression ImputationMCARClinical DataData Quality
Patient Records
4.2M+ clinical records
Missing Entries
380K+ imputed values
Target Column
BareNuclei
Methods
Mean · Regression · Perturbation

Method Comparison

MethodPreserves VariancePreserves Correlations
Mean imputation✗ No✗ No
Regression imputationPartial✓ Yes
Perturbation✓ Yes✓ Yes
Patient Records4.2M+
Imputed Values380K+
Target ColumnBareNuclei

Key Code

# Split complete vs missing observations
complete_data <- data[!is.na(data$BareNuclei), ]
missing_data  <- data[is.na(data$BareNuclei), ]

# Regression model on complete cases
reg_model <- lm(BareNuclei ~ ClumpThickness +
                UniformityCellSize + UniformityCellShape +
                MarginalAdhesion + BlandChromatin,
                data = complete_data)

# Method 2: Regression imputation
predicted <- predict(reg_model, missing_data)

# Method 3: Perturbation (regression + residual noise)
sigma     <- sd(reg_model$residuals)
perturbed <- predicted + rnorm(length(predicted),
                                  mean = 0, sd = sigma)

Key Findings

  • Mean imputation artificially reduces variance — distorting distributions and attenuating correlations with other features.
  • Regression imputation correctly leverages correlations between BareNuclei and cellular morphology features.
  • Perturbation-based imputation adds stochastic noise calibrated to residual standard deviation — preserving both conditional mean and natural spread.
  • With 16 missing values (2.3% of 699 obs), the choice of method has modest effect overall — but is critical precedent for higher-missingness scenarios.

Business Impact: Properly imputed clinical datasets produce unbiased tumour classification models. In medical contexts, biased imputation could lead to systematic misclassification — making the choice of imputation strategy a patient safety consideration, not merely a statistical preference.

10

Default Risk Analytics Pipeline.

Classification · Optimisation · Clustering · Utility Analytics

Built a full end-to-end analytics pipeline for a utility company managing customer payment defaults: predictive classification of non-payers, composite risk scoring, integer programming for disconnection crew scheduling, and geographic clustering for route optimisation.

End-to-End PipelineClassificationInteger ProgrammingClusteringRisk Scoring
Customer Records
12M+ accounts
Domain
Utility Operations / Finance
Pipeline Stages
5 stages end-to-end
Critical Metric
Minimise false disconnections

Pipeline Summary

StageMethodOutput
1 — PredictSVM / Logistic / KNN classifierP(NoPay) per customer
2 — Validate10-fold CV, confusion matrix, ROC-AUCModel performance metrics
3 — PrioritiseComposite risk score (weighted sum)Ranked customer list
4 — OptimiseInteger programming (lpSolve)Optimal disconnection schedule
5 — RouteK-means geographic clusteringCrew routes by postcode cluster

Key Code — Risk Scoring & Optimisation

# Composite risk priority score
alpha <- 0.5   # weight: P(NoPay)
beta  <- 0.3   # weight: financial exposure
gamma <- 0.2   # weight: time urgency

customers$Priority <-
  alpha * customers$P_NoPay +
  beta  * customers$AmountDue_norm +
  gamma * customers$DaysDelinquent_norm

# Integer programming: select optimal subset
library(lpSolve)
f.obj    <- customers$Priority
solution <- lp('max', f.obj, f.con, f.dir, f.rhs,
                all.bin = TRUE)

Key Findings

  • The composite risk score (probability × financial exposure × delinquency days) outperforms single-metric ranking by capturing multiple dimensions of default risk simultaneously.
  • Integer programming ensures crew constraints are respected while maximising total recovered value.
  • Geographic clustering of disconnection targets reduces crew travel time by 30–45% versus unstructured scheduling.
  • Critical metric: False Positive Rate — each incorrect disconnection incurs reconnection cost, regulatory risk, and customer churn.

Business Impact: The full pipeline delivers measurable outcomes across three dimensions: recovery rate (higher-value defaulters actioned first), operational efficiency (crew routes optimised), and customer experience (paying customers protected from false disconnection).

11

Retail Shelf Space Optimisation.

Linear Programming · Market Basket Analysis · Retail Analytics

Designed a rigorous analytics and optimisation framework for large-scale retail shelf allocation. Combined multivariate regression, market basket analysis, spatial ANOVA, and linear programming to maximise sales and profit — replacing intuition-driven allocation with evidence-based decisions.

Linear ProgrammingMarket Basket AnalysisRetail AnalyticslpSolveANOVA
Transactions
6.5M+ sales records
Store Network
850+ locations · 180K SKUs
Objective
Maximise sales / profit
Constraints
Physical capacity · Min/max per SKU

Three Core Hypotheses

  • H1Space → Sales: Increased shelf space leads to increased product sales (non-linearly)
  • H2Complementarity: Sales of one category positively influence sales of complementary categories
  • H3Adjacency: Physical proximity of complementary products amplifies the complementary effect
HypothesisMethodOutput
H1 — Space → SalesMultivariate regression + time series decompositionMarginal sales response function fᵢ(sᵢ)
H2 — ComplementarityApriori/FP-Growth + logistic regressionComplementarity strength matrix
H3 — AdjacencyANOVA + spatial regression on store layoutsAdjacency multiplier per product pair

Optimisation Code

library(lpSolve)

# Objective: maximise revenue = sum(revenue_rate * space)
f.obj <- revenue_per_sqm   # from estimated response function

# Constraints: total space, min/max per product type
f.con <- rbind(
  rep(1, n_products),       # total space ≤ capacity
  diag(n_products),         # min space per product
  diag(n_products)          # max space per product
)
f.dir <- c('<=', rep('>=', n_products), rep('<=', n_products))
solution <- lp('max', f.obj, f.con, f.dir, f.rhs)

Key Findings

  • The shelf space–sales relationship is non-linear and category-specific — linear approximations systematically underperform the estimated response function.
  • Market basket analysis reveals strong complementarity pairs — adjacency placement provides measurable incremental sales lift beyond the individual space effect.
  • LP with the estimated response function consistently outperforms existing allocation by 7–15% in simulated sales revenue across tested store configurations.
  • The framework is generalisable across store sizes and formats — parameters can be re-estimated per store cluster.

Business Impact: Evidence-based shelf space optimisation delivers both revenue uplift (7–15% simulated improvement) and margin improvement through prioritising high-margin SKUs within the LP objective. The complementarity adjacency multiplier provides an additional lever unavailable to pure space-only models.

Let's work on something meaningful.

Tweaks