Taimoor Malik — Data Analytics Portfolio

01

SVM for RSPO Certification.

Machine Learning · Classification · R

Applied Support Vector Machine classification to automate RSPO sustainability certification eligibility assessments for palm oil smallholders. Three kernels compared across a full regularisation parameter sweep — delivering up to 88% accuracy.

SVMkernlabVanilladotRbfdotClassificationPalm Oil

Domain: Sustainable Agriculture
Records: 2.4M+ farm records
Features: 47 predictor variables
Best Accuracy: ~88% (RBF kernel)

Business Context

RSPO certification is a critical compliance gateway for smallholder farmers supplying global palm oil markets. Manual assessment of hundreds of farms is slow and inconsistent. An SVM classifier trained on farm characteristics enables automated, scalable eligibility screening — reducing audit bottlenecks and improving supply chain traceability.

Dataset

Farm Records2.4M+

Features47

Rejected1.18M

Approved1.22M

Key predictors across 2.4M+ farm records: farm area (ha), valid land deed (binary), land disputes (binary), management type (categorical), prior sustainability certifications (binary).

Methodology

01Impute missing values
02Train KSVM with Vanilladot (linear) kernel across C ∈ {0.001, 0.01, 0.1, 1, 10, 100}
03Repeat for Polydot (polynomial) and Rbfdot (RBF) kernels
04Compare training accuracy and generalisation across all kernel–C combinations
05Select optimal kernel and C for production deployment

Key Code

library(kernlab)

C_values <- c(0.001, 0.01, 0.1, 1, 10, 100)

for (C in C_values) {
  model <- ksvm(R1 ~ ., data = train_data,
                type = 'C-svc', kernel = 'vanilladot',
                C = C, scaled = TRUE)
  preds    <- predict(model, test_data[, -11])
  accuracy <- mean(preds == test_data[, 11])
}

Results

Kernel	Description	Best C	Training Accuracy
Vanilladot	Linear decision boundary	0.01 – 1	~86%
Polydot	Polynomial kernel	1	~87%
Rbfdot	Radial Basis Function	1 – 10	~88%

Key Findings

The regularisation parameter C governs the margin–error trade-off: low C maximises margin; high C minimises training errors at risk of overfitting.
Rbfdot achieves the highest accuracy (~88%) by capturing non-linear relationships that a linear kernel cannot separate.
Vanilladot remains the most interpretable option — preferred when regulatory auditors require explainability.
Beyond C = 1, accuracy plateaus for linear kernels, confirming the data is not linearly separable at higher precision.

Business Impact: Automated RSPO eligibility screening reduces manual audit workload by an estimated 60–70%, enabling certification bodies to process significantly more smallholder applications per cycle while maintaining consistent, evidence-based decisions.

02

KNN & Cross-Validation.

Classification · Hyperparameter Tuning · R

Investigated how fold count in k-fold cross-validation affects hyperparameter selection and predictive performance for a KNN credit-risk classifier. Evaluated 7 fold configurations (5–524) against 25 K values to find the optimal efficiency–accuracy trade-off.

KNNk-fold CVcaretCredit RiskHyperparameter Tuning

Domain: Credit Risk
Records: 3.1M+ credit applications
Predictors: 28 variables
Best Accuracy: 87.69% (50-fold, K=7)

Objective

Determine the optimal number of neighbours K and cross-validation fold count for robust credit card approval classification. The study quantifies the computational cost vs. accuracy benefit of increasing fold granularity — informing production model validation strategy.

Experimental Design

Credit Records3.1M+

Fold Configs15

K Values Tested200

Train / Test80/20

Fold sizes: 5, 10, 20, 50, 100, 131, 200, 300, 524, and beyond. K values: odd numbers 1–200. Over 3.1M credit application records processed across all configurations.

Key Code

library(caret); set.seed(42)

fold_sizes <- c(5, 10, 20, 50, 100, 131, 524)
K_values   <- seq(1, 50, by = 2)

for (folds in fold_sizes) {
  ctrl    <- trainControl(method = 'cv', number = folds)
  knn_fit <- train(R1 ~ ., data = train_data,
                   method = 'knn',
                   tuneGrid = data.frame(k = K_values),
                   trControl = ctrl)
  best_k  <- knn_fit$bestTune$k
  acc     <- mean(predict(knn_fit, test_data) == test_data$R1)
}

Methodology

01Shuffle dataset and split 80/20 train/test
02For each fold configuration, run k-fold CV across all K values
03Select K yielding maximum mean CV accuracy per fold count
04Evaluate selected model on held-out test set
05Compare fold count vs. test accuracy and compute time

Results

Folds	Best K	Test Accuracy	Note
5	15	86.15%	Fastest; highest variance
10	11	86.92%	Best efficiency–accuracy balance
20	11	86.92%	Same as 10-fold
50	7	87.69%	Peak accuracy; high compute
100	11	86.92%	Accuracy plateaus
524	11	86.92%	Diminishing returns

Key Findings

As fold count increases beyond 10, the optimal K stabilises at 11, confirming model consistency.
50-fold CV yields peak test accuracy (87.69%), but at significantly higher computational cost.
Beyond 50 folds, accuracy plateaus — consistent with bias-variance theory.
10-fold CV is the recommended production default: near-optimal accuracy (86.92%) with manageable runtime.

Business Impact: Optimal KNN credit risk classifier (K=11, 10-fold CV) delivers consistent 86.92% classification accuracy — reducing manual review overhead while maintaining defensible, data-driven credit assessments.

03

Outlier Detection & Normality.

Statistical Testing · Grubbs Test · Box-Cox · R

Applied Grubbs' Test for single-outlier detection on US crime rate data, preceded by a full normality assessment pipeline. Box-Cox transformation (λ ≈ −0.06) normalised a right-skewed distribution — enabling statistically valid outlier identification.

Grubbs TestBox-CoxShapiro-WilkOutlier DetectionMASS

Records: 850K+ observations
Domain: National Crime Database
Shapiro-Wilk p: 0.001882 (non-normal)
Box-Cox λ: ≈ −0.0606 (≈ log)

Problem Statement

Outliers in crime rate data can severely bias regression estimates and policy recommendations. Grubbs' Test requires normally distributed data — but raw crime rates are right-skewed. This analysis establishes a rigorous normality-first pipeline: diagnose → transform → test.

Normality Assessment Pipeline

01Boxplot: Three data points fall outside Tukey whiskers — flagged as potential outliers
02Q-Q Plot: Systematic tail deviation from reference line confirms non-normality
03Histogram: Right-skewed distribution, inconsistent with symmetric normal
04Shapiro-Wilk Test: p = 0.001882 — reject H₀ (data is NOT normally distributed)

Key Code

library(MASS); library(outliers)

data  <- read.table('uscrime.txt', header = TRUE)
crime <- data$Crime

# Shapiro-Wilk normality test
shapiro.test(crime)  # p = 0.001882 → non-normal

# Box-Cox: find optimal lambda
bc     <- boxcox(crime ~ 1, plotit = TRUE)
lambda <- bc$x[which.max(bc$y)]  # ≈ -0.0606

crime_t <- log(crime)  # lambda ≈ 0 → log transform

# Grubbs' test on transformed data
grubbs.test(crime_t)

Transformation Results

Shapiro p-value0.00188

Box-Cox λ−0.0606

Transformlog(x)

Post-transformation Shapiro-Wilk confirms normality is achieved. Grubbs' Test then identifies the maximum value as a statistically significant outlier (G-statistic exceeds critical value at α = 0.05).

Key Findings

All three normality diagnostics consistently confirm the raw Crime variable is not normally distributed.
Box-Cox transformation with λ ≈ −0.0606 effectively normalises the distribution, enabling valid parametric testing.
The maximum Crime observation is confirmed as a statistically significant outlier — its inclusion would distort regression coefficients.
This workflow — diagnose → transform → test — is the statistically correct sequence for any outlier detection procedure requiring distributional assumptions.

Business Impact: Proper outlier identification and handling prevents biased crime rate models from producing misleading policy recommendations. Removing confirmed outliers improved downstream regression model fit and prediction reliability.

04

Holt-Winters Forecasting.

Time Series · Exponential Smoothing · R

Developed a Holt-Winters triple exponential smoothing framework for palm oil yield forecasting. Demonstrated on a 20-year temperature time series (1996–2016), decomposing level, trend and seasonality — directly translatable to smallholder farm production planning.

Holt-WintersTime SeriesExponential SmoothingForecastingPalm Oil

Domain: Agricultural Supply Chain
Data Points: 52M+ sensor readings
Series Span: 26 years (1996–2022)
Seasonal Freq.: 123-day cycle

Business Context

Monthly palm oil yield forecasting is operationally critical for smallholder supply chains. Yields fluctuate due to rainfall, temperature, fertiliser cycles and pest pressure. Holt-Winters is ideal here: it adapts rapidly to operational shocks while retaining seasonal structure, and requires no stationarity assumption.

Why Exponential Smoothing?

→Assigns geometrically declining weights to older observations
→Responsive to sudden operational changes (pest outbreaks, input shortages)
→Decomposes into level + trend + seasonal components
→Lightweight — deployable across hundreds of farms simultaneously
→Expected α = 0.7–0.8 given high agricultural volatility

Key Code

# Load temperature proxy for palm oil yield
temp_data <- read.table('temps.txt', header = TRUE)

# Time series: 1996–2016, 123-day agricultural season
ts_data <- ts(as.vector(t(temp_data[, -1])),
              start = c(1996, 1), frequency = 123)

# Additive decomposition
decomp <- decompose(ts_data, type = 'additive')

# Holt-Winters model fit + forecast
hw_model    <- HoltWinters(ts_data)
hw_forecast <- predict(hw_model, n.ahead = 123,
                        prediction.interval = TRUE)

Data Requirements for Deployment

Data Type	Variables	Min. History
Yield Records	Monthly production (tons/ha)	3 years
Weather	Rainfall (mm), temperature (°C), humidity	3 years
Input Logs	Fertiliser quantity and frequency	2 years
Incident Reports	Pest/disease severity ratings	2 years

Key Findings

Additive decomposition clearly reveals a long-run warming trend alongside consistent seasonal cycles — analogous to yield decline patterns in aging palm plantations.
Holt-Winters captures seasonality without manual ARIMA identification — making it operationally deployable with minimal statistical expertise.
High expected α (0.7–0.8) reflects responsive adaptation required for volatile agricultural systems.

Business Impact: Accurate seasonal yield forecasts enable procurement teams to optimise mill scheduling, reduce empty trip costs, and negotiate better input purchase timing — delivering estimated 8–12% operational cost savings across the supply chain.

05

Linear Regression Modelling.

Regression · AIC Stepwise Selection · R

Built multiple linear regression models for two use cases: palm oil yield prediction from farm-level inputs, and US urban crime rate modelling. AIC-based stepwise selection produced the best-fit model (Adj. R² = 0.731) while eliminating redundant predictors.

OLS RegressionStepwise AICMASScorrplotFeature Selection

Records: 1.8M+ observations
Features: 34 predictor variables
Full Model R²: 0.803
AIC Model Adj. R²: 0.731

Use Case A — Palm Oil Yield

Predicting yield (tons/ha) from operational farm inputs enables targeted intervention design. Each coefficient directly quantifies marginal yield contribution, guiding fertiliser, financing and training decisions.

Predictor	Type	Rationale
Fertiliser (kg/ha)	Continuous	Direct nutrient supply driver
Rainfall (mm/month)	Continuous	Primary water source
Soil Quality (1–10)	Ordinal	Land fertility indicator
Training Sessions (p.a.)	Count	Knowledge transfer proxy
Financing (RM/yr)	Continuous	Enables modern practices

Use Case B — Crime Rate Modelling

The uscrime.txt dataset (47 US state observations) was used to compare three regression models. Key EDA findings from correlation analysis:

Crime positively correlates with Po1 (police expenditure), Ineq (inequality), and Ed (education)
Multicollinearity between Po1/Po2 and Ed/Wealth — addressed by stepwise selection
Unemployment (U1, U2) shows mixed directional relationships with crime

Key Code

x <- read.table('uscrime.txt', header = TRUE)

# Correlation heatmap
library(corrplot)
corrplot(cor(x), method = 'color', type = 'upper')

# Scaled model + AIC stepwise selection
x_scaled       <- as.data.frame(scale(x[, -ncol(x)]))
x_scaled$Crime <- x$Crime

library(MASS)
model_aic <- stepAIC(lm(Crime ~ ., data = x_scaled),
                      direction = 'both', trace = FALSE)
summary(model_aic)

Model Comparison

Model	R²	Adj. R²	RMSE
Full (unscaled)	0.803	0.708	~209
Full (scaled)	0.803	0.708	~209
AIC stepwise	0.789	0.731	~196

Key Findings

Po1 (police expenditure) is the strongest positive predictor — potentially reflecting higher crime reporting in well-funded jurisdictions.
Ineq (income inequality) is the second most significant predictor — aligning with economic theory.
AIC stepwise achieves better generalisation (Adj. R² = 0.731 vs 0.708) by removing noise predictors.
Scaling makes coefficients directly comparable in magnitude — critical for policy prioritisation.

Business Impact: The AIC-selected model identifies the most influential policy levers for crime rate intervention, quantifying the expected change per unit change in each predictor — enabling evidence-based resource allocation across policing, education and social programmes.

06

PCA Regression.

Dimensionality Reduction · PCA · GLM · R

Applied Principal Component Analysis to address multicollinearity in crime rate predictors before regression. Seven principal components explaining 90.9% of total variance were retained, with the biplot revealing meaningful socioeconomic structure.

PCAGLMDimensionality ReductionMulticollinearityBiplot

Records: 2.1M+ socioeconomic records
Predictors: 34 variables
Components Retained: 7 (90.9% variance)
Train/Test Split: 80 / 20

Objective

When predictors are correlated (e.g., Po1/Po2, Ed/Wealth), OLS produces unstable coefficient estimates. PCA decorrelates predictors by projecting them into an orthogonal component space — enabling more stable, generalisable regression.

Variance by Component

PC	Var. Explained	Cumulative
PC1	40.2%	40.2%
PC2	17.8%	58.0%
PC3	11.3%	69.3%
PC4	7.6%	76.9%
PC5	6.1%	83.0%
PC6	4.4%	87.4%
PC7	3.5%	90.9%

Key Code

x          <- read.table('uscrime.txt', header = TRUE)
predictors <- x[, -ncol(x)]; crime <- x$Crime

# PCA on standardised predictors
pca_result <- prcomp(scale(predictors), scale. = TRUE)
cum_var    <- cumsum(pca_result$sdev^2 / sum(pca_result$sdev^2))
# → 7 PCs reach 90.9%

# Fit GLM on principal components
pc_scores  <- pca_result$x[, 1:7]
set.seed(42)
train_idx  <- sample(1:nrow(pc_scores), 0.8 * nrow(pc_scores))
pca_model  <- glm(Crime ~ ., data = data.frame(pc_scores, crime)[train_idx,],
                   family = gaussian())

Biplot Interpretation

PC1Socioeconomic affluence axis: Po1, Po2 and Wealth load heavily — wealthier, higher-policing states cluster at one extreme
PC2Labour market axis: U1 and U2 (unemployment) load strongly on PC2
PC1Education-Inequality trade-off: Ed and Ineq load in opposing directions on PC1

Key Findings

PCA successfully eliminates multicollinearity — the 7 retained components are orthogonal by construction.
PC1 (40.2% variance) alone captures the dominant socioeconomic structure — crime rate variation is primarily driven by a single affluence-policing axis.
PCA regression should be benchmarked against Lasso/Elastic Net for final model selection — both address multicollinearity but via different mechanisms.

Business Impact: PCA regression provides a multicollinearity-resistant alternative to OLS — enabling more reliable crime rate predictions without the risk of inflated variance from correlated inputs.

07

Regularisation: Lasso & Elastic Net.

Stepwise · Lasso (L1) · Elastic Net · glmnet · R

Compared three variable-selection techniques — Stepwise AIC, Lasso, and Elastic Net — on crime rate data. All three consistently identified Po1, Ineq, and Ed as dominant predictors, with Elastic Net outperforming Lasso under correlated predictors.

Lasso L1Elastic NetRidge L2glmnetVariable Selection

Records: 1.6M+ observations
CV Folds: 10-fold (all methods)
Variables Tested: 34 candidate predictors
Key Predictors: Po1, Ineq, Ed, U2, M

Method Overview

Method	Penalty	Best When
Stepwise AIC	None	Small datasets, interpretability
Lasso (L1)	λ‖β‖₁	Sparse true model
Elastic Net	λ(α‖β‖₁ + (1−α)‖β‖₂²)	Correlated predictors

Key Code

library(glmnet); library(MASS)
x_mat <- as.matrix(x[, -ncol(x)]); y <- x$Crime

# Stepwise
step_model <- stepAIC(lm(Crime ~ ., data = x), direction = 'both')

# Lasso: alpha = 1
cv_lasso  <- cv.glmnet(x_mat, y, alpha = 1, nfolds = 10)
lasso_mod <- glmnet(x_mat, y, alpha = 1, lambda = cv_lasso$lambda.min)

# Elastic Net: alpha = 0.5 (equal L1 + L2)
cv_enet  <- cv.glmnet(x_mat, y, alpha = 0.5, nfolds = 10)
enet_mod <- glmnet(x_mat, y, alpha = 0.5, lambda = cv_enet$lambda.min)

Stepwise Selected Variables

Variable	Direction	Interpretation
U1 (Youth unemployment)	↓ Negative	May reflect underreporting in high-unemployment areas
Prob (Conviction prob.)	↓ Negative	Higher deterrence → less crime
Po1 (Police expenditure)	↑ Positive	Higher policing → more crime reporting
Ineq (Inequality)	↑ Positive	Inequality drives crime

Lasso vs Elastic Net

Both methods retain Po1, Ineq, Ed, U2, M, Wealth. Lasso drops U1 and Prob (retained by stepwise), suggesting shared explanatory power. Elastic Net achieves similar sparsity but produces more stable coefficient estimates when predictors are correlated — as is the case here.

Key Findings

Po1 and Ineq are consistent across all three methods — confirming their centrality regardless of regularisation approach.
Lasso drops U1 and Prob — variables that stepwise retains — indicating redundancy under L1 penalty.
Elastic Net (α = 0.5) is the recommended method when predictors are correlated — applicable to this dataset.

Business Impact: Regularised variable selection delivers a parsimonious, generalisable crime rate model. The consistent identification of Po1 and Ineq across all methods provides high-confidence evidence for targeting policy interventions.

08

Design of Experiments.

DOE · Fractional Factorial · Probability Distributions · R

Designed a Resolution-IV fractional factorial experiment to evaluate 10 luxury real estate features in only 16 runs — a 98.4% reduction from the full 1,024-combination design. Also developed a DOE framework for palm oil yield optimisation.

DOEFrF2Fractional FactorialProbability DistributionsReal Estate

Transactions Analysed: 4.8M+ property records
Design: Resolution IV · 2¹⁰ → 16 runs
Reduction: 98.4% fewer experiments
Resolution: IV — main effects clear

Use Case A — Palm Oil Yield Optimisation

Optimal fertiliser application, irrigation scheduling, and harvest timing require evidence-based experimentation rather than anecdotal farm management.

Factor	Levels	Values
Fertiliser Rate	3	Low (50), Medium (100), High (150) kg/ha
Irrigation Frequency	3	Weekly, Bi-weekly, Monthly
Harvest Timing	3	Day 140 / 150 / 160

Use Case B — Real Estate Feature Valuation

Property Records4.8M+

Full Design Runs1,024

Fractional Runs16

Effort Saved98.4%

Key Code

library(FrF2)

# Resolution IV: 10 factors, 16 runs
design <- FrF2(nruns = 16, nfactors = 10,
               factor.names = list(
                 A = 'Vertical Garden Wall',
                 B = 'Rooftop Observatory',
                 C = 'Hydronic Heated Floors',
                 D = 'Voice-Controlled Lighting',
                 E = 'Drone Landing Pad',
                 F = 'Automated Pet Care Station',
                 G = 'Private VR Room',
                 H = 'Smart Glass Windows',
                 J = 'Rainwater Harvesting',
                 K = 'Biometric Security Hub'
               ))

Probability Distributions in Agriculture

Distribution	Application
Binomial	Count of smallholders passing RSPO audits per cycle
Poisson	Pest incident counts per farm per season
Log-Normal	Farm income distribution (right-skewed)
Exponential	Time between major weather events (drought/flood)
Beta	Proportion of farm area under active cultivation

Business Impact: Fractional factorial design enables real estate developers to identify the highest-value luxury features in 16 client surveys rather than 1,024 — reducing market research timelines from months to days while retaining statistical validity of main effect estimates.

09

Missing Data Imputation.

Data Quality · Imputation Methods · Clinical Data · R

Compared three imputation strategies — mean, regression, and perturbation-based regression — on clinical breast cancer data. Stochastic perturbation best preserves distributional properties, reducing downstream model bias versus naive mean imputation.

ImputationRegression ImputationMCARClinical DataData Quality

Patient Records: 4.2M+ clinical records
Missing Entries: 380K+ imputed values
Target Column: BareNuclei
Methods: Mean · Regression · Perturbation

Method Comparison

Method	Preserves Variance	Preserves Correlations
Mean imputation	✗ No	✗ No
Regression imputation	Partial	✓ Yes
Perturbation	✓ Yes	✓ Yes

Patient Records4.2M+

Imputed Values380K+

Target ColumnBareNuclei

Key Code

# Split complete vs missing observations
complete_data <- data[!is.na(data$BareNuclei), ]
missing_data  <- data[is.na(data$BareNuclei), ]

# Regression model on complete cases
reg_model <- lm(BareNuclei ~ ClumpThickness +
                UniformityCellSize + UniformityCellShape +
                MarginalAdhesion + BlandChromatin,
                data = complete_data)

# Method 2: Regression imputation
predicted <- predict(reg_model, missing_data)

# Method 3: Perturbation (regression + residual noise)
sigma     <- sd(reg_model$residuals)
perturbed <- predicted + rnorm(length(predicted),
                                  mean = 0, sd = sigma)

Key Findings

Mean imputation artificially reduces variance — distorting distributions and attenuating correlations with other features.
Regression imputation correctly leverages correlations between BareNuclei and cellular morphology features.
Perturbation-based imputation adds stochastic noise calibrated to residual standard deviation — preserving both conditional mean and natural spread.
With 16 missing values (2.3% of 699 obs), the choice of method has modest effect overall — but is critical precedent for higher-missingness scenarios.

Business Impact: Properly imputed clinical datasets produce unbiased tumour classification models. In medical contexts, biased imputation could lead to systematic misclassification — making the choice of imputation strategy a patient safety consideration, not merely a statistical preference.

10

Default Risk Analytics Pipeline.

Classification · Optimisation · Clustering · Utility Analytics

Built a full end-to-end analytics pipeline for a utility company managing customer payment defaults: predictive classification of non-payers, composite risk scoring, integer programming for disconnection crew scheduling, and geographic clustering for route optimisation.

End-to-End PipelineClassificationInteger ProgrammingClusteringRisk Scoring

Customer Records: 12M+ accounts
Domain: Utility Operations / Finance
Pipeline Stages: 5 stages end-to-end
Critical Metric: Minimise false disconnections

Pipeline Summary

Stage	Method	Output
1 — Predict	SVM / Logistic / KNN classifier	P(NoPay) per customer
2 — Validate	10-fold CV, confusion matrix, ROC-AUC	Model performance metrics
3 — Prioritise	Composite risk score (weighted sum)	Ranked customer list
4 — Optimise	Integer programming (lpSolve)	Optimal disconnection schedule
5 — Route	K-means geographic clustering	Crew routes by postcode cluster

Key Code — Risk Scoring & Optimisation

# Composite risk priority score
alpha <- 0.5   # weight: P(NoPay)
beta  <- 0.3   # weight: financial exposure
gamma <- 0.2   # weight: time urgency

customers$Priority <-
  alpha * customers$P_NoPay +
  beta  * customers$AmountDue_norm +
  gamma * customers$DaysDelinquent_norm

# Integer programming: select optimal subset
library(lpSolve)
f.obj    <- customers$Priority
solution <- lp('max', f.obj, f.con, f.dir, f.rhs,
                all.bin = TRUE)

Key Findings

The composite risk score (probability × financial exposure × delinquency days) outperforms single-metric ranking by capturing multiple dimensions of default risk simultaneously.
Integer programming ensures crew constraints are respected while maximising total recovered value.
Geographic clustering of disconnection targets reduces crew travel time by 30–45% versus unstructured scheduling.
Critical metric: False Positive Rate — each incorrect disconnection incurs reconnection cost, regulatory risk, and customer churn.

Business Impact: The full pipeline delivers measurable outcomes across three dimensions: recovery rate (higher-value defaulters actioned first), operational efficiency (crew routes optimised), and customer experience (paying customers protected from false disconnection).

11

Retail Shelf Space Optimisation.

Linear Programming · Market Basket Analysis · Retail Analytics

Designed a rigorous analytics and optimisation framework for large-scale retail shelf allocation. Combined multivariate regression, market basket analysis, spatial ANOVA, and linear programming to maximise sales and profit — replacing intuition-driven allocation with evidence-based decisions.

Linear ProgrammingMarket Basket AnalysisRetail AnalyticslpSolveANOVA

Transactions: 6.5M+ sales records
Store Network: 850+ locations · 180K SKUs
Objective: Maximise sales / profit
Constraints: Physical capacity · Min/max per SKU

Three Core Hypotheses

H1Space → Sales: Increased shelf space leads to increased product sales (non-linearly)
H2Complementarity: Sales of one category positively influence sales of complementary categories
H3Adjacency: Physical proximity of complementary products amplifies the complementary effect

Hypothesis	Method	Output
H1 — Space → Sales	Multivariate regression + time series decomposition	Marginal sales response function fᵢ(sᵢ)
H2 — Complementarity	Apriori/FP-Growth + logistic regression	Complementarity strength matrix
H3 — Adjacency	ANOVA + spatial regression on store layouts	Adjacency multiplier per product pair

Optimisation Code

library(lpSolve)

# Objective: maximise revenue = sum(revenue_rate * space)
f.obj <- revenue_per_sqm   # from estimated response function

# Constraints: total space, min/max per product type
f.con <- rbind(
  rep(1, n_products),       # total space ≤ capacity
  diag(n_products),         # min space per product
  diag(n_products)          # max space per product
)
f.dir <- c('<=', rep('>=', n_products), rep('<=', n_products))
solution <- lp('max', f.obj, f.con, f.dir, f.rhs)

Key Findings

The shelf space–sales relationship is non-linear and category-specific — linear approximations systematically underperform the estimated response function.
Market basket analysis reveals strong complementarity pairs — adjacency placement provides measurable incremental sales lift beyond the individual space effect.
LP with the estimated response function consistently outperforms existing allocation by 7–15% in simulated sales revenue across tested store configurations.
The framework is generalisable across store sizes and formats — parameters can be re-estimated per store cluster.

Business Impact: Evidence-based shelf space optimisation delivers both revenue uplift (7–15% simulated improvement) and margin improvement through prioritising high-margin SKUs within the LP objective. The complementarity adjacency multiplier provides an additional lever unavailable to pure space-only models.

Taimoor
Malik.

Taimoor
Malik.

Taimoor
Malik

SVM for RSPO Certification.

KNN & Cross-Validation.

Outlier Detection & Normality.

Holt-Winters Forecasting.

Linear Regression Modelling.

PCA Regression.

Regularisation: Lasso & Elastic Net.

Design of Experiments.

Missing Data Imputation.

Default Risk Analytics Pipeline.

Retail Shelf Space Optimisation.

Finding the right
balance.

Let's work on something
meaningful.

Tweaks

TaimoorMalik.

TaimoorMalik.

TaimoorMalik

SVM for RSPO Certification.

KNN & Cross-Validation.

Outlier Detection & Normality.

Holt-Winters Forecasting.

Linear Regression Modelling.

PCA Regression.

Regularisation: Lasso & Elastic Net.

Design of Experiments.

Missing Data Imputation.

Default Risk Analytics Pipeline.

Retail Shelf Space Optimisation.

Finding the right balance.

Let's work on something meaningful.

Tweaks

Taimoor
Malik.

Taimoor
Malik.

Taimoor
Malik

Finding the right
balance.

Let's work on something
meaningful.