LION COMMUNITY USAGE CASE

Marketing automation for banks via SVM and cross-validation.

This is an exercise associated with the CaltechX: CS1156x Learning From Data (introductory Machine Learning course) by Caltech Professor Yaser Abu-Mostafa.
See our dedicated LIONsolver page for more resources.

Predicting success of a marketing effort

The data come from the direct marketing campaigns of a bank, conducted through phone calls. Often more than one contact with the same potential customer was required to determine whether the product (a bank term deposit) would be bought. The goal is to predict whether the client will subscribe (variable y). With a reliable prediction, the marketing department can focus on the most promising leads and increase the overall ROI of the campaign.


There are two datasets:
1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).
2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.
The smaller dataset can be used to speed up the tuning of the SVM model with cross-validation.

Data about customers:

Bank client data:
  • 1 - age (numeric)
  • 2 - job : type of job (categorical: 'admin.','unknown','unemployed','management','housemaid','entrepreneur','student', 'blue-collar','self-employed','retired','technician','services')
  • 3 - marital : marital status (categorical: 'married','divorced','single'; note: 'divorced' means divorced or widowed)
  • 4 - education (categorical: 'unknown','secondary','primary','tertiary')
  • 5 - default: has credit in default? (binary: 'yes','no')
  • 6 - balance: average yearly balance, in euros (numeric)
  • 7 - housing: has housing loan? (binary: 'yes','no')
  • 8 - loan: has personal loan? (binary: 'yes','no')
Data related to the last contact of the current campaign:
  • 9 - contact: contact communication type (categorical: 'unknown','telephone','cellular')
  • 10 - day: last contact day of the month (numeric)
  • 11 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  • 12 - duration: last contact duration, in seconds (numeric)
Other attributes:
  • 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  • 15 - previous: number of contacts performed before this campaign and for this client (numeric)
  • 16 - poutcome: outcome of the previous marketing campaign (categorical: 'unknown','other','failure','success')
Output variable (desired target):
  • 17 - y: has the client subscribed to a term deposit? (binary: 'yes','no')
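
An SVM works on numeric vectors, so the categorical attributes listed above have to be turned into numbers before training. The following is a minimal sketch of one common option, one-hot encoding, written in plain Python. It assumes that the file is semicolon-separated with a header row; the two sample rows are invented for illustration and are not taken from bank.csv, and the names load_rows and encode are hypothetical helpers, not part of LIONoso.

```python
import csv
import io

# Two invented rows in the assumed layout (semicolon-separated, header first).
SAMPLE = """age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
41;technician;single;tertiary;no;1200;yes;no;cellular;5;may;180;2;-1;0;unknown;no
52;retired;married;secondary;no;350;no;no;telephone;17;nov;420;1;90;3;success;yes
"""

NUMERIC = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
TARGET = "y"


def load_rows(text):
    """Parse the semicolon-separated text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def encode(rows):
    """Return (feature_names, X, y) with categorical columns one-hot encoded."""
    categorical = [c for c in rows[0] if c not in NUMERIC and c != TARGET]
    # Observed values of each categorical column, in sorted order.
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    names = NUMERIC + [f"{c}={v}" for c in categorical for v in levels[c]]
    X = []
    for r in rows:
        numeric_part = [float(r[c]) for c in NUMERIC]
        onehot_part = [1.0 if r[c] == v else 0.0
                       for c in categorical for v in levels[c]]
        X.append(numeric_part + onehot_part)
    # Map the target to the usual SVM labels +1 / -1.
    y = [1 if r[TARGET] == "yes" else -1 for r in rows]
    return names, X, y


names, X, y = encode(load_rows(SAMPLE))
```

To use it on the real file, the text of bank.csv would be passed to load_rows instead of SAMPLE. Numeric columns such as balance and duration have very different ranges, so rescaling them before training is usually advisable.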

Determining the best model with cross-validation in SVM

The objective is the best possible prediction. But prediction results depend on the model, on the procedure for learning from the data, and on the many parameters regulating both. In this case the model is a Support Vector Machine (SVM) based on LIBSVM.
The multiple cross-validations tool in the SVM factory component dialogue of LIONoso determines the best model automatically: you do not need to bother with details, just pick the model that the tool reports as best and use it for your predictions. After you click on multiple cross-validations and the analysis completes, the full table of results is available. By right-clicking on the SVM factory node you can immediately fit the obtained performance data with a polynomial and plot them with an output sweeper plot for a visual rendering of the results.
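
For readers who want to see what a single cross-validation run computes, here is a minimal sketch in plain Python. It is not LIONoso's implementation: the helper names k_fold_indices and cross_validate are hypothetical, and the learner is a deliberately simple threshold rule standing in for the SVM, on made-up one-dimensional data.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, k=5, seed=0):
    """Average held-out accuracy, where fit(X_train, y_train) returns a predictor."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def fit_threshold(X_train, y_train):
    """Stand-in learner: cut halfway between the two class means of feature 0."""
    pos = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    neg = [x[0] for x, t in zip(X_train, y_train) if t == -1]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > cut else -1


# Made-up, well-separated data: ten negatives near 0, ten positives near 100.
X = [[float(v)] for v in range(10)] + [[float(v)] for v in range(100, 110)]
y = [-1] * 10 + [1] * 10
score = cross_validate(X, y, fit_threshold, k=5)
```

Each example is held out exactly once, so the averaged score estimates how the learner behaves on data it has not seen. Repeating this for many parameter settings gives a table of the kind described above.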


(Figure: running multiple cross-validation)

The best parameters for the model can be determined iteratively. First, run multiple cross-validation to search in the neighborhood of the "central" values written in the component dialogue. Then read the best parameter values from the resulting table (sort on the "accuracy" column to bring the highest result to the top), create an additional instance of SVM factory attached to the same data, and enter the best parameters found as its default values. Finally, run multiple cross-validations again to search in the neighborhood of the previous best values.
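
The iterative procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LIONoso's algorithm: the two tuned parameters are assumed to be the cost C and the RBF kernel width gamma, the neighborhood is assumed to be a small logarithmic grid around the central values, and toy_score is a hypothetical stand-in for the cross-validated accuracy that a real run would measure.

```python
import math


def neighbourhood(center, factor=4.0, steps=1):
    """Values center * factor**j for j in -steps..steps (a logarithmic grid)."""
    return [center * factor ** j for j in range(-steps, steps + 1)]


def search_round(score, c_center, gamma_center, factor):
    """Score every (C, gamma) pair around the centre and return the best row."""
    table = [(score(c, g), c, g)
             for c in neighbourhood(c_center, factor)
             for g in neighbourhood(gamma_center, factor)]
    return max(table)  # same as sorting the table on accuracy and taking the top


def iterative_search(score, c0, gamma0, rounds=3, factor=4.0):
    """Re-centre the grid on the best pair found so far, for a few rounds."""
    c, g = c0, gamma0
    best = None
    for _ in range(rounds):
        best = search_round(score, c, g, factor)
        _, c, g = best
    return best


def toy_score(c, g):
    """Hypothetical accuracy surface that peaks at C=16, gamma=0.25."""
    return 1.0 - 0.01 * ((math.log2(c) - 4) ** 2 + (math.log2(g) + 2) ** 2)


acc, best_c, best_gamma = iterative_search(toy_score, c0=1.0, gamma0=1.0)
```

In this toy run the first round moves the centre from (1, 1) to (4, 0.25) and the second reaches (16, 0.25), which the third round confirms. With real data, score would wrap a cross-validation run for the given parameter pair.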


(Figure: running multiple cross-validation in an iterative manner: workbench)


(Figure: running multiple cross-validation in an iterative manner: dashboard)

Warning: there exists a trivial solution giving an accuracy of 88.40% on the bank.csv file. Which one? How can we avoid the model being "trapped" in this trivial solution? Which modified error measure should we consider?
Hint: think about how easy it is to predict the weather in Los Angeles with high accuracy :)
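
One way to explore these questions is to compare plain accuracy with a measure that weighs the two classes equally. The sketch below uses made-up labels (88 'no' and 12 'yes', not the actual counts in bank.csv) and balanced accuracy, which is only one possible choice of modified measure, not necessarily the one the exercise has in mind.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so that each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        members = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in members) / len(members))
    return sum(recalls) / len(recalls)


# Made-up, imbalanced labels and a predictor that always answers 'no'.
y_true = ["no"] * 88 + ["yes"] * 12
always_no = ["no"] * len(y_true)

plain = accuracy(y_true, always_no)
balanced = balanced_accuracy(y_true, always_no)
```

On these labels the constant predictor reaches a plain accuracy of 0.88 but a balanced accuracy of only 0.5, the same value that random guessing between the two classes would obtain on average.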

Download the LIONoso-ready file: marketing_cross_validation.lion
Download additional training and test files:

References:

Data collected from the UCI Machine Learning Repository, Bank Marketing Data Set (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
The full dataset was described and analyzed in: S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS.
[AMLbook] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. Learning From Data. AMLBook, 2012.