Compare Probability of Default Using Through-the-Cycle and Point-in-Time Models
This example shows how to work with consumer credit panel data to create through-the-cycle (TTC) and point-in-time (PIT) models and compare their respective probabilities of default (PD).
The PD of an obligor is a fundamental risk parameter in credit risk analysis. The PD of an obligor depends on customer-specific risk factors as well as macroeconomic risk factors. Because they incorporate macroeconomic conditions differently, TTC and PIT models produce different PD estimates.
A TTC credit risk measure primarily reflects the credit risk trend of a customer over the long term. Transient, short-term changes in credit risk that are likely to be reversed with the passage of time get smoothed out. The predominant features of TTC credit risk measures are their high degree of stability over the credit cycle and the smoothness of change over time.
A PIT credit risk measure utilizes all available and pertinent information as of a given date to estimate the PD of a customer over a given time horizon. The information set includes not just expectations about the credit risk trend of a customer over the long term but also geographic, macroeconomic, and macro-credit trends.
Previously, according to the Basel II rules, regulators called for the use of TTC PDs, losses given default (LGDs), and exposures at default (EADs). However, with the new IFRS 9 and proposed CECL accounting standards, institutions are now required to use PIT projections of PDs, LGDs, and EADs. By accounting for the current state of the credit cycle, PIT measures closely track the variations in default and loss rates over time.
Load Panel Data
The main data set in this example (data) contains the following variables:

ID — Loan identifier.
ScoreGroup — Credit score at the beginning of the loan, discretized into three groups: High Risk, Medium Risk, and Low Risk.
YOB — Years on books.
Default — Default indicator. This is the response variable.
Year — Calendar year.
The data also includes a small data set (dataMacro) with macroeconomic data for the corresponding calendar years:

Year — Calendar year.
GDP — Gross domestic product growth (year over year).
Market — Market return (year over year).
The variables YOB, Year, GDP, and Market are observed at the end of the corresponding calendar year. ScoreGroup is a discretization of the original credit score when the loan started. A value of 1 for Default means that the loan defaulted in the corresponding calendar year.
This example uses simulated data, but you can apply the same approach to real data sets.
Load the data and view the first 10 rows of the table. The panel data is stacked and the observations for the same ID are stored in contiguous rows, creating a tall, thin table. The panel is unbalanced because not all IDs have the same number of observations.
load RetailCreditPanelData.mat
disp(head(data,10));
    ID    ScoreGroup     YOB    Default    Year
    __    ___________    ___    _______    ____

    1     Low Risk        1        0       1997
    1     Low Risk        2        0       1998
    1     Low Risk        3        0       1999
    1     Low Risk        4        0       2000
    1     Low Risk        5        0       2001
    1     Low Risk        6        0       2002
    1     Low Risk        7        0       2003
    1     Low Risk        8        0       2004
    2     Medium Risk     1        0       1997
    2     Medium Risk     2        0       1998
nRows = height(data);
UniqueIDs = unique(data.ID);
nIDs = length(UniqueIDs);
fprintf('Total number of IDs: %d\n',nIDs)
Total number of IDs: 96820
fprintf('Total number of rows: %d\n',nRows)
Total number of rows: 646724
Default Rates by Year
Use Year as a grouping variable to compute the observed default rate for each year. Use the groupsummary function to compute the mean of the Default variable, grouping by the Year variable. Plot the results on a scatter plot, which shows that the default rate decreases as the years progress.
DefaultPerYear = groupsummary(data,'Year','mean','Default');
NumYears = height(DefaultPerYear);
disp(DefaultPerYear)
    Year    GroupCount    mean_Default
    ____    __________    ____________

    1997      35214         0.018629
    1998      66716         0.013355
    1999      94639         0.012733
    2000      92891         0.011379
    2001      91140         0.010742
    2002      89847         0.010295
    2003      88449        0.0056417
    2004      87828        0.0032905
subplot(2,1,1)
scatter(DefaultPerYear.Year,DefaultPerYear.mean_Default*100,'*');
grid on
xlabel('Year')
ylabel('Default Rate (%)')
title('Default Rate per Year')

% Get IDs of the 1997, 1998, and 1999 cohorts
IDs1997 = data.ID(data.YOB==1&data.Year==1997);
IDs1998 = data.ID(data.YOB==1&data.Year==1998);
IDs1999 = data.ID(data.YOB==1&data.Year==1999);

% Get default rates for each cohort separately
ObsDefRate1997 = groupsummary(data(ismember(data.ID,IDs1997),:),...
    'YOB','mean','Default');
ObsDefRate1998 = groupsummary(data(ismember(data.ID,IDs1998),:),...
    'YOB','mean','Default');
ObsDefRate1999 = groupsummary(data(ismember(data.ID,IDs1999),:),...
    'YOB','mean','Default');

% Plot against the calendar year
Year = unique(data.Year);

subplot(2,1,2)
plot(Year,ObsDefRate1997.mean_Default*100,'-*')
hold on
plot(Year(2:end),ObsDefRate1998.mean_Default*100,'-*')
plot(Year(3:end),ObsDefRate1999.mean_Default*100,'-*')
hold off
title('Default Rate vs. Calendar Year')
xlabel('Calendar Year')
ylabel('Default Rate (%)')
legend('Cohort 97','Cohort 98','Cohort 99')
grid on
The plot shows that the default rate decreases over time. Notice in the plot that loans starting in the years 1997, 1998, and 1999 form three cohorts. No loan in the panel data starts after 1999. This is depicted in more detail in the "Years on Books Versus Calendar Years" section of the example on Stress Testing of Consumer Credit Default Probabilities Using Panel Data. The decreasing trend in this plot is explained by the fact that there are only three cohorts in the data and that the pattern for each cohort is decreasing.
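To confirm that the panel contains only these three cohorts, you can tabulate the calendar year in which each loan starts (the year where YOB equals 1). The following check is a minimal sketch; the variable names OriginationYear and CohortCounts are introduced here for illustration only.

% Tabulate the calendar year in which each loan starts (YOB equal to 1)
OriginationYear = data.Year(data.YOB==1);
CohortCounts = groupsummary(table(OriginationYear),'OriginationYear');
disp(CohortCounts)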
TTC Model Using ScoreGroup and Years on Books
TTC models are largely unaffected by economic conditions. The first TTC model in this example uses only ScoreGroup and YOB as predictors of the default rate.
Split the existing data into training and testing data sets, which are used for model creation and validation, respectively.
NumTraining = floor(0.6*nIDs);
rng('default');
TrainIDInd = randsample(nIDs,NumTraining);
TrainDataInd = ismember(data.ID,UniqueIDs(TrainIDInd));
TestDataInd = ~TrainDataInd;
Use the fitLifetimePDModel function to fit a Logistic model.
TTCModel = fitLifetimePDModel(data(TrainDataInd,:),'logistic',...
    'ModelID','TTC','IDVar','ID','AgeVar','YOB','LoanVars','ScoreGroup',...
    'ResponseVar','Default');
disp(TTCModel.Model)
Compact generalized linear regression model:
    logit(Default) ~ 1 + ScoreGroup + YOB
    Distribution = Binomial

Estimated Coefficients:
                               Estimate       SE        tStat       pValue
                              _________    ________    _______    ___________

    (Intercept)                 -3.2453    0.033768    -96.106              0
    ScoreGroup_Medium Risk      -0.7058    0.037103    -19.023     1.1014e-80
    ScoreGroup_Low Risk         -1.2893    0.045635    -28.253    1.3076e-175
    YOB                        -0.22693    0.008437    -26.897    2.3578e-159


388018 observations, 388014 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 1.83e+03, p-value = 0
Predict the PD for the training and testing data sets using predict.
data.TTCPD = zeros(height(data),1);

% Predict the in-sample
data.TTCPD(TrainDataInd) = predict(TTCModel,data(TrainDataInd,:));
% Predict the out-of-sample
data.TTCPD(TestDataInd) = predict(TTCModel,data(TestDataInd,:));
Visualize the in-sample fit and out-of-sample fit using modelCalibrationPlot.
figure;
subplot(2,1,1)
modelCalibrationPlot(TTCModel,data(TrainDataInd,:),'Year','DataID',"Training Data")
subplot(2,1,2)
modelCalibrationPlot(TTCModel,data(TestDataInd,:),'Year','DataID',"Testing Data")
PIT Model Using ScoreGroup, Years on Books, GDP, and Market Returns
PIT models vary with the economic cycle. The PIT model in this example uses ScoreGroup, YOB, GDP, and Market as predictors of the default rate. Use the fitLifetimePDModel function to fit a Logistic model.
% Add the GDP and Market returns columns to the original data
data = join(data, dataMacro);
disp(head(data,10))
    ID    ScoreGroup     YOB    Default    Year      TTCPD       GDP     Market
    __    ___________    ___    _______    ____    _________    _____    ______

    1     Low Risk        1        0       1997    0.0084797     2.72      7.61
    1     Low Risk        2        0       1998    0.0067697     3.57     26.24
    1     Low Risk        3        0       1999    0.0054027     2.86      18.1
    1     Low Risk        4        0       2000    0.0043105     2.43      3.19
    1     Low Risk        5        0       2001    0.0034384     1.26    -10.51
    1     Low Risk        6        0       2002    0.0027422    -0.59    -22.95
    1     Low Risk        7        0       2003    0.0021867     0.63      2.78
    1     Low Risk        8        0       2004    0.0017435     1.85      9.48
    2     Medium Risk     1        0       1997     0.015097     2.72      7.61
    2     Medium Risk     2        0       1998     0.012069     3.57     26.24
PITModel = fitLifetimePDModel(data(TrainDataInd,:),'logistic',...
    'ModelID','PIT','IDVar','ID','AgeVar','YOB','LoanVars','ScoreGroup',...
    'MacroVars',{'GDP' 'Market'},'ResponseVar','Default');
disp(PITModel.Model)
Compact generalized linear regression model:
    logit(Default) ~ 1 + ScoreGroup + YOB + GDP + Market
    Distribution = Binomial

Estimated Coefficients:
                                Estimate        SE        tStat       pValue
                              __________    _________    _______    ___________

    (Intercept)                   -2.667      0.10146    -26.287    2.6919e-152
    ScoreGroup_Medium Risk      -0.70751     0.037108    -19.066     4.8223e-81
    ScoreGroup_Low Risk          -1.2895     0.045639    -28.253    1.2892e-175
    YOB                         -0.32082     0.013636    -23.528    2.0867e-122
    GDP                         -0.12295     0.039725     -3.095      0.0019681
    Market                    -0.0071812    0.0028298    -2.5377       0.011159


388018 observations, 388012 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 1.97e+03, p-value = 0
Predict the PD for training and testing data sets using predict.
data.PITPD = zeros(height(data),1);

% Predict in-sample
data.PITPD(TrainDataInd) = predict(PITModel,data(TrainDataInd,:));
% Predict out-of-sample
data.PITPD(TestDataInd) = predict(PITModel,data(TestDataInd,:));
Visualize the in-sample fit and out-of-sample fit using modelCalibrationPlot.
figure;
subplot(2,1,1)
modelCalibrationPlot(PITModel,data(TrainDataInd,:),'Year','DataID',"Training Data")
subplot(2,1,2)
modelCalibrationPlot(PITModel,data(TestDataInd,:),'Year','DataID',"Testing Data")
In the PIT model, as expected, the predictions match the observed default rates more closely than in the TTC model. Although this example uses simulated data, the same type of improvement is expected qualitatively when moving from a TTC model to a PIT model on real-world data, although the overall error might be larger than in this example: the PIT model fit is typically better than the TTC model fit, and its predictions typically track the observed rates more closely.
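To quantify this claim directly, you can compare the annual root mean squared error (RMSE) of the two sets of predictions on the testing data. This is a minimal sketch; the variable names testData, byYear, rmseTTC, and rmsePIT are introduced here for illustration only, and a similar measure is reported later in the example by modelCalibration.

% Compare observed default rates with the average predicted PDs by year
testData = data(TestDataInd,:);
byYear = groupsummary(testData,'Year','mean',{'Default','TTCPD','PITPD'});
rmseTTC = sqrt(mean((byYear.mean_Default - byYear.mean_TTCPD).^2));
rmsePIT = sqrt(mean((byYear.mean_Default - byYear.mean_PITPD).^2));
fprintf('RMSE by year -- TTC: %.6f, PIT: %.6f\n',rmseTTC,rmsePIT)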
Calculate TTC PD Using the PIT Model
Another approach for calculating TTC PDs is to use the PIT model and then replace the GDP and Market returns with the respective average values. In this approach, you use the mean values over an entire economic cycle (or an even longer period) so that only baseline economic conditions influence the model, and any variability in default rates is due to other risk factors. You can also enter forecasted baseline values for the economy that are different from the mean observed for the most recent economic cycle. For example, using the median instead of the mean reduces the error.
You can also use this approach of calculating TTC PDs with the PIT model as a tool for scenario analysis; this is not possible with the first TTC model, which has no macro inputs. The added advantage of this approach is that you can use a single model for both the TTC and PIT predictions, which means that you need to validate and maintain only one model.
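For example, the following sketch shows how you could reuse the PIT model for a simple scenario analysis. The stressed GDP and Market values are hypothetical and chosen only for illustration; they are not part of the example data, and the variable names scenarioData and scenarioPD are introduced here only for this sketch.

% Score the testing data under a hypothetical adverse macro scenario
scenarioData = data(TestDataInd,:);
scenarioData.GDP(:) = -1;      % hypothetical stressed GDP growth (%)
scenarioData.Market(:) = -20;  % hypothetical stressed market return (%)
scenarioPD = predict(PITModel,scenarioData);
fprintf('Average PD under adverse scenario: %.4f\n',mean(scenarioPD))
fprintf('Average PD under observed history: %.4f\n',mean(data.PITPD(TestDataInd)))

In contrast, the code that follows replaces the macro variables with their median and mean values over the sample period to obtain TTC PDs from the same model.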
% Modify the data to replace GDP with its median value and the Market returns with their mean value
data.GDP(:) = median(data.GDP);
data.Market(:) = mean(data.Market);
disp(head(data,10));
    ID    ScoreGroup     YOB    Default    Year      TTCPD      GDP     Market     PITPD
    __    ___________    ___    _______    ____    _________    ____    ______    _________

    1     Low Risk        1        0       1997    0.0084797    1.85    3.2263    0.0093187
    1     Low Risk        2        0       1998    0.0067697    1.85    3.2263     0.005349
    1     Low Risk        3        0       1999    0.0054027    1.85    3.2263    0.0044938
    1     Low Risk        4        0       2000    0.0043105    1.85    3.2263    0.0038285
    1     Low Risk        5        0       2001    0.0034384    1.85    3.2263    0.0035402
    1     Low Risk        6        0       2002    0.0027422    1.85    3.2263    0.0035259
    1     Low Risk        7        0       2003    0.0021867    1.85    3.2263    0.0018336
    1     Low Risk        8        0       2004    0.0017435    1.85    3.2263    0.0010921
    2     Medium Risk     1        0       1997     0.015097    1.85    3.2263     0.016554
    2     Medium Risk     2        0       1998     0.012069    1.85    3.2263    0.0095319
Predict the PD for training and testing data sets using predict.
data.TTCPD2 = zeros(height(data),1);

% Predict in-sample
data.TTCPD2(TrainDataInd) = predict(PITModel,data(TrainDataInd,:));
% Predict out-of-sample
data.TTCPD2(TestDataInd) = predict(PITModel,data(TestDataInd,:));
Visualize the in-sample fit and out-of-sample fit using modelCalibrationPlot.
f = figure;
subplot(2,1,1)
modelCalibrationPlot(PITModel,data(TrainDataInd,:),'Year','DataID',"Training, Macro Average")
subplot(2,1,2)
modelCalibrationPlot(PITModel,data(TestDataInd,:),'Year','DataID',"Testing, Macro Average")
Reset the original values of the GDP and Market variables. The TTC PD values predicted using the PIT model with the median and mean macro values are stored in the TTCPD2 column, which is used below to compare the predictions against the other models.
data.GDP = [];
data.Market = [];
data = join(data,dataMacro);
disp(head(data,10))
    ID    ScoreGroup     YOB    Default    Year      TTCPD        PITPD        TTCPD2      GDP     Market
    __    ___________    ___    _______    ____    _________    _________    _________    _____    ______

    1     Low Risk        1        0       1997    0.0084797    0.0093187     0.010688     2.72      7.61
    1     Low Risk        2        0       1998    0.0067697     0.005349    0.0077772     3.57     26.24
    1     Low Risk        3        0       1999    0.0054027    0.0044938    0.0056548     2.86      18.1
    1     Low Risk        4        0       2000    0.0043105    0.0038285    0.0041093     2.43      3.19
    1     Low Risk        5        0       2001    0.0034384    0.0035402    0.0029848     1.26    -10.51
    1     Low Risk        6        0       2002    0.0027422    0.0035259    0.0021674    -0.59    -22.95
    1     Low Risk        7        0       2003    0.0021867    0.0018336    0.0015735     0.63      2.78
    1     Low Risk        8        0       2004    0.0017435    0.0010921    0.0011422     1.85      9.48
    2     Medium Risk     1        0       1997     0.015097     0.016554     0.018966     2.72      7.61
    2     Medium Risk     2        0       1998     0.012069    0.0095319     0.013833     3.57     26.24
Compare the Models
First, compare the two versions of the TTC model.
Compare the model discrimination using modelDiscriminationPlot. The two models have very similar performance when ranking customers, as measured by the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC, or simply AUC) metric.
figure;
modelDiscriminationPlot(TTCModel,data(TestDataInd,:),"DataID",'Testing data',...
    "ReferencePD",data.TTCPD2(TestDataInd),"ReferenceID",'TTC 2, Macro Average')
However, the first TTC model is more accurate: its predicted PD values are closer to the observed default rates. The root mean squared error (RMSE) reported in the plot generated with modelCalibrationPlot confirms that the first TTC model is more accurate for this data set.
modelCalibrationPlot(TTCModel,data(TestDataInd,:),'Year',"DataID",'Testing data',...
    "ReferencePD",data.TTCPD2(TestDataInd),"ReferenceID",'TTC 2, Macro Average')
Use modelDiscriminationPlot to compare the TTC model and the PIT model. The AUROC is only slightly better for the PIT model, showing that both models are comparable at ranking customers by risk.
figure;
modelDiscriminationPlot(TTCModel,data(TestDataInd,:),"DataID",'Testing data',...
    "ReferencePD",data.PITPD(TestDataInd),"ReferenceID",'PIT')
Use modelCalibrationPlot to visualize the model accuracy, or model calibration. The plot shows that the PIT model performs much better, with predicted PD values much closer to the observed default rates. This is expected, because the PIT predictions are sensitive to the macro variables, whereas the TTC model uses only the initial score group and the age of the loan to make predictions.
modelCalibrationPlot(TTCModel,data(TestDataInd,:),'Year',"DataID",'Testing data',...
    "ReferencePD",data.PITPD(TestDataInd),"ReferenceID",'PIT')
You can use modelDiscrimination and modelCalibration to programmatically access the AUROC and the RMSE without creating a plot.
DiscMeasure = modelDiscrimination(TTCModel,data(TestDataInd,:),"DataID",'Testing data',...
    "ReferencePD",data.PITPD(TestDataInd),"ReferenceID",'PIT');
disp(DiscMeasure)
                          AUROC
                         _______

    TTC, Testing data    0.68662
    PIT, Testing data    0.69341
CalMeasure = modelCalibration(TTCModel,data(TestDataInd,:),'Year',"DataID",'Testing data',...
    "ReferencePD",data.PITPD(TestDataInd),"ReferenceID",'PIT');
disp(CalMeasure)
                                            RMSE
                                          _________

    TTC, grouped by Year, Testing data    0.0019761
    PIT, grouped by Year, Testing data    0.0006322
Although all models have comparable discrimination power, the accuracy of the PIT model is much better. However, TTC and PIT models are often used for different purposes, and the TTC model may be preferred if having more stable predictions over time is important.
References
Generalized Linear Models documentation.
Baesens, B., D. Rosch, and H. Scheule. Credit Risk Analytics. Wiley, 2016.