Over the past few decades, banks have gathered plenty of information describing the default behavior of their customers. Examples are historical information about a customer's date of birth, gender, income, employment status, etc. All this data has been stored in huge (e.g. relational) databases or data warehouses.
On top of this, banks have accumulated lots of business experience about their credit products. As an example, many credit experts do a pretty good job at discriminating between low-risk and high-risk mortgages using their business expertise only. The aim of credit scoring is to analyze both sources of data in more detail and come up with a statistically based decision model that makes it possible to score future credit applications and ultimately decide which ones to accept or reject.
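To make the idea of a statistically based decision model concrete, here is a minimal sketch of scoring an application with a logistic model and a cutoff. The feature names, coefficients, and the 20% cutoff are all hypothetical and chosen only for illustration; in practice they would be estimated from the bank's historical default data and set according to its risk appetite.

```python
import math

# Hypothetical coefficients for illustration only; a real scorecard would
# estimate them from historical default data.
INTERCEPT = -2.0
WEIGHTS = {
    "income_k": -0.03,        # yearly income in thousands (assumed feature)
    "years_employed": -0.10,  # years with current employer (assumed feature)
    "missed_payments": 0.80,  # past missed payments (assumed feature)
}


def default_probability(applicant):
    """Logistic model: P(default) = 1 / (1 + exp(-(intercept + w . x)))."""
    z = INTERCEPT + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(applicant, cutoff=0.20):
    """Accept when the estimated default probability is below the cutoff."""
    return "accept" if default_probability(applicant) < cutoff else "reject"


low_risk = {"income_k": 60, "years_employed": 5, "missed_payments": 0}
high_risk = {"income_k": 20, "years_employed": 0, "missed_payments": 3}
print(decide(low_risk), decide(high_risk))
```

The cutoff is where business expertise re-enters the picture: the model only produces a probability, and the bank decides how much risk it is willing to accept.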
The emergence of big data has created both opportunities and challenges for credit scoring. Big data is often characterized in terms of its four V's: Volume, Variety, Velocity and Veracity. To illustrate this, let's briefly zoom in on some key sources or processes generating big data.
Traditional sources are large-scale transactional enterprise systems such as OLTP (On-Line Transaction Processing), ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management) applications. Classical credit scorecards are typically constructed using data extracted from these traditional transactional systems.
The online social graph is a more recent example. Think about the major social networks such as Facebook, Twitter, LinkedIn, Weibo, and WeChat. Together, these networks capture information about close to two billion people relating to their friends, preferences, and other behavior, thereby leaving a massive digital trail of data. Also think about the Internet of Things (IoT), the emerging sensor-enabled ecosystem that is going to connect various objects (e.g. homes, cars, etc.) with each other and with humans. Finally, we see more and more open or public data such as data about weather, traffic, maps, and the macro-economy. It is clear that all these new data sources offer tremendous potential for building better credit scoring models.
All the above data-generating processes can be characterized in terms of the sheer volume of data that is being generated. Clearly, this poses serious challenges in terms of setting up scalable storage architectures, combined with a distributed approach to data manipulation and querying.
Big data usually comes in a great variety of formats. Traditional data types, or structured data, such as customer name, customer birth date, etc., are more and more complemented with unstructured data such as images, fingerprints, tweets, emails, Facebook pages, sensor data, and GPS data. Although the former can be easily stored in traditional (e.g. relational) databases, the latter needs to be accommodated using the appropriate database technology facilitating the storage, querying, and manipulation of each of these types of unstructured data. This requires a substantial effort, since it is often claimed that at least 80% of all data is unstructured.
Velocity refers to the speed at which the data is generated and needs to be stored and analyzed. Think about streaming applications such as on-line trading platforms, YouTube, SMS messages, credit card swipes, and phone calls, which are all examples where high velocity is a key concern.
Veracity indicates the quality or trustworthiness of the data. Unfortunately, more data does not automatically imply better data, so the quality of the data generating process must be closely monitored and guaranteed.
As the volume, variety, velocity, and veracity of data continue to grow, so do the new opportunities for building better credit scoring models. Think about Facebook or Twitter as an example. It is quite obvious that knowing a credit applicant's hobbies, followers, friends, likes, education, and workplace could be very beneficial to better quantify his/her creditworthiness. In other words, a customer's social standing, on-line reputation and professional connections are likely to be related to his/her credit quality. Another useful data source concerns call detail records or CDR data which capture the mobile phone usage of an applicant. Also surfing behavior could be a nice add-on.
Clearly, the availability of these big data sources creates both opportunities and challenges for credit scoring. For example, the availability of social network and CDR data may be beneficial in various settings. First, it may be useful to score customers who lack borrowing experience (e.g. because it's their first loan or they recently moved to a new country) and would be automatically perceived as risky according to traditional credit scoring models, which rely on historical information. By using these alternative data sources, a better assessment of the credit risk can be made, which can then be translated into a more favorable interest rate. This obviously gives an incentive to the customer to disclose his/her social network, CDR or other relevant data to the bank. Developing countries are another example. In these countries, banks often lack historical credit information and no local credit bureaus may be available. Hence, other data sources should be used to optimize access to credit. Given the widespread use of social networks and/or mobile phones (even in developing countries!), the data gathered might be an interesting alternative basis for credit scoring.
Obviously, using the above-mentioned data sources also comes with various challenges. The first one concerns privacy. It is important that customers are properly informed about what data is used to calculate their credit score. An opt-out option should always be provided. Furthermore, using social network data for credit scoring can trigger new fraud behavior whereby customers strategically construct their social network to artificially and maliciously inflate their credit quality. One example is that customers can easily buy Twitter followers to boost their credit scores. Finally, regulatory compliance might become an important issue. Many countries prohibit the use of gender, age, marital status, national origin, ethnicity and beliefs for credit scoring. Much of this information can be easily scraped from social networks. Hence, it may be harder to oversee regulatory compliance when using social network or other data for credit scoring.
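As a rough illustration of the compliance point, the sketch below filters prohibited attributes out of an applicant record before it reaches a scoring model. The attribute names and the `strip_prohibited` helper are hypothetical, and the actual list of prohibited characteristics depends on the jurisdiction. Note that such a filter only removes explicit fields; it does not detect proxies (e.g. group memberships that reveal beliefs), which is exactly why oversight gets harder with social network data.

```python
# Assumed list of protected attributes for illustration; the real list
# depends on the applicable regulation.
PROHIBITED = {
    "gender",
    "age",
    "marital_status",
    "national_origin",
    "ethnicity",
    "beliefs",
}


def strip_prohibited(features):
    """Return a copy of the feature dict without the prohibited attributes."""
    return {name: value for name, value in features.items()
            if name not in PROHIBITED}


# Hypothetical record scraped from a social network profile.
scraped = {"gender": "F", "age": 34, "followers": 812, "workplace": "Acme"}
print(strip_prohibited(scraped))
```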
For more information, refer to Bart's On-Line Self-Paced E-learning course on Credit Risk Modeling. Bart Baesens is a professor at KU Leuven (Belgium), and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on big data and analytics, customer relationship management, web analytics, fraud detection, and credit risk management. His findings have been published in well-known international journals (e.g. Machine Learning, Management Science, IEEE Transactions on Neural Networks, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Evolutionary Computation, Journal of Machine Learning Research) and presented at international top conferences. He is the author of the books Analytics in a Big Data World and Fraud Analytics using Descriptive, Predictive and Social Network Techniques, and teaches E-learning courses on Advanced Analytics in a Big Data World and Credit Risk Modeling. His research is summarized at www.dataminingapps.com. He also regularly tutors, advises, and provides consulting support to international firms with respect to their big data, analytics and credit risk management strategy.