If your client says they have an abundance of good-quality data, take it with a pinch of salt. Data collection and processing are cost- and time-intensive tasks, and there is always a chance that some data within the organization is of lower quality than other data sets. In time series, these problems are often compounded by the time element. Challenges in the data force organizations to recalculate metrics and amend historical data, source additional data and build a mechanism to store it, and reconcile data when definitions differ or a new data source doesn't tie back to the previous ones. Data collection and processing may not be difficult for an organization if the requirement is only for data going forward; recalibrating historical data across a long time period, on the other hand, presents its own challenges, as shown in the following diagram:
Challenges in data
Influencer variables
The relationship between the forecasted, or dependent, variable and the influencer, or independent, variables changes over time. For example, most forecasting models weren't able to predict the global economic crash of 2008. Post-crash, modelers in a leading bank tried to rebuild their models with new variables that could predict behavior better. Some of these new influencer variables weren't available in the central database, so a vendor was selected to provide historical data and an ongoing feed to ensure these variables would be available in the future. In this case, because the influencer variables changed, the modeler had to look outside the scope of the variables available in the central database and find better-fitting variables.
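As a minimal sketch of how a modeler might screen candidate influencer variables, the snippet below compares how each candidate correlates with the dependent variable before and after a structural break such as the 2008 crash. The file name and column names (portfolio_history.csv, default_rate, unemployment, and so on) are assumptions for illustration, not variables from any specific data set.

```
import pandas as pd

# Hypothetical monthly panel: a date index, the dependent variable, and
# candidate influencer variables (the file and column names are assumptions).
df = pd.read_csv("portfolio_history.csv", parse_dates=["month"], index_col="month")

dependent = "default_rate"
candidates = ["unemployment", "house_price_index", "interbank_spread"]

# Split the history around the structural break.
pre = df.loc[:"2008-08"]
post = df.loc["2009-01":]

# Correlation of each candidate with the dependent variable, before and after.
comparison = pd.DataFrame({
    "pre_crash_corr": pre[candidates].corrwith(pre[dependent]),
    "post_crash_corr": post[candidates].corrwith(post[dependent]),
})
comparison["shift"] = comparison["post_crash_corr"] - comparison["pre_crash_corr"]

# Show the variables whose relationship with the target changed the most.
comparison = comparison.loc[comparison["shift"].abs().sort_values(ascending=False).index]
print(comparison)
```

A large shift in correlation is only a first screen, but it helps justify sourcing a new variable from a vendor when nothing in the central database fits the post-crash behavior.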
Definition changes
A financial institution recently changed its definition of default as it moved from the standardized approach to the advanced internal ratings-based (IRB) approach, potentially reducing its capital requirements and credit risk. The change in the Basel default definition meant that the institution's entire historical database needed to be modified. In this case, the new default definition is calculated using at least six other variables that need to be stored and checked for data quality across several years.
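The sketch below shows, in hedged form, what such a recalculation can look like: a new default flag is rebuilt across the full account history from several underlying variables, after first checking that those variables are actually populated. The file name, column names, and the rule itself are illustrative assumptions, not the institution's actual IRB default definition.

```
import pandas as pd

# Hypothetical monthly account history; the column names and the rule below
# are illustrative, not a real institution's IRB default definition.
hist = pd.read_csv("account_history.csv", parse_dates=["month"])

required = ["days_past_due", "exposure", "restructured_flag",
            "bankruptcy_flag", "collateral_sold_flag", "writeoff_amount"]

# Data-quality check: the new definition depends on several variables that
# must be populated across the full history before recalculation.
missing_share = hist[required].isna().mean()
print(missing_share[missing_share > 0])

# Recalculate the default flag across all historical months under the new rule
# (90+ days past due above an illustrative materiality threshold, or any
# unlikeliness-to-pay event captured by the 0/1 flag columns).
hist["default_new"] = (
    ((hist["days_past_due"] >= 90) & (hist["exposure"] > 100))
    | hist[["restructured_flag", "bankruptcy_flag", "collateral_sold_flag"]].any(axis=1)
    | (hist["writeoff_amount"] > 0)
)
```

The expensive part in practice is not the flag itself but ensuring every input variable exists, is clean, and is consistent across years of history.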
Granularity required
After changing the influencer variables and running the models for a couple of years, a modeler was informed by the data team that the central bank was planning to stop providing granular data for one of the modeling variables; the metric would still be published, but only in an aggregated form. This may impact the usability of such data for modeling purposes. A change in a variable in a regulatory environment has multiple overheads. In this scenario, the modeler would have to engage with the data team to understand which variable can be used as a substitute. The IT team would then have to ensure that the variable (if not already available as a regular feed) is made available to the data team and the modeler. The material impact of the change on the modeling output would also have to be studied and documented. There might be an instance where a model governance team has to be notified of changes to the model. In an ideal governance environment, if code changes are needed to accommodate a new variable, testing would be undertaken before any change is implemented. A simple change in variable granularity can trigger multiple subsequent tasks; the sketch below illustrates the impact-assessment step.
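One hedged way to study and document the material impact is to refit the model with the substitute variable and compare the fit against the current model. The data file and column names (model_data.csv, granular_rate, aggregated_rate, unemployment, default_rate) are hypothetical, and ordinary least squares via statsmodels stands in for whatever model the institution actually uses.

```
import pandas as pd
import statsmodels.api as sm

# Hypothetical modeling dataset; 'granular_rate' is the variable the central
# bank will stop publishing, 'aggregated_rate' is the proposed substitute.
data = pd.read_csv("model_data.csv", parse_dates=["month"], index_col="month")

def fit(features):
    # Refit the same model specification on a different set of drivers.
    X = sm.add_constant(data[features])
    return sm.OLS(data["default_rate"], X).fit()

current = fit(["granular_rate", "unemployment"])
substitute = fit(["aggregated_rate", "unemployment"])

# Evidence for the governance team: how much does the substitution change the fit?
print("Current model R^2:   ", round(current.rsquared, 4))
print("Substitute model R^2:", round(substitute.rsquared, 4))
print("Max absolute difference in fitted values:",
      (current.fittedvalues - substitute.fittedvalues).abs().max())
```

The printed comparison is the kind of material that would be attached to the governance notification and retained alongside the test evidence before the change is implemented.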
Legacy issues
Legacy issues may mean that some information isn't available in the central database. This could be because a system was only recently upgraded and, before the upgrade, didn't have the capability to capture that data.
System differences
System differences arise because of users' behavior or the way systems process data. Users of a telephone banking customer relationship management (CRM) system, for example, may not capture the details of incoming callers, whereas branch officers using a different CRM frontend will. The branch data may only be sparsely available (captured only when the customer divulges their income, for instance), but it has the potential to be more accurate.
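A minimal sketch of reconciling the two feeds is shown below: it prefers the sparser but more accurate branch-captured income where it exists and falls back to the telephone CRM value otherwise. The file names and the customer_id and income columns are assumptions for illustration.

```
import pandas as pd

# Hypothetical extracts from the two CRM frontends; file and column names are assumptions.
phone = pd.read_csv("telephone_crm.csv")   # rarely captures income
branch = pd.read_csv("branch_crm.csv")     # sparse, but more accurate when present

merged = phone.merge(
    branch[["customer_id", "income"]],
    on="customer_id", how="left", suffixes=("_phone", "_branch"),
)

# Prefer the branch-captured income where it exists, fall back to the phone value.
merged["income"] = merged["income_branch"].combine_first(merged["income_phone"])

# How sparse is each source, and how complete is the reconciled field?
print(merged[["income_phone", "income_branch", "income"]].notna().mean())
```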
Source constraints
Source constraints can arise simply because of the way some systems are designed. A system designed to store data at a customer level, for example, will not be able to efficiently store data that has been aggregated at an account level. Vendor changes may affect data quality or how frequently data is available. Organizations also have differing archival policies, so if a modeler wants to use time series data going back as far as a decade, some of it may already have been archived and retrieval may be a time-consuming affair.
Vendor changes
Most organizations end up using a particular vendor for a long period of time. In some instances, it is the bank's transactional data system; in others, the CRM tool or the data mining software. In all of these cases, there is a dependency on the vendor. Most vendors want to develop a relationship with their client and grow alongside them, renegotiating contracts over time while providing greater functionality and more scalable software. There are times, however, when a client and a vendor decide to part ways. Be mindful that this can be a painful exercise and a lot of things can go wrong. Such a transition might lead to temporary or long-term breaks in the data: some systems may be switched off, and the new systems replacing them may not capture data in the same manner. A vendor who supplied customer leads or risk ratings may no longer be under contract, and data quality may even suffer under a new vendor. Businesses and modelers need to be aware of these challenges.
Archiving policy
Most organizations have an archiving and retrieval policy in place. However, retrieval can take time and may also increase the cost of sourcing data. These constraints can easily deter a modeler from exploring historical data that has already been archived. The archiving policy should therefore be based on the importance of the data, the regulatory requirements, and the ease of data retrieval.