A quick tour of concepts relevant to data integrity management
Making better business decisions relies on having accurate and trustworthy financial data. To help us get started, we’ll begin with several foundational concepts, which will be essential in understanding the topics in later chapters.
Levenshtein distance
With companies often dealing with transactions and records from multiple sources, utilizing string similarity algorithms such as the Levenshtein distance can help reconcile these datasets by matching similar entries especially when there are issues finding the exact match due to typos or minor discrepancies.
The Levenshtein distance, invented by Vladimir Levenshtein, measures the similarity between two strings by counting the number of edits needed to transform one word into another. It quantifies this similarity in terms of inserting, deleting, or substituting characters required for the conversion. Let us take a simple example between health
and wealth
. The computed distance is one (1
) since it will take a single edit operation to substitute h
with w
.
To help demonstrate how this metric works, here are a few more examples:
- From
rat
tocat
:1
- From
book
toback
:2
- From
saturday
tosunday
:3
- From
apple
toapricot
:5
This can be used for identifying and managing data integrity issues, as it can help detect potential duplicate entries as well as any typographic errors in the data. The Levenshtein distance can be used to check whether an account already exists in the database by looking for similar names in the database (with the search results sorted with the smallest computed distance presented first).
Note
For example, searching Sara Lat
in a database should return results including Sarah Lat
since the Levenshtein distance is very small (just a one-letter difference). Here, even if there are encoding issues, the algorithm can still identify and match similar entries despite not having an exact match. Imagine the possibilities this opens up for enhancing data retrieval accuracy, particularly in large databases where typos, abbreviations, or minor discrepancies are common! For more information on this topic, feel free to check https://en.wikipedia.org/wiki/Levenshtein_distance.
Machine learning
Machine learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn from data and mimic human intelligence. One of the practical applications of machine learning is anomaly detection, which involves identifying unusual patterns or outliers in the datasets. This is particularly useful for detecting unusual transactions that may indicate fraudulent activity automatically. Now, instead of looking for unusual transactions manually, machine learning-powered automated systems can efficiently process and analyze large amounts of transactional data in real-time to flag discrepancies and issues. Awesome, right?
Note
Recently, powerful AI-powered solutions such as ChatGPT and Google Bard became available for a wide range of applications including data quality management. We’ll cover this in more detail in Chapter 10, Using Artificial Intelligence for Finance Data Quality Management.
Orphaned records
Orphaned records are database records whose corresponding parent or related record(s) supposedly stored in another database table no longer exists. This situation may occur due to various reasons such as a record being deleted or modified without the related record(s) being updated as well.
Let’s say we have two related tables called Product Details
and Transactions
, as seen in Figure 1.2. We can see that the Product_ID
column in the Product Details
table contains additional details about the product (for example, the product name, price, and cost). The Transactions
table, on the other hand, contains information on when a product was sold and how many were sold.
Figure 1.2 – Related tables being connected by the Product_ID column
Here, the Product_ID
column serves as the bridge connecting these two tables. We can see that there are four unique Product_ID
values under the Transactions
table, which have matching records in the Product
Details
table.
Figure 1.3 – Orphaned records in the Transactions table
If, for example, the record with Product_ID = 151002
was inadvertently deleted from the Product Details
table (similar to what we have in Figure 1.3), we would have several orphaned records in the Transactions
table not having the corresponding Product Details
record in the first table. This could cause reconciliation errors that can become discrepancies in financial reports.
Note
We will cover how to detect orphaned records and manage data integrity in our hands-on examples in Chapter 6, Implementing Best Practices When Using Business Intelligence Tools, and Chapter 10, Using Artificial Intelligence for Finance Data Quality Management.
Financial reporting
Distributing financial data to interested parties, such as creditors, investors, and regulatory agencies, is the process of financial reporting. Since they form the basis for the financial statements that provide a snapshot of an organization’s financial status, taking care of data integrity is essential for accurate financial reporting.
One of the ways that an organization can ensure the integrity of financial reports is by setting up internal controls. These processes are set up by the company to provide a level of reasonable assurance as to the reliability of the financial statements, prevent fraud, and ensure compliance with regulations. It is important to note that if there are errors or discrepancies relating to inaccurate or misleading reporting, the company is at risk of penalties and fines.
Note
Being data-rich and information-poor is a common challenge. This emphasizes the importance of data integrity in transforming raw data into usable information, as exemplified in the creation and analysis of financial reports.
Balance sheet
The company’s financial position and state can be understood by referring to the balance sheet. At a specific point in time, it discloses information regarding its assets, liabilities, and shareholders’ equity. The balance sheet, together with the statement of profit and loss, and the statement of cash flows are the most common financial statements prepared by businesses. We will discuss the balance sheet here and the other two in the later part of this section.
These financial statements can be prepared by different individuals depending on the size of the company. If it is a small business, the owner can prepare it. Alternatively, external accountants can be hired to assist in the preparation of these reports. A balance sheet is usually prepared on a monthly basis or depending on the needs of the business.
Note
In order to make sure that the numbers add up, accounts reconciliation procedures are done as part of the internal controls. After the balance sheet has been prepared and if the company is publicly held, public accounting firms can be hired to review the balance sheet and conduct external audits.
Nowadays, companies may utilize software to speed up the preparation of financial reports. However, despite this, data integrity issues can still be present since the process of accounts reconciliation and recording may still be manual.
Figure 1.4 – Sample balance sheet
Let’s have a quick look at an example of a balance sheet. In Figure 1.4, we have a sample report for a hypothetical business detailing the company’s assets, liabilities, and shareholders’ equity. To help us understand the balance sheet better, we need to familiarize ourselves with the following key concepts:
- Asset: An asset is a resource that a company owns or controls, which is expected to provide future financial benefits. In the balance sheet example provided in Figure 1.4, assets include those that contribute to the value of the business. These include current assets, property, plant and equipment, and intangible assets. The assets of a company generally depend on the business type, and the components can change as necessary.
- Liability: This is a debt that a business is obligated to settle in the future. It represents a future outflow of resources. In simple terms, liabilities (and equity, which we will discuss next) are the sources of funding used to acquire assets.
- Equity: This is what remains from the assets after all liabilities are paid. Types of equity include common stock, preferred stock, and retained earnings. Simply put, this is the portion that the owners have claim to once the liabilities are settled. When money is invested in a company, it can be in the form of common stock or preferred stock. Then, when the business starts generating profits, the earnings are kept in the business and placed under the retained earnings account. A portion of the earnings are sometimes paid to the stockholders in the form of dividends.
A balance sheet needs to be always balanced. The reason it is called a balance sheet is that the assets should equal the sum of the liabilities and shareholders’ equity. The liabilities and equity fund the acquisition of business assets, which is why they need to be equal. Insufficient data, erroneous transactions, mistakes in inventory, or errors in equity calculations could likely be the causes of an unbalanced balance sheet.
Important note
If errors in the balance sheet go undetected and the balance sheet is publicly reported, this can potentially negatively impact the company’s reputation as well as its stock price.
Profit and loss statement
Business owners and accountants utilize the profit and loss (P&L) report as an essential financial statement. Based on the company’s revenues and expenses, the report details its net profit or loss. It describes how a company can generate revenue and earn income after deducting the expenses.
Now, let’s look at a quick example of a P&L report. Figure 1.5 shows an example P&L statement for our fictional company where we break down its sales, cost of goods sold, expenses, and net income.
Figure 1.5 – Statement of Profit and Loss
Here, we can see that the business earns its revenue from selling products. Then, the cost of goods sold is deducted from the sales to get the gross profit. Afterward, the expenses needed to run the business—such as the franchise fee, insurance, maintenance, and taxes—are deducted to get the net income after taxes. The P&L report also enables the company to look at net income and overall profitability to determine how to best manage its resources.
Note
Imagine we have a P&L report where the cost of goods sold account increased significantly without the corresponding increase in revenue. This could flag the presence of potential data integrity issues, which affect the accuracy of the profitability of the business. As mentioned earlier, one way to ensure the integrity of financial statements in general—and the profit and loss statement in particular—is to set up internal controls. Take, for example, internal controls for payroll in order to minimize the risk of fraud. Some examples of this are the segregation of duties between the timesheet approver, payroll processor, and payroll issuer, having a different bank account for payroll, and comparing the actual payroll expense with the budget. This will be discussed in more detail in Chapter 7, Detecting Fraudulent Transactions Affecting Financial Report Integrity.
Cash flow statement
Another major financial report in addition to the balance sheet and the P&L statement is the cash flow statement. This report connects the balance sheet and income statement because it shows how money flows in and out of the company. The beginning and ending balances of this account, as well as the change in the cash level over the course of the period, are shown in the cash flow statement.
Figure 1.6 shows the statement of cash flows for an imaginary company. The cash flow statement can be prepared in two ways, either through the direct method or indirect method. The direct method details the incoming and outgoing cash flows from operations, while the indirect method presents cash flows using the net income as a starting point.
Figure 1.6 – Statement of Cash Flows
In our example, we are using the indirect method where we calculated the Net cash provided by operating activities by starting with the net income and adjusting it based on changes in cash provided by operating activities.
The statement also shows the Cash flows from investing activities, such as funding capital expenditures or selling equipment, and Cash flows from financing activities, such as proceeds from issuing debt or paying dividends to stockholders.
Important note
The financial reports that we have discussed are external reports and there are various internal reports that are generated and processed by internal teams to address a company’s specific needs and ensure that reviews, integrity checks, and analysis are done properly.
Budgeting
Budgeting is the process of creating a plan or estimate regarding the expected revenue and expenses in the future. This may also involve establishing financial objectives and allocating the required resources to meet these objectives. In order to plan effectively for the future and allocate resources properly, it is essential to make sure that the budgeting data is accurate.
Note
For example, a payroll employee creates a ghost employee in the payroll system and is able to pocket these payments. Having a budgeting process would help flag data integrity issues and potentially fraudulent transactions. That being said, we can estimate how much the expected payroll cost each month would be. Having a different payroll employee compare the budget with the actual payouts will assist in identifying a ghost employee.
Forecasting
Making projections about potential financial outcomes based on past performance and other relevant information is known as forecasting. Ensuring data integrity is critical to the forecasting process since accurate and reliable data is needed as a basis for the projections. One of the key activities done by companies while facing the challenges during COVID-19 was cash flow forecasting. Gaining visibility over cash flows allows a company to manage and run its operations effectively and have a buffer, especially when things do not go necessarily as planned. For the cash flow forecast to be effective and useful, the underlying data needs to be reliable and accurate. Thus, there is a need for data quality and integrity.
Say, for example, the business owners wanted to invest USD 200,000 to install building improvements that would make the business more profitable in the long run. However, they want to make sure that this is feasible and would not put the business in a dire situation in terms of cash flow. Assuming that the management expects to have the same level of operations for next year and pay the same level of dividends as 2023, we can create a simple statement of cash flow forecast, as seen in Figure 1.7:
Figure 1.7 – A simple cash flow statement forecast
We can see that holding other assumptions constant, by investing USD 200,000 into improvements will lead to a negative cash flow of USD 71, 826 for the business. This means that the cash flow from operations will not be sufficient to fund the expenditure, and the business needs to take on a loan from a bank or issue additional stock to finance the spending.
But what if there was an error in the calculation of the USD 140,386 income caused by an oversight in recording the revenue for Product B? Suppose in 2022, we have a revenue of USD 85,000 for this product (as seen in Figure 1.5), but we should have USD 65,000 instead. This mistake will cause a ripple effect in the calculation of the gross profit as well as the net income before taxes and taxes on income. Given that we assumed that for 2023 we will have the same level of operations as 2022 for simplicity in our example, this will have an effect on how much cash needs to be borrowed or raised to fund the investment in the building improvements.
Note
With advancements in technology and artificial intelligence, companies have started using machine learning in forecasting. Similar to the cash flow forecast just seen, for the machine learning forecast to be useful, the training data needs to be reliable and accurate.
Depreciation
Companies can recognize and report the decline in the value of fixed assets over time on their financial statements through depreciation. It is computed using the asset’s cost, expected useful life, and any residual value. This way, depreciation allows businesses to more precisely reflect the true cost of utilizing an asset over its useful life rather than being expensed in full on a one-time basis. At the same time, it helps reflect the cost of utilizing the asset in the same period when revenue was generated. Depreciation is shown as an expense on the income statement and as a decrease in the asset’s value on the balance sheet:
Figure 1.8 – Building depreciation schedule
Figure 1.8 details the building depreciation schedule allocating the cost of the asset over its useful life of 10 years. It can be gathered from the details that the building was acquired in 2021 at a value of USD 260,000 and expected to have a resale value of zero by 2031, when it will be fully utilized.
Newly constructed buildings usually last 40 years. Given that it will be used for 10 years, we can conclude that it is not a new one. The initial cost of the asset can differ depending on whether the business is following Generally Accepted Accounting Principles (GAAP) or International Financial Reporting Standards (IFRS). However, in this example, we will assume that there is no difference for simplicity. Also, estimates regarding how long an asset can be used, the depreciation method, and how much it can be sold for are approximations that could change depending on expectations.
Important note
It is critical to ensure the integrity and correctness of depreciation estimates. If, for example, there was an error in the useful life or depreciation method used to calculate the depreciation expense, this will have a flow-on effect (that is, a ripple effect) on the P&L statement values and calculations. Once these data integrity issues are detected, they could lead to restatements of prior periods' financial statements, particularly when they contain material inaccuracies.
For more information on this topic, feel free to check the following link: https://www.investopedia.com/terms/r/restatement.asp.
Variable cost
While fixed costs stay the same regardless of the amount of production or sales, variable costs fluctuate based on the quantity of goods or services produced or sold. Materials, direct labor costs, and packaging costs are examples of variable costs, sometimes referred to as direct expenses, which are directly associated with the production or sale of a certain good or service. Office rent, administrative staff pay, and real estate taxes are examples of fixed costs.
Important note
Understanding the difference between variable and fixed costs is critical for financial management since it enables a business to determine pricing and profitability while also understanding the entire cost of production or sales at various output levels. Knowing the trend of the variable cost for a product helps identify any irregularities early in the process. If, for example, the variable costs significantly increased, it can mean that the cost of producing the product has increased, or it could indicate errors, inconsistencies, or potentially even fraud.
Risk management
Risk management is the process of identifying, assessing, and minimizing the different risks and threats (including data integrity issues) that could potentially impact an organization’s business operations and financial performance. It is necessary that accurate and reliable data is available in order to identify and evaluate these threats. You can also use this data in the development of effective risk mitigation strategies. In Chapter 2, Avoiding Common Data Integrity Issues and Challenges in Finance Teams, we will discuss how one of the biggest banks globally failed to perform the appropriate risk assessments of possible data corruption for their intelligent deposit machines (IDMs), which led to significant financial penalties.
Insurance
Insurance is a legal contract between two parties where the insurer provides financial coverage for any loss that the insured may suffer from an unforeseen loss. In addition to this, filing for an insurance claim causes business interruptions and requires that the data is current, comprehensive, and accurate.
Fraud in the insurance industry costs billions of dollars per year. One way to address this risk is to improve and manage data integrity by ensuring that the data can be trusted and that the records are up to date and correct.
Note
We will cover how fraud can negatively affect the integrity of financial reports under Chapter 7, Detecting Fraudulent Transactions Affecting Financial Report Integrity.
Transaction
A transaction is an exchange between two parties representing a transfer of resources. Each transaction is recorded as an entry in its financial records. Some examples are selling a product to customers, paying the rent on the company office, paying the wages and salaries of employees, or buying equipment needed for production. One way to help ensure data integrity at a transaction level is by using database locking techniques and features available in database applications. These techniques would make sure that updates performed during transactions are correctly reflected (that is, the numbers are adding up correctly!). We’ll cover these techniques in more detail in Chapter 8, Using Database Locking Techniques for Financial Transaction Integrity.
In addition to this, database transactions can be stored and audited using various solutions including ledger databases such as Amazon Quantum Ledger Database (QLDB). Using these special types of databases, even if transactions are deleted, they can still be validated and audited using the features provided by the ledger database. We will discuss this further in Chapter 9, Using Managed Ledger Databases for Finance Data Integrity.
Mutual exclusion
Another technique to help ensure data integrity in system transactions is to utilize a mutual exclusion lock (or mutex lock), which prevents simultaneous access to a common entity or resource. Imagine having a bank account that contains a total of $200
. Suppose two separate deposits are initiated at the same time—one deposit of $50
and another of $100
. In this scenario, a mutex would lock the account for one deposit transaction, preventing the other from accessing the account simultaneously. This ensures accurate updating of the account balance, as each deposit is processed in isolation. Once the first deposit ($50
) is completed and the account balance is updated, the mutex unlocks, allowing the second deposit ($100
) to proceed, thereby maintaining the integrity of the transaction process as well as the final account balance (which should be $350
). Without a lock, you might end up having an incorrect final account balance of $250
!
To help demonstrate how effective these types of locks are when building financial systems, let’s have a quick example where multiple running processes or threads are updating a shared resource (first without a mutex) using the Python programming language:
import threading counter = 0 def increment_counter(): global counter for _ in range(100000): counter += 1 threads = [] for i in range(10): thread = threading.Thread(target=increment_counter) threads.append(thread) thread.start() for thread in threads: thread.join() print(f"Final counter value without mutex: {counter}")
Here, we have a shared counter
variable as well as multiple threads updating the counter at the same time. What happens if we run this code? Without a mutex lock, the concurrent threads will try to update the counter at the same time and accidentally overwrite the updates performed by other threads!
Note
The expected final value of the counter is 1,000,000. However, without a mutex lock, you will most likely get a value less than 1,000,000 (which is incorrect). You would be surprised that running the code multiple times may yield different results as well! In case you want to run this example yourself, make sure to (1) install Python 3.7.X
on your laptop/local machine, (2) create a file named no_mutex.py
, and (3) run python3 no_mutex.py
in your terminal application. Alternatively, you can run the code on websites such as https://www.online-python.com/.
Now, let’s update the previous example and have it use a mutex lock:
import threading counter = 0 counter_lock = threading.Lock() def increment_counter(): global counter for _ in range(100000): with counter_lock: counter += 1 threads = [] for i in range(10): thread = threading.Thread(target=increment_counter) threads.append(thread) thread.start() for thread in threads: thread.join() print(f"Final counter value with mutex: {counter}")
This time, we will get the correct final value of the counter
variable (1,000,000) since the usage of a mutex lock helped ensure that an update to the same resource is completed first before the next update operation is performed. Feel free to run the code multiple times and you should see that the result should always be the same (that is, 1,000,000). Awesome, right?
Note
For more information about this topic, feel free to check the following page: https://en.wikipedia.org/wiki/Lock_(computer_science).
Now that we’ve established a good foundation with these concepts, let’s move on to the final section of this chapter.