Practical applications
The ability to efficiently and accurately search data in a given data repository is critical to many real-life applications. Depending on your choice of searching algorithm, you may need to sort the data first as well. The choice of the right sorting and searching algorithms will depend on the type and the size of the data, as well as the nature of the problem you are trying to solve.
Let’s try to use the algorithms presented in this chapter to solve the problem of matching a new applicant at the immigration department of a certain country with historical records. When someone applies for a visa to enter the country, the system tries to match the applicant with the existing historical records. If at least one match is found, then the system further calculates the number of times that the individual has been approved or refused in the past. On the other hand, if no match is found, the system classes the applicant as a new applicant and issues them a new identifier.
The ability to search, locate, and identify a person in the historical data is critical for the system. This information is important because if someone has applied in the past and the application is known to have been refused, then this may affect that individual’s current application in a negative way. Similarly, if someone’s application is known to have been approved in the past, this approval may increase the chances of that individual getting approval for their current application. Typically, the historical database will have millions of rows, and we will need a well-designed solution to match new applicants in the historical database.
Let’s assume that the historical table in the database looks like the following:
Personal ID |
Application ID |
First name |
Surname |
DOB |
Decision |
Decision date |
45583 |
677862 |
John |
Doe |
2000-09-19 |
Approved |
2018-08-07 |
54543 |
877653 |
Xman |
Xsir |
1970-03-10 |
Rejected |
2018-06-07 |
34332 |
344565 |
Agro |
Waka |
1973-02-15 |
Rejected |
2018-05-05 |
45583 |
677864 |
John |
Doe |
2000-09-19 |
Approved |
2018-03-02 |
22331 |
344553 |
Kal |
Sorts |
1975-01-02 |
Approved |
2018-04-15 |
In this table, the first column, Personal ID
, is associated with each of the unique applicants in the historical database. If there are 30 million unique applicants in the historical database, then there will be 30 million unique personal IDs. Each personal ID identifies an applicant in the historical database system.
The second column we have is Application ID
. Each application ID identifies a unique application in the system. A person may have applied more than once in the past. So, this means that, in the historical database, we will have more unique application IDs than personal IDs. John Doe will only have one personal ID but has two application IDs, as shown in the preceding table.
The preceding table only shows a sample of the historical dataset. Let’s assume that we have close to 1 million rows in our historical dataset, which contains the records of the last 10 years of applicants. New applicants are continuously arriving at the average rate of around 2 applicants per minute. For each applicant, we need to do the following:
- Issue a new application ID for the applicant.
- See if there is a match with an applicant in the historical database.
- If a match is found, use the personal ID for that applicant, as found in the historical database. We also need to determine how many times the application has been approved or refused in the historical database.
- If no match is found, then we need to issue a new personal ID for that individual.
Suppose a new person arrives with the following credentials:
First Name: John
Surname: Doe
DOB: 2000-09-19
Now, how can we design an application that can perform an efficient and cost-effective search?
One strategy for searching the new application in the database can be devised as follows:
- Sort the historical database by
DOB
. - Each time a new person arrives, issue a new application ID to the applicant.
- Fetch all the records that match that date of birth. This will be the primary search.
- Out of the records that have come up as matches, perform a secondary search using the first and last name.
- If a match is found, use
Personal ID
to refer to the applicants. Calculate the number of approvals and refusals. - If no match is found, issue a new personal ID to the applicant.
Let’s try choosing the right algorithm to sort the historical database. We can safely rule out bubble sort as the size of the data is huge. Shell sort will perform better, but only if we have partially sorted lists. So, merge sort may be the best option for sorting the historical database.
When a new person arrives, we need to search for and locate that person in the historical database. As the data is already sorted, either interpolation search or binary search can be used. Because applicants are likely to be equally spread out, as per DOB
, we can safely use binary search.
Initially, we search based on DOB
, which returns a set of applicants sharing the same date of birth. Now, we need to find the required person within the small subset of people who share the same date of birth. As we have successfully reduced the data to a small subset, any of the search algorithms, including bubble sort, can be used to search for the applicant. Note that we have simplified the secondary search problem here a bit. We also need to calculate the total number of approvals and refusals by aggregating the search results, if more than one match is found.
In a real-world scenario, each individual needs to be identified in the secondary search using some fuzzy search algorithm, as the first and last names may be spelled slightly differently. The search may need to use some kind of distance algorithm to implement the fuzzy search, where the data points whose similarity is above a defined threshold are considered the same.