You're reading from 50 Algorithms Every Programmer Should Know Tackle computer science challenges with classic to modern algorithms in machine learning, software design, data systems, and cryptography

Product type Paperback

Published in Sep 2023

Publisher Packt

ISBN-13 9781803247762

Length 538 pages

Edition 2nd Edition

Languages

Processing

Tools

Processing

Concepts

Data Structures and Algorithms

Author (1):

Imran Ahmad

View More author details

Table of Contents (22) Chapters

Preface

1. Section 1: Fundamentals and Core Algorithms FREE CHAPTER

2. Overview of Algorithms

3. Data Structures Used in Algorithms

4. Sorting and Searching Algorithms

5. Designing Algorithms

6. Graph Algorithms

7. Section 2: Machine Learning Algorithms

8. Unsupervised Machine Learning Algorithms

9. Traditional Supervised Learning Algorithms

10. Neural Network Algorithms

11. Algorithms for Natural Language Processing

12. Understanding Sequential Models

13. Advanced Sequential Modeling Algorithms

14. Section 3: Advanced Topics

15. Recommendation Engines

16. Algorithmic Strategies for Data Handling

17. Cryptography

18. Large-Scale Algorithms

19. Practical Considerations

20. Other Books You May Enjoy

21. Index

Practical applications

The ability to efficiently and accurately search data in a given data repository is critical to many real-life applications. Depending on your choice of searching algorithm, you may need to sort the data first as well. The choice of the right sorting and searching algorithms will depend on the type and the size of the data, as well as the nature of the problem you are trying to solve.

Let’s try to use the algorithms presented in this chapter to solve the problem of matching a new applicant at the immigration department of a certain country with historical records. When someone applies for a visa to enter the country, the system tries to match the applicant with the existing historical records. If at least one match is found, then the system further calculates the number of times that the individual has been approved or refused in the past. On the other hand, if no match is found, the system classes the applicant as a new applicant and issues them a new identifier.

The ability to search, locate, and identify a person in the historical data is critical for the system. This information is important because if someone has applied in the past and the application is known to have been refused, then this may affect that individual’s current application in a negative way. Similarly, if someone’s application is known to have been approved in the past, this approval may increase the chances of that individual getting approval for their current application. Typically, the historical database will have millions of rows, and we will need a well-designed solution to match new applicants in the historical database.

Let’s assume that the historical table in the database looks like the following:

Personal ID	Application ID	First name	Surname	DOB	Decision	Decision date
45583	677862	John	Doe	2000-09-19	Approved	2018-08-07
54543	877653	Xman	Xsir	1970-03-10	Rejected	2018-06-07
34332	344565	Agro	Waka	1973-02-15	Rejected	2018-05-05
45583	677864	John	Doe	2000-09-19	Approved	2018-03-02
22331	344553	Kal	Sorts	1975-01-02	Approved	2018-04-15

In this table, the first column, Personal ID, is associated with each of the unique applicants in the historical database. If there are 30 million unique applicants in the historical database, then there will be 30 million unique personal IDs. Each personal ID identifies an applicant in the historical database system.

The second column we have is Application ID. Each application ID identifies a unique application in the system. A person may have applied more than once in the past. So, this means that, in the historical database, we will have more unique application IDs than personal IDs. John Doe will only have one personal ID but has two application IDs, as shown in the preceding table.

The preceding table only shows a sample of the historical dataset. Let’s assume that we have close to 1 million rows in our historical dataset, which contains the records of the last 10 years of applicants. New applicants are continuously arriving at the average rate of around 2 applicants per minute. For each applicant, we need to do the following:

Issue a new application ID for the applicant.
See if there is a match with an applicant in the historical database.
If a match is found, use the personal ID for that applicant, as found in the historical database. We also need to determine how many times the application has been approved or refused in the historical database.
If no match is found, then we need to issue a new personal ID for that individual.

Suppose a new person arrives with the following credentials:

First Name: John
Surname: Doe
DOB: 2000-09-19

Now, how can we design an application that can perform an efficient and cost-effective search?

One strategy for searching the new application in the database can be devised as follows:

Sort the historical database by DOB.
Each time a new person arrives, issue a new application ID to the applicant.
Fetch all the records that match that date of birth. This will be the primary search.
Out of the records that have come up as matches, perform a secondary search using the first and last name.
If a match is found, use Personal ID to refer to the applicants. Calculate the number of approvals and refusals.
If no match is found, issue a new personal ID to the applicant.

Let’s try choosing the right algorithm to sort the historical database. We can safely rule out bubble sort as the size of the data is huge. Shell sort will perform better, but only if we have partially sorted lists. So, merge sort may be the best option for sorting the historical database.

When a new person arrives, we need to search for and locate that person in the historical database. As the data is already sorted, either interpolation search or binary search can be used. Because applicants are likely to be equally spread out, as per DOB, we can safely use binary search.

Initially, we search based on DOB, which returns a set of applicants sharing the same date of birth. Now, we need to find the required person within the small subset of people who share the same date of birth. As we have successfully reduced the data to a small subset, any of the search algorithms, including bubble sort, can be used to search for the applicant. Note that we have simplified the secondary search problem here a bit. We also need to calculate the total number of approvals and refusals by aggregating the search results, if more than one match is found.

In a real-world scenario, each individual needs to be identified in the secondary search using some fuzzy search algorithm, as the first and last names may be spelled slightly differently. The search may need to use some kind of distance algorithm to implement the fuzzy search, where the data points whose similarity is above a defined threshold are considered the same.