Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
50 Algorithms Every Programmer Should Know

You're reading from   50 Algorithms Every Programmer Should Know Tackle computer science challenges with classic to modern algorithms in machine learning, software design, data systems, and cryptography

Arrow left icon
Product type Paperback
Published in Sep 2023
Publisher Packt
ISBN-13 9781803247762
Length 538 pages
Edition 2nd Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Imran Ahmad Imran Ahmad
Author Profile Icon Imran Ahmad
Imran Ahmad
Arrow right icon
View More author details
Toc

Table of Contents (22) Chapters Close

Preface 1. Section 1: Fundamentals and Core Algorithms FREE CHAPTER
2. Overview of Algorithms 3. Data Structures Used in Algorithms 4. Sorting and Searching Algorithms 5. Designing Algorithms 6. Graph Algorithms 7. Section 2: Machine Learning Algorithms
8. Unsupervised Machine Learning Algorithms 9. Traditional Supervised Learning Algorithms 10. Neural Network Algorithms 11. Algorithms for Natural Language Processing 12. Understanding Sequential Models 13. Advanced Sequential Modeling Algorithms 14. Section 3: Advanced Topics
15. Recommendation Engines 16. Algorithmic Strategies for Data Handling 17. Cryptography 18. Large-Scale Algorithms 19. Practical Considerations 20. Other Books You May Enjoy
21. Index

Practical applications

The ability to efficiently and accurately search data in a given data repository is critical to many real-life applications. Depending on your choice of searching algorithm, you may need to sort the data first as well. The choice of the right sorting and searching algorithms will depend on the type and the size of the data, as well as the nature of the problem you are trying to solve.

Let’s try to use the algorithms presented in this chapter to solve the problem of matching a new applicant at the immigration department of a certain country with historical records. When someone applies for a visa to enter the country, the system tries to match the applicant with the existing historical records. If at least one match is found, then the system further calculates the number of times that the individual has been approved or refused in the past. On the other hand, if no match is found, the system classes the applicant as a new applicant and issues them a new identifier.

The ability to search, locate, and identify a person in the historical data is critical for the system. This information is important because if someone has applied in the past and the application is known to have been refused, then this may affect that individual’s current application in a negative way. Similarly, if someone’s application is known to have been approved in the past, this approval may increase the chances of that individual getting approval for their current application. Typically, the historical database will have millions of rows, and we will need a well-designed solution to match new applicants in the historical database.

Let’s assume that the historical table in the database looks like the following:

Personal ID

Application ID

First name

Surname

DOB

Decision

Decision date

45583

677862

John

Doe

2000-09-19

Approved

2018-08-07

54543

877653

Xman

Xsir

1970-03-10

Rejected

2018-06-07

34332

344565

Agro

Waka

1973-02-15

Rejected

2018-05-05

45583

677864

John

Doe

2000-09-19

Approved

2018-03-02

22331

344553

Kal

Sorts

1975-01-02

Approved

2018-04-15

In this table, the first column, Personal ID, is associated with each of the unique applicants in the historical database. If there are 30 million unique applicants in the historical database, then there will be 30 million unique personal IDs. Each personal ID identifies an applicant in the historical database system.

The second column we have is Application ID. Each application ID identifies a unique application in the system. A person may have applied more than once in the past. So, this means that, in the historical database, we will have more unique application IDs than personal IDs. John Doe will only have one personal ID but has two application IDs, as shown in the preceding table.

The preceding table only shows a sample of the historical dataset. Let’s assume that we have close to 1 million rows in our historical dataset, which contains the records of the last 10 years of applicants. New applicants are continuously arriving at the average rate of around 2 applicants per minute. For each applicant, we need to do the following:

  • Issue a new application ID for the applicant.
  • See if there is a match with an applicant in the historical database.
  • If a match is found, use the personal ID for that applicant, as found in the historical database. We also need to determine how many times the application has been approved or refused in the historical database.
  • If no match is found, then we need to issue a new personal ID for that individual.

Suppose a new person arrives with the following credentials:

  • First Name: John
  • Surname: Doe
  • DOB: 2000-09-19

Now, how can we design an application that can perform an efficient and cost-effective search?

One strategy for searching the new application in the database can be devised as follows:

  1. Sort the historical database by DOB.
  2. Each time a new person arrives, issue a new application ID to the applicant.
  3. Fetch all the records that match that date of birth. This will be the primary search.
  4. Out of the records that have come up as matches, perform a secondary search using the first and last name.
  5. If a match is found, use Personal ID to refer to the applicants. Calculate the number of approvals and refusals.
  6. If no match is found, issue a new personal ID to the applicant.

Let’s try choosing the right algorithm to sort the historical database. We can safely rule out bubble sort as the size of the data is huge. Shell sort will perform better, but only if we have partially sorted lists. So, merge sort may be the best option for sorting the historical database.

When a new person arrives, we need to search for and locate that person in the historical database. As the data is already sorted, either interpolation search or binary search can be used. Because applicants are likely to be equally spread out, as per DOB, we can safely use binary search.

Initially, we search based on DOB, which returns a set of applicants sharing the same date of birth. Now, we need to find the required person within the small subset of people who share the same date of birth. As we have successfully reduced the data to a small subset, any of the search algorithms, including bubble sort, can be used to search for the applicant. Note that we have simplified the secondary search problem here a bit. We also need to calculate the total number of approvals and refusals by aggregating the search results, if more than one match is found.

In a real-world scenario, each individual needs to be identified in the secondary search using some fuzzy search algorithm, as the first and last names may be spelled slightly differently. The search may need to use some kind of distance algorithm to implement the fuzzy search, where the data points whose similarity is above a defined threshold are considered the same.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime