Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon

Tech Guides

852 Articles
article-image-building-a-scalable-postgresql-solution
Natasha Mathur
14 Apr 2019
12 min read
Save for later

Building a scalable PostgreSQL solution 

Natasha Mathur
14 Apr 2019
12 min read
The term  Scalability means the ability of a software system to grow as the business using it grows. PostgreSQL provides some features that help you to build a scalable solution but, strictly speaking, PostgreSQL itself is not scalable. It can effectively utilize the following resources of a single machine: It uses multiple CPU cores to execute a single query faster with the parallel query feature When configured properly, it can use all available memory for caching The size of the database is not limited; PostgreSQL can utilize multiple hard disks when multiple tablespaces are created; with partitioning, the hard disks could be accessed simultaneously, which makes data processing faster However, when it comes to spreading a database solution to multiple machines, it can be quite problematic because a standard PostgreSQL server can only run on a single machine.  In this article, we will look at different scaling scenarios and their implementation in PostgreSQL. The requirement for a system to be scalable means that a system that supports a business now, should also be able to support the same business with the same quality of service as it grows. This article is an excerpt taken from the book  'Learning PostgreSQL 11 - Third Edition' written by Andrey Volkov and Salahadin Juba. The book explores the concepts of relational databases and their core principles.  You’ll get to grips with using data warehousing in analytical solutions and reports and scaling the database for high availability and performance. Let's say a database can store 1 GB of data and effectively process 100 queries per second. What if with the development of the business, the amount of data being processed grows 100 times? Will it be able to support 10,000 queries per second and process 100 GB of data? Maybe not now, and not in the same installation. However, a scalable solution should be ready to be expanded to be able to handle the load as soon as it is needed. In scenarios where it is required to achieve better performance, it is quite common to set up more servers that would handle additional load and copy the same data to them from a master server. In scenarios where high availability is required, this is also a typical solution to continuously copy the data to a standby server so that it could take over in case the master server crashes.  Scalable PostgreSQL solution Replication can be used in many scaling scenarios. Its primary purpose is to create and maintain a backup database in case of system failure. This is especially true for physical replication. However, replication can also be used to improve the performance of a solution based on PostgreSQL. Sometimes, third-party tools can be used to implement complex scaling scenarios. Scaling for heavy querying Imagine there's a system that's supposed to handle a lot of read requests. For example, there could be an application that implements an HTTP API endpoint that supports the auto-completion functionality on a website. Each time a user enters a character in a web form, the system searches in the database for objects whose name starts with the string the user has entered. The number of queries can be very big because of the large number of users, and also because several requests are processed for every user session. To handle large numbers of requests, the database should be able to utilize multiple CPU cores. In case the number of simultaneous requests is really large, the number of cores required to process them can be greater than a single machine could have. The same applies to a system that is supposed to handle multiple heavy queries at the same time. You don't need a lot of queries, but when the queries themselves are big, using as many CPUs as possible would offer a performance benefit—especially when parallel query execution is used. In such scenarios, where one database cannot handle the load, it's possible to set up multiple databases, set up replication from one master database to all of them, making each them work as a hot standby, and then let the application query different databases for different requests. The application itself can be smart and query a different database each time, but that would require a special implementation of the data-access component of the application, which could look as follows: Another option is to use a tool called Pgpool-II, which can work as a load-balancer in front of several PostgreSQL databases. The tool exposes a SQL interface, and applications can connect there as if it were a real PostgreSQL server. Then Pgpool-II will redirect the queries to the databases that are executing the fewest queries at that moment; in other words, it will perform load-balancing: Yet another option is to scale the application together with the databases so that one instance of the application will connect to one instance of the database. In that case, the users of the application should connect to one of the many instances. This can be achieved with HTTP load-balancing: Data sharding When the problem is not the number of concurrent queries, but the size of the database and the speed of a single query, a different approach can be implemented. The data can be separated into several servers, which will be queried in parallel, and then the result of the queries will be consolidated outside of those databases. This is called data sharding. PostgreSQL provides a way to implement sharding based on table partitioning, where partitions are located on different servers and another one, the master server, uses them as foreign tables. When performing a query on a parent table defined on the master server, depending on the WHERE clause and the definitions of the partitions, PostgreSQL can recognize which partitions contain the data that is requested and would query only these partitions. Depending on the query, sometimes joins, grouping and aggregation could be performed on the remote servers. PostgreSQL can query different partitions in parallel, which will effectively utilize the resources of several machines. Having all this, it's possible to build a solution when applications would connect to a single database that would physically execute their queries on different database servers depending on the data that is being queried. It's also possible to build sharding algorithms into the applications that use PostgreSQL. In short, applications would be expected to know what data is located in which database, write it only there, and read it only from there. This would add a lot of complexity to the applications. Another option is to use one of the PostgreSQL-based sharding solutions available on the market or open source solutions. They have their own pros and cons, but the common problem is that they are based on previous releases of PostgreSQL and don't use the most recent features (sometimes providing their own features instead). One of the most popular sharding solutions is Postgres-XL, which implements a shared-nothing architecture using multiple servers running PostgreSQL. The system has several components: Multiple data nodes: Store the data A single global transaction monitor (GTM): Manages the cluster, provides global transaction consistency Multiple coordinator nodes: Supports user connections, builds query-execution plans, and interacts with the GTM and the data nodes Postgres-XL implements the same API as PostgreSQL, therefore the applications don't need to treat the server in any special way. It is ACID-compliant, meaning it supports transactions and integrity constraints. The COPY command is also supported. The main benefits of using Postgres-XL are as follows: It can scale to support more reading operations by adding more data nodes It can scale for to support more writing operations by adding more coordinator nodes The current release of Postgres-XL (at the time of writing) is based on PostgreSQL 10, which is relatively new The main downside of Postgres-XL is that it does not provide any high-availability features out of the box. When more servers are added to a cluster, the probability of the failure of any of them increases. That's why you should take care with backups or implement replication of the data nodes themselves. Postgres-XL is open source, but commercial support is available. Another solution worth mentioning is Greenplum. It's positioned as an implementation of a massive parallel-processing database, specifically designed for data warehouses. It has the following components: Master node: Manages user connections, builds query execution plans, manages transactions Data nodes: Store the data and perform queries Greenplum also implements the PostgreSQL API, and applications can connect to a Greenplum database without any changes. It supports transactions, but support for integrity constraints is limited. The COPY command is supported. The main benefits of Greenplum are as follows: It can scale to support more reading operations by adding more data nodes. It supports column-oriented table organization, which can be useful for data-warehousing solutions. Data compression is supported. High-availability features are supported out of the box. It's possible (and recommended) to add a secondary master that would take over in case a primary master crashes. It's also possible to add mirrors to the data nodes to prevent data loss. The drawbacks are as follows: It doesn't scale to support more writing operations. Everything goes through the single master node and adding more data nodes does not make writing faster. However, it's possible to import data from files directly on the data nodes. It uses PostgreSQL 8.4 in its core. Greenplum has a lot of improvements and new features added to the base PostgreSQL code, but it's still based on a very old release; however, the system is being actively developed. Greenplum doesn't support foreign keys, and support for unique constraints is limited. There are commercial and open source editions of Greenplum. Scaling for many numbers of connections Yet another use case related to scalability is when the number of database connections is great.  However, when a single database is used in an environment with a lot of microservices and each has its own connection pool, even if they don't perform too many queries, it's possible that hundreds or even thousands of connections are opened in the database. Each connection consumes server resources and just the requirement to handle a great number of connections can already be a problem, without even performing any queries. If applications don't use connection pooling and open connections only when they need to query the database and close them afterwards, another problem could occur. Establishing a database connection takes time—not too much, but when the number of operations is great, the total overhead will be significant. There is a tool, named PgBouncer, that implements a connection-pool functionality. It can accept connections from many applications as if it were a PostgreSQL server and then open a limited number of connections towards the database. It would reuse the same database connections for multiple applications' connections. The process of establishing a connection from an application to PgBouncer is much faster than connecting to a real database because PgBouncer doesn't need to initialize a database backend process for the session. PgBouncer can create multiple connection pools that work in one of the three modes: Session mode: A connection to a PostgreSQL server is used for the lifetime of a client connection to PgBouncer. Such a setup could be used to speed up the connection process on the application side. This is the default mode. Transaction mode: A connection to PostgreSQL is used for a single transaction that a client performs. That could be used to reduce the number of connections at the PostgreSQL side when only a few translations are performed simultaneously. Statement mode: A database connection is used for a single statement. Then it is returned to the pool and a different connection is used for the next statement. This mode is similar to the transaction mode, though more aggressive. Note that multi-statement transactions are not possible when statement mode is used. Different pools can be set up to work in different modes. It's possible to let PgBouncer connect to multiple PostgreSQL servers, thus working as a reverse proxy. The way PgBouncer could be used is represented in the following diagram: PgBouncer establishes several connections to the database. When an application connects to PgBouncer and starts a transaction, PgBouncer assigns an existing database connection to that application, forwards all SQL commands to the database, and delivers the results back. When the transaction is finished, PgBouncer will dissociate the connections, but not close them. If another application starts a transaction, the same database connection could be used. Such a setup requires configuring PgBouncer to work in transaction mode. PostgreSQL provides several ways to implement replication that would maintain a copy of the data from a database on another server or servers. This can be used as a backup or a standby solution that would take over in case the main server crashes. Replication can also be used to improve the performance of a software system by making it possible to distribute the load on several database servers. In this article, we discussed the problem of building scalable solutions based on PostgreSQL utilizing the resources of several servers. We looked at scaling for querying, data sharding, as well as scaling for many numbers of connections.  If you enjoyed reading this article and want to explore other topics, be sure to check out the book 'Learning PostgreSQL 11 - Third Edition'. Handling backup and recovery in PostgreSQL 10 [Tutorial] Understanding SQL Server recovery models to effectively backup and restore your database Saving backups on cloud services with ElasticSearch plugins
Read more
  • 0
  • 0
  • 17384

article-image-a-non-programmers-guide-to-learning-machine-learning
Natasha Mathur
05 Sep 2018
11 min read
Save for later

A non programmer’s guide to learning Machine learning

Natasha Mathur
05 Sep 2018
11 min read
Artificial intelligence might seem intimidating, but it isn’t actually as complex as you might think. Many of the tools that have been developed over the last decade or so have all helped to make artificial intelligence and machine learning more accessible to engineers with varying degrees of experience and knowledge. Today, we’ve got to a stage where it’s now accessible even to people who have barely written a line of code in their life! Pretty exciting, right? But if you’re completely new to the field, it can be challenging to know how to get started - fortunately, we’re about to help you overcome that first hurdle. If you are an AI denier, then be sure to first read ‘why learn Machine Learning as a non-techie’ before you move forward. A strong purpose and belief is the first step to learning anything new. Alright, now here’s how you can get started with artificial intelligence and machine learning techniques quickly. 0. Use a free MLaaS or a no code interactive machine learning tool to experience first hand what is possible with learning machine learning: Some popular examples of no code machine learning as a service option are Microsoft Azure, BigML, Orange, and Amazon ML. Read Q2 under the FAQ section below to know more on this topic. 1. Learn Linear Algebra: Linear Algebra is the elementary unit for ML. It helps you effectively comprehend the theory behind the Machine learning algorithms and how they work. It also improves your math skills such as statistics, programming skills, which are all other skills that helps in ML. Learning Resources: Linear Algebra for Beginners: Open Doors to Great Careers Linear algebra Basics 2. Learn just enough Python or any programming: Now, you can get started with any language of your interest, but we suggest Python as  it’s great for people who are new to programming. It’s easy to learn due to its simple syntax. You’ll be able to quickly implement the ML algorithms. Also,  It has a rich development ecosystem that offers a ton of libraries and frameworks in Machine Learning such as Scikit Learn, Lasagne, Numpy, Scipy, Theano, Tensorflow, etc. Learning Resources: Python Machine Learning Learn Python in 7 Days Python for Beginners 2017 [Video] Learn Python with codecademy Python editor for beginner programmers 3. Learn basic Probability Theory and statistics: A lot of fundamental Statistical and Probability Theories form the basis for ML. You’ve probably already learned Probability and statistics in school, it easy to dive into advanced statistics for ML. Machine learning in its currently widely used form is a way to predict odds and see patterns. Knowing statistics and probability is important as it will help you with better understanding of why any machine learning algorithm works. For example, your grounding in this area, will help to ask the right questions, choose the right set of algorithms and know what to expect as answers from your ML model on questions such as: What are the odds of this person also liking this movie given their current movie watching choices ( Collaborative filtering and content-based filtering) How similar is this user to that group of users who brought a bunch of stuff on my site (clustering, collaborative filtering, and classification) Could this person be at risk of cancer given a certain set of traits and health indicator observations (logistic regression) Should you buy that stock (decision tree) Also, check out our interview with James D. Miller to know more about why learning stats is important in this field. Learning resources: Statistics for Data Science [Video] 4. Learn machine learning algorithms: Do not get intimidated!  You don’t have to be an expert to learn ML algorithms. Knowing basic ML algorithms that are majorly used in the real world applications like linear regression, naive Bayes, and decision trees, are enough to get you started. Learn what they do and how they are used in Machine Learning. 5. Learn numpy sci-kit learn,Keras or any other popular machine learning framework: It can be confusing initially to decide which framework to learn. Each one has its own advantages and disadvantages. Numpy is a linear algebra library which is useful for performing mathematical and logical operations. You can easily work with large multidimensional arrays using Numpy. Sci-kit learn helps with quick implementation of popular algorithms on datasets as just one line of code makes different algorithms available for you. Keras is minimalistic and straightforward with high-levels of extensibility, so it is easier to approach. Learning Resources:  Hands-on Machine Learning with TensorFlow [Video]  Hands-on Scikit-learn for Machine Learning [Video] If you have reached till here, it is time to put your learning into practice. Go ahead and create a simple linear regression model using some publicly available dataset in your area of interest. Kaggle, ourworldindata.org, UC Irvine Machine Learning repository, elitedatascience, all have a rich set of clean datasets in varied fields. Now, it is necessary to commit and put in daily efforts to practise these skills. Quora, Reddit, Medium, and stackoverflow will be your best friends when it comes to solving doubts regarding any of these skills. Data Helpers is another great resource that provides newcomers with help on queries regarding entering the ML field and related topics. Additionally, once you start getting hang of these skills, identify your strengths and interests, to realign your career goals. Research on the kind of work you want to put your newly gained Machine Learning skill to use. It needn’t be professional or serious, it just needs to be something that you deeply care about or are passionate about. This will pull you through your learning milestones, should you feel low at some point. Also, don’t forget to collaborate with other people and learn from them. You can work with web developers, software programmers, data analysts, data administrators, game developers etc. Finally, keep yourself updated with all the latest happenings in the ML world. Follow top experts and influencers on social media, top blogs on Machine Learning, and conferences. Once you are done checking off these steps off your list, you’ll be ready to start off with your ML project.                                                  Now, we’ll be looking at the most frequently asked questions by beginners in the field of Machine learning. Frequently asked questions by Beginners in ML As a beginner, it’s natural to have a lot of questions regarding ML. We’ll be addressing the top three frequently asked questions by beginners or non-programmers when it comes to Machine learning: Q.1 I am looking to make a career in Machine learning but I have no prior programming experience. Do I need to know programming for Machine learning? In a nutshell, Yes. If you want a career in Machine learning then having some form of programming knowledge really helps. As mentioned earlier in this article, learning a programming language can really help you with implementing ML algorithms. It also lets you know the internal mechanism behind Machine learning. So, having programming as a prior skill is great. Again, as mentioned before, you can get started with Python which is the easiest and the most common languages for ML. However, programming is just a part of Machine learning. For instance, “machine learning engineers” typically write more code than develop models, while “research scientists” work more on modelling and analyzing different models. Now, ML is based on the principles of statistical inference and for talking statistically to the computer, we need a language, there comes Coding. So, even though the nature of your job in ML might not require you to code as much, there’s still some amount of coding required. Read Also: Why is Python so good for AI and ML? 5 Python Experts Explain Top languages for Artificial Intelligence development Q.2 Are there any tools that can help me with Machine learning without touching a single line of code? Yes. With the rise of MLaaS (Machine learning as a service), there are certain tools that help you get started with machine learning right-away. These are especially useful for business applications of ML, such as predictive modelling and clustering. Read Also: How MLaaS is transforming cloud Some of the most popular ones are: BigML:  This cloud based web-service lets you upload your data, prepare it and run algorithms on it. It’s great for people with not so extensive data science backgrounds. It offers a clean and easy to use interfaces for configuring algorithms (decision trees) and reviewing the results. Being focused “only” on Machine Learning, it comes with a wide set of features, all well integrated within a usable Web UI. Other than that, it also offers an API so that if you like it you can build an application around it. Microsoft Azure: The Microsoft Azure ML studio is a “GUI-based integrated development environment for constructing and operationalizing Machine Learning workflow on Azure”. So, via an integrated development environment called ML Studio, people without data science background or non-programmers can also build data models with the help of drag-and-drop gestures and simple data flow diagrams. This also saves a lot of time through ML Studio's library of sample experiments. Learning resources: Microsoft Azure Machine Learning Machine Learning In The Cloud With Azure ML[Video] Orange: This is an open source machine learning and data visualization studio for novice and experts alike. It provides a toolbox comprising of text mining (topic modelling) and image recognition. It also offers a design tool for visual programming which allows you to connect together data preparation, algorithms, and result evaluation, thereby, creating machine learning “programs”. Apart from that, it provides over 100 widgets for the environment and there’s also a Python API and library available which you can integrate into your application. Amazon ML: Amazon ML is a part of Amazon Web Services ( AWS ) that combines powerful machine learning algorithms with interactive visual tools to guide you towards easily creating, evaluating, and deploying machine learning models. So, whether you are a data scientist or a newbie, it offers ML services and tools tailored to meet your needs and level of expertise. Building ML models using Amazon ML consists of three operations: data analysis, model training, and evaluation. Learning Resources: Effective Amazon Machine Learning Q.3  Do I need to know advanced mathematics ( college graduate level ) to learn Machine learning? It depends. As mentioned earlier, understanding of the following mathematical topics: Probability, Statistics and Linear Algebra can really make your machine learning journey easier and also help simplify your code. These help you understand the “why” behind the working of the machine learning algorithms, which is quite fundamental to understanding ML. However, not knowing advanced mathematics is not an excuse to not learning Machine Learning. There a lot of libraries which makes the task of applying an ML algorithm to solve a task easier. One such example is the widely used Python’s scikit-learn library. With scikit-learn, you just need one line of code and you’ll have the most common algorithms there for you, ready to be used. But, if you want to go deeper into machine learning then knowing advanced mathematics is a prerequisite as it will help you understand the algorithms, the formulas, how the learning is done and many other Machine Learning concepts. Also, with so many courses and tutorials online, you can always learn advanced mathematics on the side while exploring Machine learning. So, we looked at the three most asked questions by beginners in the field of Machine Learning. In the past, machine learning has provided us with self-driving cars, effective web search, speech recognition, etc. Machine learning is extremely pervasive, in fact, many researchers believe that ML is the best way to make progress towards human-level AI. Learning ML is not an easy task but its not next to impossible either. In the end, it all depends on the amount of dedication and efforts that you’re willing to put in to get a grasp of it. We just touched the tip of the iceberg in this article, there’s a lot more to know in Machine Learning which you will get a hang of as you get your feet dirty in it. That being said, all the best for the road ahead! Facebook launches a 6-part ML video series 7 of the best ML conferences for the rest of 2018 Google introduces Machine Learning courses for AI beginners
Read more
  • 0
  • 0
  • 16892

article-image-6-artificial-intelligence-cybersecurity-tools-you-need-to-know
Savia Lobo
25 Aug 2018
7 min read
Save for later

6 artificial intelligence cybersecurity tools you need to know

Savia Lobo
25 Aug 2018
7 min read
Recently, most of the organizations experienced severe downfall due to an undetected malware, Deeplocker, which secretly evaded even the stringent cyber security mechanisms. Deeplocker leverages the AI model to attack the target host by using indicators such as facial recognition, geolocation and voice recognition. This incidence speaks volumes about the big role AI plays in the cybersecurity domain. In fact, some may even go on to say that AI for cybersecurity is no longer a nice to have tech rather a necessity. Large and small organizations and even startups are hugely investing in building AI systems to analyze the huge data trove and in turn, help their cybersecurity professionals to identify possible threats and take precautions or immediate actions to solve it. If AI can be used in getting the systems protected, it can also harm it. How? The hackers and intruders can also use it to launch an attack--this would be a much smarter attack--which would be difficult to combat. Phishing, one of the most common and simple social engineering cyber attack is now easy for attackers to master. There are a plethora of tools on the dark web that can help anyone to get their hands on phishing. In such trying conditions, it is only imperative that organizations take necessary precautions to guard their information castles. What better than AI? How 6 tools are using artificial intelligence for cybersecurity Symantec’s Targeted attack analytics (TAA) tool This tool was developed by Symantec and is used to uncover stealthy and targeted attacks. It applies AI and machine learning on the processes, knowledge, and capabilities of the Symantec’s security experts and researchers. The TAA tool was used by Symantec to counter the Dragonfly 2.0 attack last year. This attack targeted multiple energy companies and tried to gain access to operational networks. Eric Chein, Technical Director of Symantec Security says, “ With TAA, we’re taking the intelligence generated from our leading research teams and uniting it with the power of advanced machine learning to help customers automatically identify these dangerous threats and take action.” The TAA tools analyze incidents within the network against the incidents found in their Symantec threat data lake. TAA unveils suspicious activity in individual endpoints and collates that information to determine whether each action indicate hidden malicious activity. The TAA tools are now available for Symantec Advanced Threat Protection (ATP) customers. Sophos’ Intercept X tool Sophos is a British security software and hardware company. Its tool, Intercept X, uses a deep learning neural network that works similar to a human brain. In 2010, the US Defense Advanced Research Projects Agency (DARPA) created their first Cyber Genome Program to uncover the ‘DNA’ of malware and other cyber threats, which led to the creation of algorithm present in the Intercept X. Before a file executes, the Intercept X is able to extract millions of features from a file, conduct a deep analysis, and determine if a file is benign or malicious in 20 milliseconds. The model is trained on real-world feedback and bi-directional sharing of threat intelligence via an access to millions of samples provided by the data scientists. This results in high accuracy rate for both existing and zero-day malware, and a lower false positive rate. Intercept X utilizes behavioral analysis to restrict new ransomware and boot-record attacks.  The Intercept X has been tested on several third parties such as NSS labs and received high-scores. It is also proven on VirusTotal since August of 2016. Maik Morgenstern, CTO, AV-TEST said, “One of the best performance scores we have ever seen in our tests.” Darktrace Antigena Darktrace Antigena is Darktrace’s active self-defense product. Antigena expands Darktrace’s core capabilities to detect and replicate the function of digital antibodies that identify and neutralize threats and viruses. Antigena makes use of Darktrace’s Enterprise Immune System to identify suspicious activity and responds to them in real-time, depending on the severity of the threat. With the help of underlying machine learning technology, Darktrace Antigena identifies and protects against unknown threats as they develop. It does this without the need for human intervention, prior knowledge of attacks, rules or signatures. With such automated response capability, organizations can respond to threats quickly, without disrupting the normal pattern of business activity. Darktrace Antigena modules help to regulate user and machine access to the internet, message protocols and machine and network connectivity via various products such as Antigena Internet, Antigena Communication, and Antigena network. IBM QRadar Advisor IBM’s QRadar Advisor uses the IBM Watson technology to fight against cyber attacks. It uses AI to auto-investigate indicators of any compromise or exploit. QRadar Advisor uses cognitive reasoning to give critical insights and further accelerates the response cycle. With the help of IBM’s QRadar Advisor, security analysts can assess threat incidents and reduce the risk of missing them. Features of the IBM QRadar Advisor Automatic investigations of incidents QRadar Advisor with Watson investigates threat incidents by mining local data using observables in the incident to gather broader local context. It later quickly assesses the threats regarding whether they have bypassed layered defenses or were blocked. Provides Intelligent reasoning QRadar identifies the likely threat by applying cognitive reasoning. It connects threat entities related to the original incident such as malicious files, suspicious IP addresses, and rogue entities to draw relationships among these entities. Identifies high priority risks With this tool, one can get critical insights on an incident, such as whether or not a malware has executed, with supporting evidence to focus your time on the higher risk threats. Then make a decision quickly on the best response method for your business. Key insights on users and critical assets IBM’s QRadar can detect suspicious behavior from insiders through integration with the User Behavior Analytics (UBA) App and understands how certain activities or profiles impact systems. Vectra’s Cognito Vectra’s Cognito platform uses AI to detect attackers in real-time. It automates threat detection and hunts for covert attackers. Cognito uses behavioral detection algorithms to collect network metadata, logs and cloud events. It further analyzes these events and stores them to reveal hidden attackers in workloads and user/IoT devices. Cognito platform consists of Cognito Detect and Cognito Recall. Cognito Detect reveals hidden attackers in real time using machine learning, data science, and behavioral analytics. It automatically triggers responses from existing security enforcement points by driving dynamic incident response rules. Cognito Recall determines exploits that exist in historical data. It further speeds up detection of incident investigations with actionable context about compromised devices and workloads over time. It’s a quick and easy fix to find all devices or workloads accessed by compromised accounts and identify files involved in exfiltration. Just as diamond cuts diamond, AI cuts AI. By using AI to attack and to prevent on either side, AI systems will learn different and newer patterns and also identify unique deviations to security analysts. This provides organizations to resolve an attack on the way much before it reaches to the core. Given the rate at which AI and machine learning are expanding, the days when AI will redefine the entire cybersecurity ecosystem are not that far. DeepMind AI can spot over 50 sight-threatening eye diseases with expert accuracy IBM’s DeepLocker: The Artificial Intelligence powered sneaky new breed of Malware 7 Black Hat USA 2018 conference cybersecurity training highlights Top 5 cybersecurity trends you should be aware of in 2018  
Read more
  • 0
  • 0
  • 16658

article-image-webgl-20-what-you-need-know
Raka Mahesa
01 May 2017
5 min read
Save for later

WebGL 2.0: What you need to know

Raka Mahesa
01 May 2017
5 min read
Earlier this year, Google and Mozilla released a version of Chrome and Firefox that has full support for WebGL 2.0. While some of the previous versions of their browsers also have support for WebGL 2.0, those versions by default disable the WebGL 2.0 feature. By enabling WebGL 2.0 in their latest browser version, it seems both Google and Mozilla are confident that this bleeding edge web technology can finally be used by most users without any problems.  So, what is WebGL 2.0? How does it differ from the previous version of WebGL? What, in fact, is WebGL?  To answer those questions, let's go back in time a little bit. In the early 1990s, graphics intensive applications were expensive to create because the software had to be customized for each type of graphic processing hardware. Imagine having to write an app for each smartphone vendor separately; it would cost many man hours. So, to mitigate this problem, a standard for graphics computing was introduced. This standard is called OpenGL (which stands for Open Graphics Library).  When mobile phones with display screens were introduced, people realized that mobile technology also needed a standard for graphics computing. However, OpenGL is a standard primarily for desktop-class hardware, so they realized that they would need a different standard that could work with the limited capability of mobile hardware. And thus OpenGL ES (Embedded System) was branched out from OpenGL, and the initial version was released in the early 2000s.  The same progression happened to web technology. In 2009, web applications became increasingly graphic-intensive, so a graphical standard called WebGL was introduced to help software developers. One thing noted, however, was that users could access web applications from both desktop and mobile devices, so WebGL needed to work on both platforms. To accommodate that, WebGL was created based on the OpenGL ES specification instead of the desktop-focused OpenGL.  Technology keeps advancing. As graphics hardware becomes more capable, additional features get added to the graphical standards. The latest version of OpenGL ES, version 3.0, was released in 2012 to keep up with the advancement in mobile GPU. WebGL 1.0, however, was still based on OpenGL ES 2.0. So in 2017, the specification for WebGL 2.0, which was based on OpenGL ES 3.0, was finally released.  As we can see from the timeline, WebGL 2.0 is really fresh out the oven. In fact, it's so new, that at the time of writing this article, the only browsers that support the standards are Google Chrome, Mozilla Firefox, and the Opera browser. WebGL 2.0 support on Safari is still under development. Also, it's worth noting that no mobile browser supports WebGL 2.0 by default (WebGL 2.0 support on Chrome for Android can be enabled via a hidden menu).  Considering the limited number of compatible platforms, as developers, we really can't rely on the user to have the necessary browser for our apps. So, with that limitation in mind, we have to always check for the browser's capability and prepare a fallback method in the event that the browser does not support WebGL 2.0.  So, how does WebGL version 2.0 differ from version 1.0? Fortunately, nothing major has changed with the way the library is used. This latest version of WebGL simply adds additional features and also makes some optional extensions of the library to be included by default.  One of the WebGL 1.0 extensions that have been made mandatory on WebGL 2.0 is the Instancing extension, which enables developers to render multiple copies of the same mesh efficiently. This feature is very useful for drawing decorative objects, like grass. Another extension that has been included in WebGL 2.0 is Depth Texture, which is used a lot for computing lighting and creating shadow maps.  Another major addition to WebGL 2.0 is the support for GLSL 3.0 ES, the latest programming language for the OpenGL shader. With this version of GLSL, a loop in the shader is no longer restricted to a constant integer. Not just that, GLSL 3.0 ES also brings additional matrix operations (like an inverse function) that will make coding a shader much easier.  WebGL 2.0 also offers much better support for textures. With version 2.0, the non-power of 2D textures are finally supported, which means the size of your texture image is no longer limited to 32, 64, 128, 256, and such. 3D textures are also supported now, which is pretty useful for volumetric effects such as light rays and smoke, as well as for storing medical scans.  WebGL 2.0 also adds support for more texture formats such as RGBA32, RGBA16, R11F_G11F_B10F, SRGB8, and others. More compressed texture formats that are not platform-specific are also supported, including: COMPRESSED_RGB8_ETC2, COMPRESSED_RGBA8_ETC2_EAC, and more.  There are other additions to WebGL 2.0, such as Multiple Draw Buffer, Transform Feedback, Uniform Buffer Object, and more. To learn about these and much more, see the official WebGL 2.0 specifications to check out all those additions in detail.  About this author Raka Mahesa is a game developer at Chocoarts: http://chocoarts.com/, who is interested in digital technology in general. Outside of work hours, he likes to work on his own projects, with Corridoom VR being his latest released game. Raka also regularly tweets as @legacy99. 
Read more
  • 0
  • 0
  • 16036

article-image-vulnerabilities-in-the-application-and-transport-layer-of-the-tcp-ip-stack
Melisha Dsouza
07 Feb 2019
15 min read
Save for later

Vulnerabilities in the Application and Transport Layer of the TCP/IP stack

Melisha Dsouza
07 Feb 2019
15 min read
The Transport layer is responsible for end-to-end data communication and acts as an interface for network applications to access the network. This layer also takes care of error checking, flow control, and verification in the TCP/IP  protocol suite. The Application Layer handles the details of a particular application and performs 3 main tasks- formatting data, presenting data and transporting data.  In this tutorial, we will explore the different types of vulnerabilities in the Application and Transport Layer. This article is an excerpt from a book written by Glen D. Singh, Rishi Latchmepersad titled CompTIA Network+ Certification Guide This book covers all CompTIA certification exam topics in an easy-to-understand manner along with plenty of self-assessment scenarios for better preparation. This book will not only prepare you conceptually but will also help you pass the N10-007 exam. Vulnerabilities in the Application Layer The following are some of the application layer protocols which we should pay close attention to in our network: File Transfer Protocol (FTP) Telnet Secure Shell (SSH) Simple Mail Transfer Protocol (SMTP) Domain Name System (DNS) Dynamic Host Configuration Protocol (DHCP) Hypertext Transfer Protocol (HTTP) Each of these protocols was designed to provide the function it was built to do and with a lesser focus on security. Malicious users and hackers are able to compromise both the application that utilizes these protocols and the network protocols themselves. Cross Site Scripting (XSS) XSS focuses on exploiting a weakness in websites. In an XSS attack, the malicious user or hacker injects client-side scripts into a web page/site that a potential victim would trust. The scripts can be JavaScript, VBScript, ActiveX, and HTML, or even Flash (ActiveX), which will be executed on the victim's system. These scripts will be masked as legitimate requests between the web server and the client's browser. XSS focuses on the following: Redirecting a victim to a malicious website/server Using hidden Iframes and pop-up messages on the victim's browser Data manipulation Data theft Session hijacking Let's take a deeper look at what happens in an XSS attack: An attacker injects malicious code into a web page/site that a potential victim trusts. A trusted site can be a favorite shopping website, social media platform, or school or university web portal. A potential victim visits the trusted site. The malicious code interacts with the victim's web browser and executes. The web browser is usually unable to determine whether the scripts are malicious or not and therefore still executes the commands. The malicious scripts can be used obtain cookie information, tokens, session information, and so on about other websites that the browser has stored information about. The acquired details (cookies, tokens, sessions ID, and so on) are sent back to the hacker, who in turn uses them to log in to the sites that the victim's browser has visited: There are two types of XSS attacks: Stored XSS (persistent) Reflected (non-persistent) Stored XSS (persistent): In this attack, the attacker injects a malicious script directly into the web application or a website. The script is stored permanently on the page, so when a potential victim visits the compromised page, the victim's web browser will parse all the code of the web page/application fine. Afterward, the script is executed in the background without the victim's knowledge. At this point, the script is able to retrieve session cookies, passwords, and any other sensitive information stored in the user's web browser, and sends the loot back to the attacker in the background. Reflective XSS (non-persistent): In this attack, the attacker usually sends an email with the malicious link to the victim. When the victim clicks the link, it is opened in the victim's web browser (reflected), and at this point, the malicious script is invoked and begins to retrieve the loot (passwords, credit card numbers, and so on) stored in the victim's web browser. SQL injection (SQLi) SQLi attacks focus on parsing SQL commands into an SQL database that does not validate the user input. The attacker attempts to gain unauthorized access to a database either by creating or retrieving information stored in the database application. Nowadays, attackers are not only interested in gaining access, but also in retrieving (stealing) information and selling it to others for financial gain. SQLi can be used to perform: Authentication bypass: Allows the attacker to log in to a system without a valid user credential Information disclosure: Retrieves confidential information from the database Compromise data integrity: The attacker is able to manipulate information stored in the database Lightweight Directory Access Protocol (LDAP) injection LDAP is designed to query and update directory services, such as a database like Microsoft Active Directory. LDAP uses both TCP and UDP port 389 and LDAP uses port 636. In an LDAP injection attack, the attacker exploits the vulnerabilities within a web application that constructs LDAP messages or statements, which are based on the user input. If the receiving application does not validate or sanitize the user input, this increases the possibility of manipulating LDAP messages. Cross-Site Request Forgery (CSRF) This attack is a bit similar to the previously mentioned XSS attack. In a CSRF attack, the victim machine/browser is forced to execute malicious actions against a website with which the victim has been authenticated (a website that trusts the actions of the user). To have a better understanding of how this attack works, let's visualize a potential victim, Bob. On a regular day, Bob visits some of his favorite websites, such as various blogs, social media platforms, and so on, where he usually logs in automatically to view the content. Once Bob logs in to a particular website, the website would automatically trust the transactions between itself and the authenticated user, Bob. One day, he receives an email from the attacker but unfortunately Bob does not realize the email is a phishing/spam message and clicks on the link within the body of the message. His web browser opens the malicious URL in a new tab: The attack would cause Bob's machine/web browser to invoke malicious actions on the trusted website; the website would see all the requests are originating from Bob. The return traffic such as the loot (passwords, credit card details, user account, and so on) would be returned to the attacker. Session hijacking When a user visits a website, a cookie is stored in the user's web browser. Cookies are used to track the user's preferences and manage the session while the user is on the site. While the user is on the website, a session ID is also set within the cookie, and this information may be persistent, which allows a user to close the web browser and then later revisit the same website and automatically log in. However, the web developer can set how long the information is persistent for, whether it expires after an hour or a week, depending on the developer's preference. In a session hijacking attack, the attacker can attempt to obtain the session ID while it is being exchanged between the potential victim and the website. The attacker can then use this session ID of the victim on the website, and this would allow the attacker to gain access to the victim's session, further allowing access to the victim's user account and so on. Cookie poisoning A cookie stores information about a user's preferences while he/she is visiting a website. Cookie poisoning is when an attacker has modified a victim's cookie, which will then be used to gain confidential information about the victim such as his/her identity. DNS Distributed Denial-of-Service (DDoS) A DDoS attack can occur against a DNS server. Attacker sometimes target Internet Service Providers (ISPs) networks, public and private Domain Name System (DNS) servers, and so on to prevent other legitimate users from accessing the service. If a DNS server is unable to handle the amount of requests coming into the server, its performance will eventually begin to degrade gradually, until it either stops responding or crashes. This would result in a Denial-of-Service (DoS) attack. Registrar hijacking Whenever a person wants to purchase a domain, the person has to complete the registration process at a domain registrar. Attackers do try to compromise users accounts on various domain registrar websites in the hope of taking control of the victim's domain names. With a domain name, multiple DNS records can be created or modified to direct incoming requests to a specific device. If a hacker modifies the A record on a domain to redirect all traffic to a compromised or malicious server, anyone who visits the compromised domain will be redirected to the malicious website. Cache poisoning Whenever a user visits a website, there's the process of resolving a host name to an IP address which occurs in the background. The resolved data is stored within the local system in a cache area. The attacker can compromise this temporary storage area and manipulate any further resolution done by the local system. Typosquatting McAfee outlined typosquatting, also known as URL hijacking, as a type of cyber-attack that allows an attacker to create a domain name very close to a company's legitimate domain name in the hope of tricking victims into visiting the fake website to either steal their personal information or distribute a malicious payload to the victim's system. Let's take a look at a simple example of this type of attack. In this scenario, we have a user, Bob, who frequently uses the Google search engine to find his way around the internet. Since Bob uses the www.google.com website often, he sets it as his homepage on the web browser so each time he opens the application or clicks the Home icon, www.google.com is loaded onto the screen. One day Bob decides to use another computer, and the first thing he does is set his favorite search engine URL as his home page. However, he typed www.gooogle.com and didn't realize it. Whenever Bob visits this website, it looks like the real website. Since the domain was able to be resolved to a website, this is an example of how typosquatting works. It's always recommended to use a trusted search engine to find a URL for the website you want to visit. Trusted internet search engine companies focus on blacklisting malicious and fake URLs in their search results to help protect internet users such as yourself. Vulnerabilities at the Transport Layer In this section, we are going to discuss various weaknesses that exist within the underlying protocols of the Transport Layer. Fingerprinting In the cybersecurity world, fingerprinting is used to discover open ports and services that are running open on the target system. From a hacker's point of view, fingerprinting is done before the exploitation phase, as the more information a hacker can obtain about a target, the hacker can then narrow its attack scope and use specific tools to increase the chances of successfully compromising the target machine. This technique is also used by system/network administrators, network security engineers, and cybersecurity professionals alike. Imagine you're a network administrator assigned to secure a server; apart from applying system hardening techniques such as patching and configuring access controls, you would also need to check for any open ports that are not being used. Let's take a look at a more practical approach to fingerprinting in the computing world. We have a target machine, 10.10.10.100, on our network. As a hacker or a network security professional, we would like to know which TCP and UDP ports are open, the services that use the open ports, and the service daemon running on the target system. In the following screenshot, we've used nmap to help us discover the information we are seeking. The NMap tools delivers specially crafted probes to a target machine: Enumeration In a cyber attack, the hacker uses enumeration techniques to extract information about the target system or network. This information will aid the attacker in identifying system attack points. The following are the various network services and ports that stand out for a hacker: Port 53: DNS zone transfer and DNS enumeration Port 135: Microsoft RPC Endpoint Mapper Port 25: Simple Mail Transfer Protocol (SMTP) DNS enumeration DNS enumeration is where an attacker is attempting to determine whether there are other servers or devices that carry the domain name of an organization. Let's take a look at how DNS enumeration works. Imagine we are trying to find out all the publicly available servers Google has on the internet. Using the host utility in Linux and specifying a hostname, host www.google.com, we can see the IP address 172.217.6.196 has been resolved successfully. This means there's a device with a host name of www.google.com active. Furthermore, if we attempt to resolve the host name, gmail.google.com, another IP address is presented but when we attempt to resolve mx.google.com, no IP address is given. This is an indication that there isn't an active device with the mx.google.com host name: DNS zone transfer DNS zone transfer allows the copying of the master file from a DNS server to another DNS server. There are times when administrators do not configure the security settings on their DNS server properly, which allows an attacker to retrieve the master file containing a list of the names and addresses of a corporate network. Microsoft RPC Endpoint Mapper Not too long ago, CVE-2015-2370 was recorded on the CVE database. This vulnerability took advantage of the authentication implementation of the Remote Procedure Call (RPC) protocol in various versions of the Microsoft Windows platform, both desktop and server operating systems. A successful exploit would allow an attacker to gain local privileges on a vulnerable system. SMTP SMTP is used in mail servers, as with the POP and the Internet Message Access Protocol (IMAP). SMTP is used for sending mail, while POP and IMAP are used to retrieve mail from an email server. SMTP supports various commands, such as EXPN and VRFY. The EXPN command can be used to verify whether a particular mailbox exists on a local system, while the VRFY command can be used to validate a username on a mail server. An attacker can establish a connection between the attacker's machine and the mail server on port 25. Once a successful connection has been established, the server will send a banner back to the attacker's machine displaying the server name and the status of the port (open). Once this occurs, the attacker can then use the VRFY command followed by a user name to check for a valid user on the mail system using the VRFY bob syntax. SYN flooding One of the protocols that exist at the Transport Layer is TCP. TCP is used to establish a connection-oriented session between two devices that want to communication or exchange data. Let's recall how TCP works. There are two devices that want to exchange some messages, Bob and Alice. Bob sends a TCP Synchronization (SYN) packet to Alice, and Alice responds to Bob with a TCP Synchronization/Acknowledgment (SYN/ACK) packet. Finally, Bob replies with a TCP Acknowledgement (ACK) packet. The following diagram shows the TCP 3-Way Handshake mechanism: For every TCP SYN packet received on a device, a TCP ACK packet must be sent back in response. One type of attack that takes advantage of this design flaw in TCP is known as a SYN Flood attack. In a SYN Flood attack, the attacker sends a continuous stream of TCP SYN packets to a target system. This would cause the target machine to process each individual packet and response accordingly; eventually, with the high influx of TCP SYN packets, the target system will become too overwhelmed and stop responding to any requests: TCP reassembly and sequencing During a TCP transmission of datagrams between two devices, each packet is tagged with a sequence number by the sender. This sequence number is used to reassemble the packets back into data. During the transmission of packets, each packet may take a different path to the destination. This may cause the packets to be received in an out-of-order fashion, or in the order they were sent over the wire by the sender. An attacker can attempt to guess the sequencing numbers of packets and inject malicious packets into the network destined for the target. When the target receives the packets, the receiver would assume they came from the real sender as they would contain the appropriate sequence numbers and a spoofed IP address. Summary In this article, we have explored the different types of vulnerabilities that exist at the Application and Transport Layer of the TCP/IP protocol suite. To understand other networking concepts like network architecture, security, network monitoring, and troubleshooting; and ace the CompTIA certification exam, check out our book CompTIA Network+ Certification Guide AWS announces more flexibility its Certification Exams, drops its exam prerequisites Top 10 IT certifications for cloud and networking professionals in 2018 What matters on an engineering resume? Hacker Rank report says skills, not certifications
Read more
  • 0
  • 0
  • 15937

article-image-python-web-development-django-vs-flask-2018
Aaron Lazar
28 May 2018
7 min read
Save for later

Python web development: Django vs Flask in 2018

Aaron Lazar
28 May 2018
7 min read
A colleague of mine, wrote an article over two years ago, talking about the two top Python web frameworks, Django and Flask. It’s 2018 now, and a lot has changed in the IT world. There have been a couple of frameworks that emerged or gained popularity in the last 3 years, like Bottle or CherryPy, for example. However, Django and Flask have managed to stand their ground and have continued to remain as the top 2 Python frameworks. Moreover, there have been some major breakthroughs in web application architecture like the rise of Microservices, that has in turn pushed the growth of newer architectures like Serverless and Cloud-Native. I thought it would be a good idea to present a more modern comparison of these two frameworks, to help you take an informed decision on which one you should be choosing for your application development. So before we dive into ripping these frameworks apart, let’s briefly go over a few factors we’ll be considering while evaluating them. Here’s what I got in mind, in no particular order: Ease of use Popularity Community support Job market Performance Modern Architecture support Ease of use This is something l like to cover first, because I know it’s really important for developers who are just starting out, to assess the learning curve before they attempt to scale it. When I’m talking about ease of use, I’m talking about how easy it is to get started with using the tool in your day to day projects. Flask, like it’s webpage, is a very simple tool to learn, simply because it’s built to be simple. Moreover, the framework is un-opinionated, meaning that it will allow you to implement things the way you choose to, without throwing a fuss. This is really important when you’re starting out. You don’t want to run into too much issues that will break your confidence as a developer. On the other hand, Django is a great framework to learn too. While several Python developers will disagree with me, I would say Django is a pretty complex framework, especially for a newbie. Now this is not all that bad, right. I mean, especially when you’re building a large project, you want to be the one holding the reins. If you’re starting out with some basic projects then, it may be wise not to choose Django. The way I see it, learning Flask first will allow you to learn Django much faster. Popularity Both frameworks are quite popular, with Django starring at 34k on Github, and Flask having a slight edge at 36k. If you take a look at the Google trends, both tools follow a pretty similar trend, with Django’s volume much higher, owing to its longer existence. Source: SEM Rush As mentioned before, Flask is more popular among beginners and those who want to build basic websites easily. On the other hand, Django is more popular among the professionals who have years of experience building robust websites. Community Support and Documentation In terms of community support, we’re looking at how involved the community is, in developing the tool and providing support to those who need it. This is quite important for someone who’s starting out with a tool, or for that matter, when there’s a new version releasing and you need to keep yourself up to date.. Django features 170k tags on Stackoverflow, which is over 7 times that of Flask, which stands at 21k. Although Django is a clear winner in terms of numbers, both mailing lists are quite active and you can receive all the help you need, quite easily. When it comes to documentation, Django has some solid documentation that can help you get up and running in no time. On the other hand, Flask has good documentation too, but you usually have to do some digging to find what you’re looking for. Job Scenes Jobs are really important especially if you’re looking for a corporate one It’s quite natural that the organization that’s hiring you will already be working with a particular stack and they will expect you to have those skills before you step in. Django records around 2k jobs on Indeed in the USA, while Flask records exactly half that amount. A couple of years ago, the situation was pretty much the same; Django was a prime requirement, while Flask had just started gaining popularity. You’ll find a comment stating that “Picking up Flask might be a tad easier then Django, but for Django you will have more job openings” Itjobswatch.uk lists Django as the 2nd most needed Skill for a Python Developer, whereas Flask is way down at 20. Source: itjobswatch.uk Clearly Django is in more demand that Flask. However, if you are an independent developer, you’re still free to choose the framework you wish to work with. Performance Honestly speaking, Flask is a microframework, which means it delivers a much better performance in terms of speed. This is also because in Flask, you could write 10k lines of code, for something that would take 24k lines in Django. Response time comparison for data from remote server: Django vs Flask In the above image we see how both tools perform in terms of loading a response from the server and then returning it. Both tools are pretty much the same, but Flask has a slight edge over Django. Load time comparison from database with ORM: Django vs Flask In this image, we see how the gap between the tools is quite large, with Flask being much more efficient in loading data from the database. When we talk about performance, we also need to consider the power each framework provides you when you want to build large apps. Django is a clear winner here, as it allows you to build massive, enterprise grade applications. Django serves as a full-stack framework, which can easily be integrated with JavaScript to build great applications. On the other hand, Flask is not suitable for large applications. The JetBrains Python Developer Survey revealed that Django was a more preferred option among the respondents. Jetbrains Python Developer Survey 2017 Modern Architecture Support The monolith has been broken and microservices have risen. What’s interesting is that although applications are huge, they’re now composed of smaller services working together to make up the actual application. While you would think Django would be a great framework to build microservices, it turns out that Flask serves much better, thanks to its lightweight architecture and simplicity. While you work on a huge enterprise application, you might find Flask being interwoven wherever a light framework works best. Here’s the story of one company that ditched Django for microservices. I’m not going to score these tools because they’re both awesome in their own right. The difference arises when you need to choose one for your projects and it’s quite evident that Flask should be your choice when you’re working on a small project or maybe a smaller application built into a larger one, maybe a blog or a small website or a web service. Although, if you’re on the A team, making a super awesome website for maybe, Facebook or a billion dollar enterprise, instead of going the Django unchained route, choose Django with a hint of Flask added in, for all the right reasons. :) Django recently hit version 2.0 last year, while Flask hit version 1.0 last month. Here’s some great resources to get you up and running with Django and Flask. So what are you waiting for? Go build that website! Why functional programming in Python matters Should you move to Python 3.7 Why is Python so good for AI and Machine Learning?
Read more
  • 0
  • 0
  • 15859
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-best-machine-learning-datasets-for-beginners
Natasha Mathur
19 Sep 2018
13 min read
Save for later

Best Machine Learning Datasets for beginners

Natasha Mathur
19 Sep 2018
13 min read
“It’s not who has the best algorithm that wins. It’s who has the most data” ~ Andrew Ng If you would look at the way algorithms were trained in Machine Learning, five or ten years ago, you would notice one huge difference. Training algorithms in Machine Learning are much better and efficient today than it used to be a few years ago. All credit goes to the hefty amount of data that is available to us today. But, how does Machine Learning make use of this data? Let’s have a look at the definition of Machine Learning. “Machine Learning provides computers or machines the ability to automatically learn from experience without being explicitly programmed”. Machines “learn from experience” when they’re trained, this is where data comes into the picture. How’re they trained? Datasets!   This is why it is so crucial that you feed these machines with the right data for whatever problem it is that you want these machines to solve. Why datasets matter in Machine Learning? The simple answer is because Machines too like humans are capable of learning once they see relevant data. But where they vary from humans is the amount of data they need to learn from. You need to feed your machines with enough data in order for them to do anything useful for you. This why Machines are trained using massive datasets. We can think of machine learning data like a survey data, meaning the larger and more complete your sample data size is, the more reliable your conclusions will be. If the data sample isn’t large enough then it won’t be able to capture all the variations making your machine reach inaccurate conclusions, learn patterns that don’t really exist, or not recognize patterns that do. Datasets help bring the data to you. Datasets train the model for performing various actions. They model the algorithms to uncover relationships, detect patterns, understand complex problems as well as make decisions. Apart from using datasets, it is equally important to make sure that you are using the right dataset, which is in a useful format and comprises all the meaningful features, and variations. After all, the system will ultimately do what it learns from the data. Feeding right data into your machines also assures that the machine will work effectively and produce accurate results without any human interference required. For instance, training a speech recognition system with a textbook English dataset will result in your machine struggling to understand anything but textbook English. So, any loose grammar, foreign accents, or speech disorders would get missed out. For such a system, using a dataset comprising all the infinite variations in a spoken language among speakers of different genders, ages, and dialects would be a right option. So keep in mind that it is important that the quality, variety, and quantity of your training data is not compromised as all these factors help determine the success of your machine learning models. Top Machine Learning Datasets for Beginners Now, there are a lot of datasets available today for use in your ML applications. It can be confusing, especially for a beginner to determine which dataset is the right one for your project. It is better to use a dataset which can be downloaded quickly and doesn’t take much to adapt to the models. Further, always use standard datasets that are well understood and widely used. This lets you compare your results with others who have used the same dataset to see if you are making progress. You can pick the dataset you want to use depending on the type of your Machine Learning application. Here’s a rundown of easy and the most commonly used datasets available for training Machine Learning applications across popular problem areas from image processing to video analysis to text recognition to autonomous systems. Image Processing There are many image datasets to choose from depending on what it is that you want your application to do. Image processing in Machine Learning is used to train the Machine to process the images to extract useful information from it. For instance, if you’re working on a basic facial recognition application then you can train it using a dataset that has thousands of images of human faces. This is how Facebook knows people in group pictures. This is also how image search works in Google and in other visual search based product sites. Dataset Name Brief Description 10k US Adult Faces Database This database consists of 10,168 natural face photographs and several measures for 2,222 of the faces, including memorability scores, computer vision, and psychological attributes. The face images are JPEGs with 72 pixels/in resolution and 256-pixel height. Google's Open Images Open Images is a dataset of 9 million URLs to images which have been annotated with labels spanning over 6000 categories. These labels cover more real-life entities and the images are listed as having a Creative Commons Attribution license. Visual Genome This is a dataset of over 100k images densely annotated with numerous region descriptions ( girl feeding elephant), objects (elephants), attributes(large), and relationships (feeding). Labeled Faces in the Wild This database comprises more than 13,000 images of faces collected from the web. Each face is labeled with the name of the person pictured.   Fun and easy ML application ideas for beginners using image datasets: Cat vs Dogs: Using Cat and Stanford Dogs dataset to classify whether an image contains a dog or a cat. Iris Flower classification: You can build an ML project using Iris flower dataset where you classify the flowers in any of the three species. What you learn from this toy project will help you learn to classify physical attributes based content to build some fun real-world projects like fraud detection, criminal identification, pain management ( eg; ePAT which detects facial hints of pain using facial recognition technology), and so on. Hot dog - Not hot dog: Use the Food 101 dataset, to distinguish different food types as a hot dog or not. Who knows, you could end up becoming the next Emmy award nominee! Sentiment Analysis As a beginner, you can create some really fun applications using Sentiment Analysis dataset. Sentiment Analysis in Machine Learning applications is used to train machines to analyze and predict the emotion or sentiment associated with a sentence, word, or a piece of text. This is used in movie or product reviews often. If you are creative enough, you could even identify topics that will generate the most discussions using sentiment analysis as a key tool. Dataset Name Brief Description Sentiment140 A popular dataset, which uses 160,000 tweets with emoticons pre-removed Yelp Reviews An open dataset released by Yelp, contains more than 5 million reviews on Restaurants, Shopping, Nightlife, Food, Entertainment, etc. Twitter US Airline Sentiment Twitter data on US airlines starting from February 2015, labeled as positive, negative, and neutral tweets. Amazon reviews This dataset contains over 35 million reviews from Amazon spanning 18 years. Data include information on products, user ratings, and the plaintext review.   Easy and Fun Application ideas using Sentiment Analysis Dataset: Positive or Negative: Using Sentiment140 dataset in a model to classify whether given tweets are negative or positive. Happy or unhappy: Using Yelp Reviews dataset in your project to help machine figure out whether the person posting the review is happy or unhappy.   Good or Bad: Using Amazon Reviews dataset, you can train a machine to figure out whether a given review is good or bad. Natural Language Processing Natural language processing deals with training machines to process and analyze large amounts of natural language data. This is how search engines like Google know what you are looking for when you type in your search query. Use these datasets to make a basic and fun NLP application in Machine Learning: Dataset Name Brief Description Speech Accent Archive This dataset comprises 2140 speech samples from different talkers reading the same reading passage. These Talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English. Wikipedia Links data This dataset consists of almost 1.9 billion words from more than 4 million articles. Search is possible by word, phrase or part of a paragraph itself. Blogger Corpus A dataset comprising 681,288 blog posts gathered from blogger.com. Each blog consists of minimum 200 occurrences of commonly used English words.   Fun Application ideas using NLP datasets: Spam or not: Using Spambase dataset, you can enable your application to figure out whether a given email is spam or not. Video Processing Video Processing datasets are used to teach machines to analyze and detect different settings, objects, emotions, or actions and interactions in videos. You’ll have to feed your machine with a lot of data on different actions, objects, and activities. Dataset Name Brief Description UCF101 - Action Recognition Data Set This dataset comes with 13,320 videos from 101 action categories. Youtube 8M YouTube-8M is a large-scale labeled video dataset. It contains millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities.   Fun Application ideas using video processing dataset: Action detection: Using UCF101 - Action Recognition DataSet, or Youtube 8M, you can train your application to detect the actions such as walking, running etc, in a video. Speech Recognition Speech recognition is the ability of a machine to analyze or identify words and phrases in a spoken language. Feed your machine with the right and good amount of data, and it will help it in the process of recognizing speech. Combine speech recognition with natural language processing, and get Alexa who knows what you need. Dataset Name Brief Description Gender Recognition by Voice and speech analysis This database identifies a voice as male or female, depending on the acoustic properties of voice and speech. The dataset contains 3,168 recorded voice samples, collected from male and female speakers. Human Activity Recognition w/Smartphone Human Activity Recognition database consists of recordings of 30 subjects performing activities of daily living (ADL) while carrying a smartphone ( Samsung Galaxy S2 ) on the waist. TIMIT TIMIT provides speech data for acoustic-phonetic studies and for the development of automatic speech recognition systems. It comprises broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, phonetic and word transcriptions. Speech Accent Archive This dataset contains 2140 speech samples, each from a different talker reading the same reading passage. Talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English.   Fun Application ideas using Speech Recognition dataset: Accent detection: Use Speech Accent Archive dataset, to make your application identify different accents from a given sample of accents. Identify the activity: Use Human Activity Recognition w/Smartphone dataset to help your application detect the human activity. Natural Language Generation Natural Language generation refers to the ability of machines to simulate the human speech. It can be used to translate written information into aural information or assist the vision-impaired by reading out aloud the contents of a display screen. This is how Alexa or Siri respond to you. Dataset Name Brief Description Common Voice by Mozilla Common Voice dataset contains speech data read by users on the Common Voice website from a number of public sources like user-submitted blog posts, old books, movies, etc. LibriSpeech This dataset consists of nearly 500 hours of clean speech of various audiobooks read by multiple speakers, organized by chapters of the book with both the text and the speech.   Fun Application ideas using Natural Language Generation dataset: Converting text into Audio: Using Blogger Corpus dataset, you can train your application to read out loud the posts on blogger. Autonomous Driving Build some basic self-driving Machine Learning Applications. These Self-driving datasets will help you train your machine to sense its environment and navigate accordingly without any human interference. Autonomous cars, drones, warehouse robots, and others use these algorithms to navigate correctly and safely in the real world. Datasets are even more important here as the stakes are higher and the cost of a mistake could be a human life. Dataset Name Brief Description Berkeley DeepDrive BDD100k This is one of the largest datasets for self-driving AI currently. It comprises over 100,000 videos of over 1,100-hour driving experiences across different times of the day and weather conditions. Baidu Apolloscapes Large dataset consisting of 26 different semantic items such as cars, bicycles, pedestrians, buildings, street lights, etc. Comma.ai This dataset consists of more than 7 hours of highway driving. It includes details on car’s speed, acceleration, steering angle, and GPS coordinates. Cityscape Dataset This is a large dataset that contains recordings of urban street scenes in 50 different cities. nuScenes This dataset consists of more than 1000 scenes with around 1.4 million image, 400,000 sweeps of lidars (laser-based systems that detect the distance between objects), and 1.1 million 3D bounding boxes ( detects objects with a combination of RGB cameras, radar, and lidar).   Fun Application ideas using Autonomous Driving dataset: A basic self-driving application: Use any of the self-driving datasets mentioned above to train your application with different driving experiences for different times and weather conditions.   IoT Machine Learning in building IoT applications is on the rise these days. Now, as a beginner in Machine Learning, you may not have advanced knowledge on how to build these high-performance IoT applications using Machine Learning, but you certainly can start off with some basic datasets to explore this exciting space. Dataset Name Brief Description Wayfinding, Path Planning, and Navigation Dataset This dataset consists of samples of trajectories in an indoor building (Waldo Library at Western Michigan University) for navigation and wayfinding applications. ARAS Human Activity Dataset This dataset is a Human activity recognition Dataset collected from two real houses. It involves over 26 millions of sensor readings and over 3000 activity occurrences.   Fun Application ideas using IoT dataset: Wearable device to track human activity: Use the ARAS Human Activity Dataset to train a wearable device to identify human activity. Read Also: 25 Datasets for Deep Learning in IoT Once you’re done going through this list, it’s important to not feel restricted. These are not the only datasets which you can use in your Machine Learning Applications. You can find a lot many online which might work best for the type of Machine Learning Project that you’re working on. Some popular sources of a wide range of datasets are Kaggle,  UCI Machine Learning Repository, KDnuggets, Awesome Public Datasets, and Reddit Datasets Subreddit. With all this information, it is now time to use these datasets in your project. In case you’re completely new to Machine Learning, you will find reading, ‘A nonprogrammer’s guide to learning Machine learning’quite helpful. Regardless of whether you’re a beginner or not, always remember to pick a dataset which is widely used, and can be downloaded quickly from a reliable source. How to create and prepare your first dataset in Salesforce Einstein Google launches a Dataset Search Engine for finding Datasets on the Internet Why learn machine learning as a non-techie?
Read more
  • 0
  • 0
  • 15579

article-image-dl-frameworks-tensorflow-vs-cntk
Aaron Lazar
30 Oct 2017
6 min read
Save for later

The Deep Learning Framework Showdown: TensorFlow vs CNTK

Aaron Lazar
30 Oct 2017
6 min read
The question several Deep Learning engineers may ask themselves is: Which is better, TensorFlow or CNTK? Well, we're going to answer that question for you, taking you through a closely fought match between the two most exciting frameworks. So, here we are, ladies and gentlemen, it's fight night and it's a full house. In the Red corner, weighing in at two hundred and seventy pounds of Python and topping out at over ten thousand frames per second; managed by the American tech giant, Google; we have the mighty, the beefy, TensorFlow! In the Blue corner, weighing in at two hundred and thirty pounds of C++ muscle, we have, one of the top toolkits that can comfortably scale beyond a single machine. Managed by none other than Microsoft, it's fast, it's furious, it's CNTK aka the Microsoft Cognitive Toolkit! And we're into Round One… TensorFlow and CNTK are looking quite menacingly at each other and are raging to take down their opponents. TensorFlow seems pleased that its compile times are considerably faster than its successor, Theano. Although, it looks like happiness came a tad bit soon. CNTK, light and bouncy on it's feet, comes straight out of nowhere with a whopping seventy thousand frames/second upper cut, knocking TensorFlow to the floor. TensorFlow looks like it's in no mood to give up anytime soon. It makes itself so simple to use and understand that even students can pick it up and start training their own models. This isn't the case with CNTK, as it begs to shed its complexity. On the other hand, CNTK seems to be thrashing TensorFlow in terms of 3D convolution, where CNTK can clearly recognize images from streaming content. TensorFlow also tries its best to run LSTM RNNs, but in vain. The crowd keeps cheering on… Wait a minute...are they calling out for TensorFlow? Yes they are! There's hardly any cheering for CNTK. This is embarrassing! Looks like its community support can't match up to TensorFlow's. And ladies and gentlemen, that does make a difference - we can see TensorFlow improving on several fronts and gradually getting back in the game! TensorFlow huffs and puffs as it tries to prove that it's not just about deep learning and that it has tools in the pocket that can support other algorithms such as reinforcement learning. It conveniently whips out the TensorBoard, and drops CNTK to the floor with its beautiful visualizations. TensorFlow now has the upper hand and is trying hard to pin CNTK to the floor and tries to use its R support to finish it off. But CNTK tactfully breaks loose and leaves TensorFlow on the floor - still not ready to be used in production. And there goes the bell for Round One! Both fighters look exhausted but you can see a faint twinkle in TensorFlow's eye, primarily because it survived Round One. Google seems to be working hard to prep it for Round Two and is making several improvements in terms of speed, flexibility and majorly making it ready for production. Meanwhile, Microsoft boosts CNTK's spirits with a shot of Python APIs in its blood. As it moves towards reaching version 2.0, there are a lot of improvements to CNTK, wherein, Microsoft has ensured that it's not left behind, like having a backend for Keras, which puts it on par with TensorFlow. Moreover, there seem to be quite a few experimental features that it looks ready to enter the ring with, like the Java API for example. It's the final round and boy, are these two into a serious stare-down! The referee waves them in and off they are. CNTK needs to get back at TensorFlow. Comfortably supporting multiple GPUs and CPUs out of the box, across both the Microsoft and Linux platforms, it has an advantage over TensorFlow. Is it going to use that trump card? Yes it is! A thousand GPUs and a hundred machines in, and CNTK is raining blows on TensorFlow. TensorFlow clearly drops the ball when it comes to multiple machines, and it rather complicates things. It's high time that TensorFlow turned the tables. Lo and behold! It shows off its mobile deep learning capabilities with TensorFlow Lite, clearly flipping CNTK flat on its back. This is revolutionary and a tremendous breakthrough for TensorFlow! CNTK, however, is clearly the people's choice when it comes to language compatibility. With support for C++, Python, C#/.NET and now Java, it's clearly winning in this area. Round Two is coming to an end, ladies and gentlemen and it's a neck to neck battle out there. We're not sure the judges are going to be able to choose a clear winner, from the looks of it. And…. there goes the bell! While the scores are being tallied, we go over to the teams and some spectators for some gossip on the what's what of deep learning. Did you know having multiple machine support is a huge advantage? It increases speed and efficiency by almost 10 times! That's something! We also got to know that TensorFlow is training hard and is picking up positives from its rival, CNTK. There are also rumors about a new kid called MXNet (read about it here), that has APIs in R, Python and even in Julia! This makes it one helluva framework in terms of flexibility and speed. In fact, AWS is already implementing it while Apple also is rumored to be using it. Clearly, something to watch out for. And finally, the judges have made their decision. Ladies and gentlemen, after two rounds of sheer entertainment, we have the results... TensorFlow CNTK Processing speed 0 1 Learning curve 1 0 Production readiness 0 1 Community support 1 0 CPU, GPU computation support 0 1 Mobile deep learning 1 0 Multiple language compatibility 0 1 It's a unanimous decision and just as we thought, CNTK is the heavyweight champion! CNTK clearly beat TensorFlow in terms of performance, because of its flexibility, speed and ability to use in production! As a Deep Learning engineer, should you be wanting to use one of these frameworks in your tasks, you should check out their features thoroughly, test them out with a test dataset and then implement them to your actual data. After all, it's the choices we make that define a win or a loss - simplicity over resource utilisation, or speed over platform, we must choose our tools wisely. For more information on the kind of tests that both the tools have been put through, read the Research Paper presented by Shaohuai Shi, Qiang Wang, Pengfei Xu and Xiaowen Chu from the Department of Computer Science, Hong Kong Baptist University and these benchmarks.
Read more
  • 0
  • 1
  • 15573

article-image-iterative-machine-learning-step-towards-model-accuracy
Amarabha Banerjee
01 Dec 2017
10 min read
Save for later

Iterative Machine Learning: A step towards Model Accuracy

Amarabha Banerjee
01 Dec 2017
10 min read
Learning something by rote i.e., repeating it many times, perfecting a skill by practising it over and over again or building something by making minor adjustments progressively to a prototype are things that comes to us naturally as human beings. Machines can also learn this way and this is called ‘Iterative machine learning’. In most cases, iteration is an efficient learning approach that helps reach the desired end results faster and accurately without becoming a resource crunch nightmare. Now, you might wonder, isn’t iteration inherently part of any kind of machine learning? In other words, modern day machine learning techniques across the spectrum from basic regression analysis, decision trees, Bayesian networks, to advanced neural nets and deep learning algorithms have some inherent iterative component built into them. What is the need, then, for discussing iterative learning as a standalone topic? This is simply because introducing iteration externally to an algorithm can minimize the error margin and therefore help in accurate modelling.  How Iterative Learning works Let’s understand how iteration works by looking closely at what happens during a single iteration flow within a machine learning algorithm. A pre-processed training dataset is first introduced into the model. After processing and model building with the given data, the model is tested, and then the results are matched with the desired result/expected output. The feedback is then returned back to the system for the algorithm to further learn and fine tune its results. This clearly shows that two iteration processes take place here: Data Iteration - Inherent to the algorithm Model Training Iteration - Introduced externally Now, what if we did not feedback the results into the system i.e. did not allow the algorithm to learn iteratively but instead adopted a sequential approach? Would the algorithm work and would it provide the right results? Yes, the algorithm would definitely work. However, the quality of the results it produces is going to vary vastly based on a number of factors. The quality and quantity of the training dataset, the feature definition and extraction techniques employed, the robustness of the algorithm itself are among many other factors. Even if all of the above were done perfectly, there is still no guarantee that the results produced by a sequential approach will be highly accurate. In short, the results will neither be accurate nor reproducible. Iterative learning thus allows algorithms to improve model accuracy. Certain algorithms have iteration central to their design and can be scaled as per the data size. These algorithms are at the forefront of machine learning implementations because of their ability to perform faster and better. In the following sections we will discuss iteration in different sets of algorithms each from the three main machine learning approaches - supervised ML, unsupervised ML and reinforcement learning. The Boosting algorithms: Iteration in supervised ML The boosting algorithms, inherently iterative in nature, are a brilliant way to improve results by minimizing errors. They are primarily designed to reduce bias in results and transform a particular set of weak learning classifier algorithms to strong learners and to enable them to reduce errors. Some examples are: AdaBoost (Adaptive Boosting) Gradient Tree Boosting XGBoost How they work All boosting algorithms have a common classifiers which are iteratively modified to reach the desired result. Let’s take the example of finding cases of plagiarism in a certain article. The first classifier here would be to find a group of words that appear somewhere else or in another article which would result in a red flag. If we create 10 separate group of words and term them as classifiers 1 to 10, then our article will be checked on the basis of this classifier and any possible matches will be red flagged. But no red flags with these 10 classifiers would not mean a definite 100% original article. Thus, we would need to update the classifiers, create shorter groups perhaps based on the first pass and improve the accuracy with which the classifiers can find similarity with other articles. This iteration process in Boosting algorithms eventually leads us to a fairly high rate of accuracy. The reason being after each iteration, the classifiers are updated based on their performance. The ones which have close similarity with other content are updated and tweaked so that we can get a better match. This process of improving the algorithm inherently, is termed as boosting and is currently one of the most popular methods in Supervised Machine Learning. Strengths & weaknesses The obvious advantage of this approach is that it allows minimal errors in the final model as the iteration enables the model to correct itself every time there is an error. The downside is the higher processing time and the overall memory requirement for a large number of iterations. Another important aspect is that the error fed back to train the model is done externally, which means the supervisor has control over the model and how it modifies. This in turn has a downside that the model doesn’t learn to eliminate error on its own. Hence, the model is not reusable with another set of data. In other words, the model does not learn how to become error-free by itself and hence cannot be ported to another dataset as it would need to start the learning process from scratch. Artificial Neural Networks: Iteration in unsupervised ML Neural Networks have become the poster child for unsupervised machine learning because of their accuracy in predicting data models. Some well known neural networks are: Convolutional Neural Networks   Boltzmann Machines Recurrent Neural Networks Deep Neural Networks Memory Networks How they work Artificial neural networks are highly accurate in simulating data models mainly because of their iterative process of learning. But this process is different from the one we explored earlier for Boosting algorithms. Here the process is seamless and natural and in a way it paves the way for reinforcement learning in AI systems. Neural Networks consist of electronic networks simulating the way the human brain is works. Every network has an input and output node and in-between hidden layers that consist of algorithms. The input node is given the initial data set to perform a set of actions and each iteration creates a result that is output as a string of data. This output is then matched with the actual result dataset and the error is then fed back to the input node. This error then enables the algorithms to correct themselves and reach closer and closer to the actual dataset. This process is called training the Neural Networks and each iteration improve the accuracy. The key difference between the iteration performed here as compared to how it is performed by Boosting algorithms is that here we don’t have to update the classifiers manually, the algorithms change themselves based on the error feedback. Strengths & weaknesses The main advantage of this process is obviously the level of accuracy that it can achieve on its own. The model is also reusable because it learns the means to achieve accuracy and not just gives you a direct result. The flip side of this approach is that the models can go wrong heavily and deviate completely in a different direction. This is because the induced iteration takes its own course and doesn’t need human supervision. The facebook chat-bots deviating from their original goal and communicating within themselves in a language of their own is a case in point. But as is the saying, smart things come with their own baggage. It’s a risk we would have to be ready to tackle if we want to create more accurate models and smarter systems.    Reinforcement Learning Reinforcement learning is a interesting case of machine learning where the simple neural networks are connected and together they interact with the environment to learn from their mistakes and rewards. The iteration introduced here happens in a complex way. The iteration happens in the form of reward or punishment for arriving at the correct or wrong results respectively. After each interaction of this kind, the multilayered neural networks incorporate the feedback, and then recreate the models for better accuracy. The typical type of reward and punishment method somewhat puts it in a space where it is neither supervised nor unsupervised, but exhibits traits of both and also has the added advantage of producing more accurate results. The con here is that the models are complex by design. Multilayered neural networks are difficult to handle in case of multiple iterations because each layer might respond differently to a certain reward or punishment. As such it may create inner conflict that might lead to a stalled system - one that can’t decide which direction to move next. Some Practical Implementations of Iteration Many modern day machine learning platforms and frameworks have implemented the iteration process on their own to create better data models, Apache Spark and MapR are two such examples. The way the two implement iteration is technically different and they have their merits and limitations. Let’s look at MapReduce. It reads and writes data directly onto HDFS filesystem present on the disk. Note that for every iteration to be read and written from the disk needs significant time. This in a way creates a more robust and fault tolerant system but compromises on the speed. On the other hand, Apache Spark stores the data in memory (Resilient Distributed DataSet) i.e. in the RAM. As a result, each iteration takes much less time which enables Spark to perform lightning fast data processing. But the primary problem with the Spark way of doing iteration is that dynamic memory or RAM is much less reliable than disk memory to store iteration data and perform complex operations. Hence it’s much less fault tolerant that MapR.   Bringing it together To sum up the discussion, we can look at the process of iteration and its stages in implementing machine learning models roughly as follows: Parameter Iteration: This is the first and inherent stage of iteration for any algorithm. The parameters involved in a certain algorithm are run multiple times and the best fitting parameters for the model are finalized in this process. Data Iteration: Once the model parameters are finalized, the data is put into the system and the model is simulated. Multiple sets of data are put into the system to check the parameters’ effectiveness in bringing out the desired result. Hence, if data iteration stage suggests that some of the parameters are not well suited for the model, then they are taken back to the parameter iteration stage and parameters are added or modified. Model Iteration: After the initial parameters and data sets are finalized, the model testing/ training happens. The iteration in model testing phase is all about running the same model simulation multiple times with the same parameters and data set, and then checking the amount of error, if the error varies significantly in every iteration, then there is something wrong with either the data or the parameter or both. Iterations are done to data and parameters until the model achieves accuracy.   Human Iteration: This step involves the human induced iteration where different models are put together to create a fully functional smart system. Here, multiple levels of fitting and refitting happens to achieve a coherent overall goal such as creating a driverless car system or a fully functional AI. Iteration is pivotal to creating smarter AI systems in the near future. The enormous memory requirements for performing multiple iterations on complex data sets continue to pose major challenges. But with increasingly better AI chips, storage options and data transfer techniques, these challenges are getting easier to handle. We believe iterative machine learning techniques will continue to lead the transformation of the AI landscape in the near future.  
Read more
  • 0
  • 0
  • 15573

article-image-five-biggest-challenges-information-security-2017
Charanjit Singh
08 Nov 2016
5 min read
Save for later

Five Biggest Challenges in Information Security in 2017

Charanjit Singh
08 Nov 2016
5 min read
Living in the digital age brings its own challenges. News of security breaches in well-known companies is becoming a normal thing. In the battle between those who want to secure the Internet and those who want to exploit its security vulnerabilities, here's a list of five significant security challenges that I think information security is/will be facing in 2017. Army of young developers Everyone's beloved celebrity is encouraging the population to learn how to code, and it's working. Learning to code is becoming easier every day. There are loads of apps and programs to help people learn to code. But not many of them care to teach how to write secure code. Security is usually left as an afterthought, an "advanced" topic to learn sometime in future. Even without the recent fame, software development is a lucrative career. It has attracted a lot of 9-to-5ers who just care about getting through the day and collecting their paycheck. This army of young developers who care little about the craft is most to blame when it comes to vulnerabilities in applications. It would astonish you to learn how many people simply don't care about the security of their applications. The pressure to ship and ever-slipping deadlines don't make it any better. Rise of the robots I mean IoT devices. Sorry, I couldn't resist the temptation. IoT devices are everywhere. "Internet of Things" they call it. As if Internet wasn't insecure enough already, it's on "things" now. Most of these things rarely have any concept of security. Your refrigerator can read your tweets, and so can your 13-year-old neighbor. We've already seen a lot of famous disclosures of cars getting hacked. It's one of the examples of how dangerous it can get. Routers and other such infrastructure devices are becoming smarter and smarter. The more power they get, the more lucrative they become for a hacker to attack them. Your computer may have a firewall and anti-virus and other fancy security software, but your router might not. Most people don't even change the default password for such devices. It's much easier for an attacker to simply control your means of connecting to the Internet than connecting to your device directly. On the other front, these devices can be (and have been) used as bots to launch attacks (like DDoS) elsewhere. Internet advertisements as malware The Internet economy is hugely dependent on advertisements. Advertisements is a big big business, but it is becoming uglier and uglier every day. As if tracking users all over the webs and breaching their privacy was not enough, advertisements are now used for spreading malware. Ads are very attractive to attackers as they can be used to distribute content on fully legitimate sites without actually compromising them. They've already been in the news for this very reason lately. So the Internet can potentially be used to do great damage. Mobile devices Mobile apps go everywhere you go. That cute little tap game you installed yesterday might result in the demise of your business. But that's just the tip of the iceberg. Android will hopefully add essential features to limit permissions granted to installed apps. New exploits are emerging everyday for vulnerabilities in mobile operating systems and even in the processor chips. Your company might have a secure network with every box checked, but what about the laptop and mobile device that Cindy brought in? Organizations need to be ever more careful about the electronic devices their employees bring into the premises, or use to connect to the company network. The house of security cards crumbles fast if attackers get access to the network through a legitimate medium. The weakest links If you follow the show Mr. Robot (you should, it's brilliant), you might remember a scene from the first Season when they plan to attack the "impenetrable" Steel Mountain. Quoting Elliot: Nothing is actually impenetrable. A place like this says it is, and it’s close, but people still built this place, and if you can hack the right person, all of a sudden you have a piece of powerful malware. People always make the best exploits. People are the weakest links in many technically secure setups. They're easiest to hack. Social engineering is the most common (and probably easiest) way to get access to an otherwise secure system. With the rise in advanced social engineering techniques, it is becoming crucial everyday to teach the employees how to detect and prevent such attacks. Even if your developers are writing secure code, it's doesn’t matter if the customer care representative just gives the password away or grants access to an attacker. Here's a video of how someone can break into your phone account with a simple call to your phone company. Once your phone account is gone, all your two-factor authentications (that depend on SMS-based OTPs) are worth nothing. About the author Charanjit Singh is a freelance JavaScript (React/Express) developer. Being an avid fan of functional programming, he’s on his way to take on Haskell/Purescript as his main professional languages.
Read more
  • 0
  • 0
  • 15551
article-image-difference-between-working-indie-and-aaa-game-development
Raka Mahesa
02 Oct 2017
5 min read
Save for later

The Difference Between Working in Indie and AAA Game Development

Raka Mahesa
02 Oct 2017
5 min read
Let's say we have two groups of video games. In the first group, we have games like The Witcher 3, Civilization VI, and Overwatch. And in the second group, we have games like Super Meat Boy, Braid, and Stardew Valley. Can you tell the difference between these two groups? Is one group of games better than the other? No, they are all good games that have achieved both critical and financial success. Are the games in the first group sequels, while games in the second group are new? No, Overwatch is a new, original IP. Are the games in the first group more expensive than the second group? Now we're getting closer. The truth is, the first group of games comes from searching Google for "popular AAA games," while the second group comes from searching for "popular indie games." In short, the games in the first group are AAA games, and in the second group are indie games. Indie vs. AAA game development Now that we've seen the difference between the two groups, why do people separate these games into two different groups? What makes these two groups of games different from each other? Some would say that they are priced differently, but there are actually AAA games with low pricing as well as indie games with expensive pricing. How about the scale of the games? Again, there are indie games with big, massive worlds, and there are also AAA games set in short, small worlds. From my perspective, the key difference between the two groups of games is the size of the company developing the games. Indie games are usually made by companies with less than 30 people, and some are even made by less than five people. On the other hand, AAA games are made by much bigger companies, usually with hundreds of employees. Game development teams: size matters Earlier, I mentioned that company size is the key difference between indie games and AAA games. So it's not surprising that it's also the main difference between indie and AAA game development. In fact, the difference in team or company size leads to every difference between the two game development processes. Let's start with something personal, your role or position in the development team. Big teams usually have every position they need already filled. If they need someone to work on the game engine, they already have a engine programmer there. If they need someone to design a level, they already have a level designer working on it. In a big team, your role is already determined from the start, and you will rarely work on any task outside of your job description. If AAA game development values specialists, then indie game development values generalists who can fill multiple roles. It's not weird at all in a small development team if a programmer is asked to deal with both networking as well as enemy AI. Small teams usually aren't able to individually cover all the needed positions, so they turn to people who are able to work on a variety of tasks. Funding across the games industry Let's move to another difference, this time from the funding aspect. A large team requires a large amount of funding, simply because it has more people that need to be paid. And, if you look at the bigger picture, it also means that video games made by a large team have a large development cost. The opposite rings true as well; indie game development has much smaller development costs because they have smaller teams. Because every project has a chance of failure, the large development cost of AAA games becomes a big problem. If you're only spending a little money, maybe you're fine with a small chance of failure, but if you're spending a large sum of money, you definitely want to reduce that risk as much as possible. This ends up with AAA game development being much more risk-averse; they're trying to avoid risk as much as possible. In AAA game development, when there's a decision that needs to be made, the team will try to make sure that they don't make the wrong choice. They will do extensive market research and they will see what is trending in the market. They'd want to grab as many audience members as possible, so if there's any design that will exclude a significant amount of customers, it will be cut out. On the other hand, indie game development doesn't spend that much money. With a smaller development cost, indie games don't need to have a massive amount of sale to recoup their costs. Because of that, they're willing to take risks with experimental and unorthodox design, giving the team the creative freedom without needing to do market research. That said, indie game development harbors a different kind of risk. Unlike their bigger counterpart, indie game developers tend to live from one game to the next. That is, they use the revenue from their current game to fund the development of their next game. So if any of their games don't perform well, they could immediately close down. And that's another difference between the two game development process, AAA game development tends to be more financially stable compared to indie development. There are more differences between indie and AAA game development, but the ones listed above are definitely some of the most prominent. All in all, one development process isn't better than the other, and it falls back on you to decide which one is better suited for you. Raka Mahesa is a game developer at Chocoarts, who is interested in digital technology in general. Outside of work hours, he likes to work on his own projects, with Corridoom VR being his latest released game. Raka also regularly tweets as @legacy99.
Read more
  • 0
  • 0
  • 15302

article-image-nvidia-leads-the-ai-hardware-race-but-which-of-its-gpus-should-you-use-for-deep-learning
Prasad Ramesh
29 Aug 2018
8 min read
Save for later

NVIDIA leads the AI hardware race. But which of its GPUs should you use for deep learning?

Prasad Ramesh
29 Aug 2018
8 min read
For readers who are new to deep learning and who might be wondering what a GPU is, let’s start there. To make it simple, consider deep learning as nothing more than a set of calculations - complex calculations, yes, but calculations nonetheless. To run these calculations, you need hardware. Ordinarily, you might just use a normal processor like the CPU inside your laptop. However, this isn’t powerful enough to process at the speed at which deep learning computations need to happen. GPUs, however, can. This is because while a conventional CPU has only a few complex cores, a GPU can have thousands of simple cores. With a GPU, training a deep learning data set can take just hours instead of days. However, although it’s clear that GPUs have significant advantages over CPUs, there is nevertheless a range of GPUs available, each having their own individual differences. Selecting one is ultimately a matter of knowing what your needs are. Let’s dig deeper and find out how to go about shopping for GPUs… What to look for before choosing a GPU? There are a few specifications to consider before picking a GPU. Memory bandwidth: This determines the capacity of a GPU to handle large amounts of data. It is the most important performance metric, as with faster memory bandwidth more data can be processed at higher speeds. Number of cores: This indicates how fast a GPU can process data. A large number of CUDA cores can handle large datasets well. CUDA cores are parallel processors similar to cores in a CPU but their number is in thousands and are not suited for complex calculations that a CPU core can perform. Memory size: For computer vision projects, it is crucial for memory size to be as much as you can afford. But with natural language processing, memory size does not play such an important role. Our pick of GPU devices to choose from The go to choice here is NVIDIA; they have standard libraries that make it simple to set things up. Other graphics cards are not very friendly in terms of the libraries supported for deep learning. NVIDIA CUDA Deep Neural Network library also has a good development community. “Is NVIDIA Unstoppable In AI?” -Forbes “Nvidia beats forecasts as sales of graphics chips for AI keep booming” -SiliconANGLE AMD GPUs are powerful too but lack library support to get things running smoothly. It would be really nice to see some AMD libraries being developed to break the monopoly and give more options to the consumers. NVIDIA RTX 2080 Ti: The RTX line of GPUs are to be released in September 2018. The RTX 2080 Ti will be twice as fast as the 1080 Ti. Price listed on NVIDIA website for founder’s edition is $1,199. RAM: 11 GB Memory bandwidth: 616 GBs/second Cores: 4352 cores @ 1545 MHz NVIDIA RTX 2080: This is more cost efficient than the 2080 Ti at a listed price of $799 on NVIDIA website for the founder’s edition. RAM: 8 GB Memory bandwidth: 448 GBs/second Cores: 2944 cores @ 1710 MHz NVIDIA RTX 2070: This is more cost efficient than the 2080 Ti at a listed price of $599 on NVIDIA website. Note that the other versions of the RTX cards will likely be cheaper than the founder’s edition around a $100 difference. RAM: 8 GB Memory bandwidth: 448 GBs/second Cores: 2304 cores @ 1620 MHz NVIDIA GTX 1080 Ti: Priced at $650 on Amazon. This is a higher end option but offers great value for money, and can also do well in Kaggle competitions. If you need more memory but cannot afford the RTX 2080 Ti go for this. RAM: 11 GB Memory bandwidth: 484 GBs/second Cores: 3584 cores @ 1582 MHz NVIDIA GTX 1080: Priced at $584 on Amazon. This is a mid-high end option only slightly behind the 1080Ti. VRAM: 8 GB Memory bandwidth: 320 GBs/second Processing power: 2560 cores @ 1733 MHz NVIDIA GTX 1070 Ti: Priced at around $450 on Amazon. This is slightly less performant than the GTX 1080 but $100 cheaper. VRAM: 8 GB Memory bandwidth: 256 GBs/second Processing power: 2438 cores @ 1683 MHz NVIDIA GTX 1070: Priced at $380 on Amazon is currently the bestseller because of crypto miners. Somewhat slower than the 1080 GPUs but cheaper. VRAM: 8 GB Memory bandwidth: 256 GBs/second Processing power: 1920 cores @ 1683 MHz NVIDIA GTX 1060 6GB: Priced at around $290 on Amazon. Pretty cheap but the 6 GB VRAM limits you. Should be good for NLP but you’ll find the performance lacking in computer vision. VRAM: 6 GB Memory bandwidth: 216 GBs/second Processing power: 1280 cores @ 1708 MHz NVIDIA GTX 1050 Ti: Priced at around $200 on Amazon. This is the cheapest workable option. Good to get started with deep learning and explore if you’re new. VRAM: 4 GB Memory bandwidth: 112 GBs/second Processing power: 768 cores @ 1392 MHz NVIDIA Titan XP: The Titan XP is also an option but gives only a marginally better performance while being almost twice as expensive as the GTX 1080 Ti, it has 12 GB memory, 547.7 GB/s bandwidth and 3840 cores @ 1582 MHz. On a side note, NVIDIA Quadro GPUs are pretty expensive and don’t really help in deep learning they are more of use in CAD and working with heavy graphics production tasks. The graph below does a pretty good job of visualizing how all the GPUs above compare: Source: Slav Ivanov Blog, processing power is calculated as CUDA cores times the clock frequency Does the number of GPUs matter? Yes, it does. But how many do you really need? What’s going to suit the scale of your project without breaking your budget? 2 GPUs will always yield better results than just one - but it’s only really worth it if you need the extra power. There are two options you can take with multi-GPU deep learning. On the one hand, you can train several different models at once across your GPUs, or, alternatively distribute one single training model across multiple GPUs known as  “multi-GPU training”. The latter approach is compatible with TensorFlow, CNTK, and PyTorch. Both of these approaches have advantages. Ultimately, it depends on how many projects you’re working on and, again, what your needs are. Another important point to bear in mind is that if you’re using multiple GPUs, the processor and hard disk need to be fast enough to feed data continuously - otherwise the multi-GPU approach is pointless. Source: NVIDIA website It boils down to your needs and budget, GPUs aren’t exactly cheap.   Other heavy devices There are also other large machines apart from GPUs. These include the specialized supercomputer from NVIDIA, the DGX-2, and Tensor processing units (TPUs) from Google. The NVIDIA DGX-2 If you thought GPUs were expensive, let me introduce you to NVIDIA DGX-2, the successor to the NVIDIA DGX-1. It’s a highly specialized workstation; consider it a supercomputer that has been specially designed to tackle deep learning. The price of the DGX-2 is (*gasp*) $399,000. Wait, what? I could buy some new hot wheels for that, or Dual Intel Xeon Platinum 8168, 2.7 GHz, 24-cores, 16 NVIDIA GPUs, 1.5 terabytes of RAM, and nearly 32 terabytes of SSD storage! The performance here is 2 petaFLOPS. Let’s be real: many of us probably won’t be able to afford it. However, NVIDIA does have leasing options, should you choose to try it. Practically speaking, this kind of beast finds its use in research work. In fact, the first DGX-1 was gifted to OpenAI by NVIDIA to promote AI research. Visit the NVIDIA website for more on these monster machines. There are also personal solutions available like the NVIDIA DGX Workstation. TPUs Now that you’ve caught your breath after reading about AI dream machines, let’s look at TPUs. Unlike the DGX machines, TPUs run on the cloud. A TPU is what’s referred to as an application-specific integrated circuit (ASIC) that has been designed specifically for machine learning and deep learning by Google. Here’s the key stats: Cloud TPUs can provide up to 11.5 petaflops of performance in a single pod. If you want to learn more, go to Google’s website. When choosing GPUs you need to weigh up your options The GTX 1080 Ti is most commonly used by researchers and competitively for Kaggle, as it gives good value for money. Go for this if you are sure about what you want to do with deep learning. The GTX 1080 and GTX 1070 Ti are cheaper with less computing power, a more budget friendly option if you cannot afford the 1080 Ti. GTX 1070 saves you some more money but is slower. The GTX 1060 6GB and GTX 1050 Ti are good if you’re just starting off in the world of deep learning without burning a hole in your pockets. If you must have the absolute best GPU irrespective of the cost then the RTX 2080 Ti is your choice. It offers twice the performance for almost twice the cost of a 1080 Ti. Nvidia unveils a new Turing architecture: “The world’s first ray tracing GPU” Nvidia GPUs offer Kubernetes for accelerated deployments of Artificial Intelligence workloads Nvidia’s Volta Tensor Core GPU hits performance milestones. But is it the best?
Read more
  • 0
  • 0
  • 15211

article-image-ai-rescue-5-ways-machine-learning-can-assist-emergency-situations
Sugandha Lahoti
16 Jan 2018
9 min read
Save for later

AI to the rescue: 5 ways machine learning can assist during emergency situations

Sugandha Lahoti
16 Jan 2018
9 min read
At the wee hours of the night, on January 4 this year, over 9.8 million people experienced a magnitude 4.4 earthquake that rumbled across the San Francisco Bay Area. This was followed by a magnitude 7.6 earthquake in the Caribbean sea on January 9, following which a tsunami advisory was in effect for Puerto Rico and the U.S. and British Virgin Islands. In the past 6 months, the United States alone has witnessed four back-to-back storms from one brutal hurricane season, and a massive wildfire with almost 2 million acres of land ablaze. Natural Disasters across the globe are increasingly becoming more damaging and frequent. Since 1970, the number of disasters worldwide has more than quadrupled to around 400 a year. These series of natural disasters have strained the emergency services and disaster relief operations beyond capacity. We now need to look for newer ways to assist the affected people and automate the recovery process. Artificial intelligence and machine learning have advanced to the state where they are highly proficient in making predictions, and in identification and classification tasks. These use cases of AI can also be applied to prevent disasters or respond quickly, in case of an emergency.   Here are 5 ways how AI can lend a helping hand during emergency situations. 1. Machine Learning for targeted disaster relief management In case of any disaster, the first step is to formulate a critical response team to help those in distress. Before the team goes into action, it is important to analyze and assess the extent of damage and to ensure that the right aid goes first to those who need it the most. AI techniques such as image recognition and classification can be quite helpful in assessing the damage as they can analyze and observe images from the satellites. They can immediately and efficiently filter these images, which would have required months to be sorted manually. AI can identify objects and features such as damaged buildings, flooding, blocked roads from these images. They can also identify temporary settlements which may indicate that people are homeless, and so the first care could be directed towards them.   Artificial intelligence and machine learning tools can also aggregate and crunch data from multiple resources such as crowd-sourced mapping materials or Google maps. Machine learning approaches then combine all this data together, remove unreliable data, and identify informative sources to generate heat maps. These heat maps can identify areas in need of urgent assistance and direct relief efforts to those areas. Heat maps are also helpful for government and other humanitarian agencies in deciding where to conduct aerial assessments. DigitalGlobe provides space imagery and geospatial content. Their Open Data Program is a special program for disaster response. The software learns how to recognize buildings on satellite photos by learning from the crowd. DigitalGlobe releases pre- and post-event imagery for select natural disasters each year, and their crowdsourcing platform, Tomnod, will prioritize micro-tasking to accelerate damage assessments. Following the Nepal Earthquakes in 2015, Rescue Global and academicians from the Orchid Project used machine learning to carry out rescue activities. They took pre and post-disaster imagery and utilized crowd-sourced data analysis and machine learning to identify locations affected by the quakes that had not yet been assessed or received aid. This information was then shared with relief workforces to facilitate their activities. 2. Next Generation 911 911 is the first source of contact during any emergency situation. 911 dispatch centers are already overloaded with calls on a regular day. In case of a disaster or calamity, the number gets quadrupled, or even more. This calls for augmenting traditional 911 emergency centers with newer technologies for better management. Traditional 911 centers rely on voice-based calls alone. Next-gen dispatch services are upgrading their emergency dispatch technology with machine learning to receive more types of data. So now they can ingest the data from not just calls but also from text, video, audio, and pictures, to analyze them to make quick assessments. The insights gained from all this information can be passed on to the emergency response teams out in the field to efficiently carry out critical tasks. The Association of Public-Safety Communications (APCO) have employed IBM’s Watson to listen to 911 calls. This initiative is to help emergency call centers improve operations and public safety by using Watson’s speech-to-text and analytics programs. Using Watson's speech-to-text function, the context of each call is fed into the AI's analytics program allowing improvements in how call centers respond to emergencies. It also helps in reducing call times, provide accurate information, and help accelerate time-sensitive emergency services. 3. Sentiment analysis on social media data for disaster management and recovery Social media channels are a major source of news in present times. Some of the most actionable information, during a disaster, comes from social media users. Real-time images and comments from Facebook, Twitter, Instagram, and YouTube can be analyzed and validated by AI to filter real information from fake ones. These vital stats can help on-the-ground aid workers to reach the point of crisis sooner and direct their efforts to the needy. This data can also help rescue workers in reducing the time needed to find victims. In addition, AI and predictive analytics software can analyze digital content from Twitter, Facebook, and Youtube to provide early warnings, ground-level location data, and real-time report verification. In fact, AI could also be used to view the unstructured data and background of pictures and videos posted to social channels and compare them to find missing people. AI-powered chatbots can help residents affected by a calamity. The chatbot can interact with the victim, or other citizens in the vicinity via popular social media channels and ask them to upload information such as location, a photo, and some description. The AI can then validate and check this information from other sources and pass on the relevant details to the disaster relief committee. This type of information can assist them with assessing damage in real time and help prioritize response efforts. AI for Digital Response (AIDR) is a free and open platform which uses machine intelligence to automatically filter and classify social media messages related to emergencies, disasters, and humanitarian crises. For this, it uses a Collector and a Tagger. The Collector helps in collecting and filtering tweets using keywords and hashtags such as "cyclone" and "#Irma," for example. The Collector works as a word-filter. The Tagger is a topic-filter which classifies tweets by topics of interest, such as "Infrastructure Damage," and "Donations," for example. The Tagger automatically applies the classifier to incoming tweets collected in real-time using the Collector. 4. AI answers distress and help-calls Emergency relief services are flooded with distress and help calls in the event of any emergency situation. Managing such a huge amount of calls is time-consuming and expensive when done manually. The chances of a critical information being lost or unobserved is also a possibility. In such cases, AI can work as a 24/7 dispatcher. AI systems and voice assistants can analyze massive amounts of calls, determine what type of incident occurred and verify the location. They can not only interact with callers naturally and process those calls, but can also instantly transcribe and translate languages. AI systems can analyze the tone of voice for urgency, filtering redundant or less urgent calls and prioritizing them based on the emergency. Blueworx is a powerful IVR platform which uses AI to replace call center officials. Using AI technology is especially useful when unexpected events such as natural disasters drive up call volume. Their AI engine is well suited to respond to emergency calls as unlike a call center agent, it can know who a customer is even before they call. It also provides intelligent call routing, proactive outbound notifications, unified messaging, and Interactive Voice Response. 5. Predictive analytics for proactive disaster management Machine learning and other data science approaches are not limited to assisting the on-ground relief teams or assisting only after the actual emergency. Machine learning approaches such as predictive analytics can also analyze past events to identify and extract patterns and populations vulnerable to natural calamities. A large number of supervised and unsupervised learning approaches are used to identify at-risk areas and improve predictions of future events. For instance, clustering algorithms can classify disaster data on the basis of severity. They can identify and segregate climatic patterns which may cause local storms with the cloud conditions which may lead to a widespread cyclone. Predictive machine learning models can also help officials distribute supplies to where people are going, rather than where they were by analyzing real-time behavior and movement of people. In addition, predictive analytics techniques can also provide insight for understanding the economic and human impact of natural calamities. Artificial neural networks take in information such as region, country, and natural disaster type to predict the potential monetary impact of natural disasters. Recent advances in cloud technologies and numerous open source tools have enabled predictive analytics with almost no initial infrastructure investment. So agencies with limited resources can also build systems based on data science and develop more sophisticated models to analyze disasters.   Optima Predict, a suite of software by Intermedix collects and reads information about disasters such as viral outbreaks or criminal activity in real time. The software spots geographical clusters of reported incidents before humans notice the trend and then alerts key officials about it. The data can also be synced with FirstWatch, which is an online dashboard for an EMS (Emergency Medical Services) personnel. Thanks to the multiple benefits of AI, government agencies and NGOs can start utilizing machine learning to deal with disasters. As AI and allied fields like robotics further develop and expand, we may see a fleet of drone services, equipped with sophisticated machine learning. These advanced drones could expedite access to real-time information at disaster sites using video capturing capabilities and also deliver lightweight physical goods to hard to reach areas. As with every progressing technology, AI will also build on its existing capabilities. It has the potential to eliminate outages before they are detected and give disaster response leaders an informed, clearer picture of the disaster area, ultimately saving lives.
Read more
  • 0
  • 0
  • 15152
article-image-cloud-computing-services-iaas-paas-saas
Amey Varangaonkar
07 Aug 2018
4 min read
Save for later

Types of Cloud Computing Services: IaaS, PaaS, and SaaS

Amey Varangaonkar
07 Aug 2018
4 min read
Cloud computing has risen massively in terms of popularity in recent times. This is due to the way it reduces on-premise infrastructure cost and improves efficiency. Primarily, the cloud model has been divided into three major service categories: Infrastructure as a Service (IaaS) Platform as a Service (PaaS) Software as a Service (SaaS) We will discuss each of these instances in the following sections: The article is an excerpt taken from the book 'Cloud Analytics with Google Cloud Platform', written by Sanket Thodge. Infrastructure as a Service (IaaS) Infrastructure as a Service often provides the infrastructure such as servers, virtual machines, networks, operating system, storage, and much more on a pay-as-you-use basis. IaaS providers offer VM from small to extra-large machines. The IaaS gives you complete freedom while choosing the instance type as per your requirements: Common cloud vendors providing the IaaS services are: Google Cloud Platform Amazon Web Services IBM HP Public Cloud Platform as a Service (PaaS) The PaaS model is similar to IaaS, but it also provides the additional tools such as database management system, business intelligence services, and so on. The following figure illustrates the architecture of the PaaS model: Cloud platforms providing PaaS services are as follows: Windows Azure Google App Engine Cloud Foundry Amazon Web Services Software as a Service (SaaS) Software as a Service (SaaS) makes the users connect to the products through the internet (or sometimes also help them build in-house as a private cloud solution) on a subscription basis model. Below image shows the basic architecture of SaaS model. Some cloud vendors providing SaaS are: Google Application Salesforce Zoho Microsoft Office 365 Differences between SaaS, PaaS, and IaaS The major differences between these models can be summarized to a table as follows: Software as a Service (SaaS) Platform as a Service (PaaS) Infrastructure as a Service (IaaS) Software as a service is a model in which a third-party provider hosts multiple applications and lets customers use them over the internet. SaaS is a very useful pay-as-you-use model. Examples: Salesforce, NetSuite This is a model in which a third-party provider application development platform and services built on its own infrastructure. Again these tools are made available to customers over the internet. Examples: Google App Engine, AWS Lambda In IaaS, a third-party application provides servers, storage, compute resources, and so on. And then makes it available for customers for their utilization. Customers can use IaaS to build their own PaaS and SaaS service for their customers. Examples: Google Cloud Compute, Amazon S3 How PaaS, IaaS, and SaaS are separated at a service level In this section, we are going to learn about how we can separate IaaS, PaaS, and SaaS at the service level: As the previous diagram suggests, we have the first column as OPS, which stands for operations. That means the bare minimum requirement for any typical server. When we are going with a server to buy, we should consider the preceding features before buying. It includes Application, Data, Runtime, Framework, Operating System, Server, Disk, and Network Stack. When we move to the cloud and decide to go with IaaS—in this case, we are not bothered about the server, disk, and network stack. Thus, the headache of handling hardware part is no more with us. That's why it is called Infrastructure as a Service. Now if we think of PaaS, we should not be worried about runtime, framework, and operating system along with the components in IaaS. Things that we need to focus on are only application and data. And the last deployment model is SaaS—Software as a Service. In this model, we are not concerned about literally anything. The only thing that we need to work on is the code and just a look at the bill. It's that simple! If you found the above excerpt useful, make sure to check out the book 'Cloud Analytics with Google Cloud Platform' for more of such interesting insights into Google Cloud Platform. Read more Top 5 cloud security threats to look out for in 2018 Is cloud mining profitable? Why AWS is the prefered cloud platform for developers working with big data?
Read more
  • 0
  • 0
  • 14956

article-image-the-evolution-cybercrime
Packt Editorial Staff
29 Mar 2018
4 min read
Save for later

The evolution of cybercrime

Packt Editorial Staff
29 Mar 2018
4 min read
A history of cybercrime As computer systems have now become integral to the daily functioning of businesses, organizations, governments, and individuals we have learned to put a tremendous amount of trust in these systems. As a result, we have placed incredibly important and valuable information on them. History has shown, that things of value will always be a target for a criminal. Cybercrime is no different. As people flood their personal computers, phones, and so on with valuable data, they put a target on that information for the criminal to aim for, in order to gain some form of profit from the activity. In the past, in order for a criminal to gain access to an individual's valuables, they would have to conduct a robbery in some shape or form. In the case of data theft, the criminal would need to break into a building, sifting through files looking for the information of greatest value and profit. In our modern world, the criminal can attack their victims from a distance, and due to the nature of the internet, these acts would most likely never meet retribution. Cybercrime in the 70s and 80s In the 70s, we saw criminals taking advantage of the tone system used on phone networks. The attack was called phreaking, where the attacker reverse-engineered the tones used by the telephone companies to make long distance calls. In 1988, the first computer worm made its debut on the internet and caused a great deal of destruction to organizations. This first worm was called the Morris worm, after its creator Robert Morris. While this worm was not originally intended to be malicious it still caused a great deal of damage. The U.S. Government Accountability Office in 1980 estimated that the damage could have been as high as $10,000,000.00. 1989 brought us the first known ransomware attack, which targeted the healthcare industry. Ransomware is a type of malicious software that locks a user's data, until a small ransom is paid, which will result in the issuance of a cryptographic unlock key. In this attack, an evolutionary biologist named Joseph Popp distributed 20,000 floppy disks across 90 countries, and claimed the disk contained software that could be used to analyze an individual's risk factors for contracting the AIDS virus. The disk however contained a malware program that when executed, displayed a message requiring the user to pay for a software license. Ransomware attacks have evolved greatly over the years with the healthcare field still being a very large target. The birth of the web and a new dawn for cybercrime The 90s brought the web browser and email to the masses, which meant new tools for cybercriminals to exploit. This allowed the cybercriminal to greatly expand their reach. Up till this time, the cybercriminal needed to initiate a physical transaction, such as providing a floppy disk. Now cybercriminals could transmit virus code over the internet in these new, highly vulnerable web browsers. Cybercriminals took what they had learned previously and modified it to operate over the internet, with devastating results. Cybercriminals were also able to reach out and con people from a distance with phishing attacks. No longer was it necessary to engage with individuals directly. You could attempt to trick millions of users simultaneously. Even if only a small percentage of people took the bait you stood to make a lot of money as a cybercriminal. The 2000s brought us social media and saw the rise of identity theft. A bullseye was painted for cybercriminals with the creation of databases containing millions of users' personal identifiable information (PII), making identity theft the new financial piggy bank for criminal organizations around the world. This information coupled with a lack of cybersecurity awareness from the general public allowed cybercriminals to commit all types of financial fraud such as opening bank accounts and credit cards in the name of others. Cybercrime in a fast-paced technology landscape Today we see that cybercriminal activity has only gotten worse. As computer systems have gotten faster and more complex we see that the cybercriminal has become more sophisticated and harder to catch. Today we have botnets, which are a network of private computers that are infected with malicious software and allow the criminal element to control millions of infected computer systems across the globe. These botnets allow the criminal element to overload organizational networks and hide the origin of the criminals: We see constant ransomware attacks across all sectors of the economy People are constantly on the lookout for identity theft and financial fraud Continuous news reports regarding the latest point of sale attack against major retailers and hospitality organizations This is an extract from Information Security Handbook by Darren Death. Follow Darren on Twitter: @DarrenDeath. 
Read more
  • 0
  • 2
  • 14923