
Tech Guides


Predictive Analytics with AWS: A quick look at Amazon ML

Natasha Mathur
09 Aug 2018
9 min read
As artificial intelligence and big data have become a ubiquitous part of our everyday lives, cloud-based machine learning services have grown into a rising billion-dollar industry. Among the several services currently available in the market, Amazon Machine Learning stands out for its simplicity. In this article, we will look at Amazon Machine Learning, MLaaS, and other related concepts. This article is an excerpt taken from the book 'Effective Amazon Machine Learning' written by Alexis Perrier.

Machine Learning as a Service

Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics. Launched in April 2015 at the AWS Summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure, Google Prediction, IBM Watson, PredictionIO, BigML, and many others. These online machine learning services form an offer commonly referred to as Machine Learning as a Service, or MLaaS, following the denomination pattern of other cloud-based services such as SaaS, PaaS, and IaaS (Software, Platform, and Infrastructure as a Service, respectively).

Studies show that MLaaS is a potentially big business trend. ABI Research, a business intelligence consultancy, estimates that machine learning-based data analytics tools and services revenues will hit nearly $20 billion in 2021 as MLaaS services take off, as outlined in its business report. Eugenio Pasqua, a Research Analyst at ABI Research, said the following:

"The emergence of the Machine-Learning-as-a-Service (MLaaS) model is good news for the market, as it cuts down the complexity and time required to implement machine learning and thus opens the doors to an increase in its adoption level, especially in the small-to-medium business sector."

The increased accessibility is a direct result of using an API-based infrastructure to build machine-learning models instead of developing applications from scratch. Offering efficient predictive analytics models without the need to code, host, and maintain complex code bases lowers the bar and makes ML available to smaller businesses and institutions. Amazon ML takes this democratization approach further than the other actors in the field by significantly simplifying the predictive analytics process and its implementation. This simplification revolves around four design decisions that are embedded in the platform:

- A limited set of tasks: binary classification, multiclass classification, and regression
- A single linear algorithm
- A limited choice of metrics to assess the quality of the prediction
- A simple set of tuning parameters for the underlying predictive algorithm

That somewhat constrained environment is simple enough while addressing most predictive analytics problems relevant to business. It can be leveraged across an array of different industries and use cases. Let's see how!
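To get a feel for that simplicity, here is a minimal sketch of the datasource-model-prediction workflow driven from Python with boto3. The IDs, bucket path, schema file, and record fields are hypothetical; this illustrates the general flow rather than a production recipe.

```python
import boto3

# The legacy Amazon ML service is exposed through the 'machinelearning' client
ml = boto3.client("machinelearning")

# 1. Datasource: point Amazon ML at training data already sitting in S3
ml.create_data_source_from_s3(
    DataSourceId="ds-churn-01",                       # hypothetical ID
    DataSpec={
        "DataLocationS3": "s3://my-bucket/churn.csv",
        "DataSchema": open("churn.schema.json").read(),
    },
    ComputeStatistics=True,
)

# 2. Model: a binary classifier trained with the service's linear algorithm
ml.create_ml_model(
    MLModelId="ml-churn-01",
    MLModelType="BINARY",
    TrainingDataSourceId="ds-churn-01",
)

# 3. Prediction: score a single record against a real-time endpoint
result = ml.predict(
    MLModelId="ml-churn-01",
    Record={"plan": "premium", "tenure_months": "18"},
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(result["Prediction"])
```

Everything else - validation splits, metrics, tuning - happens inside the service, which is precisely the constrained environment described above.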
Leveraging full AWS integration

The AWS data ecosystem of pipelines, storage, environments, and Artificial Intelligence (AI) is also a strong argument in favor of choosing Amazon ML as a business platform for predictive analytics needs. Although Amazon ML is simple, the service evolves to greater complexity and more powerful features once it is integrated into a larger structure of AWS data-related services. AWS is already a major factor in cloud computing. Here's what an excerpt from The Economist, August 2016 (http://www.economist.com/news/business/21705849-how-open-source-software-and-cloud-computing-have-set-up-it-industry), has to say about AWS:

"AWS shows no sign of slowing its progress towards full dominance of cloud computing's wide skies. It has ten times as much computing capacity as the next 14 cloud providers combined, according to Gartner, a consulting firm. AWS's sales in the past quarter were about three times the size of its closest competitor, Microsoft's Azure."

This gives an edge to Amazon ML, as many companies that use cloud services are likely to already be using AWS. Adding simple and efficient machine learning tools to the product mix anticipates the rise of predictive analytics features as a standard component of web services. Seamless integration with other AWS services is a strong argument in favor of using Amazon ML despite its apparent simplicity. An AWS January 2016 white paper titled Big Data Analytics Options on AWS (http://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) includes a case-study architecture for sentiment analysis on social media, showing how Amazon ML can be part of a more complex architecture of AWS services.

Comparing performances in Amazon ML services

Keeping systems and applications simple is always difficult, but often worth it for the business. Examples abound of overloaded UIs dragging down the user experience, while products with simple, elegant interfaces and minimal features enjoy widespread popularity. The Keep It Simple mantra is even more difficult to adhere to in a context such as predictive analytics, where performance is key. This is the challenge Amazon took on with its Amazon ML service. A typical predictive analytics project is a sequence of complex operations: getting the data, cleaning the data, selecting, optimizing and validating a model, and finally making predictions. In the scripting approach, data scientists develop codebases using machine learning libraries such as Python's scikit-learn or R packages to handle all these steps, from data gathering to predictions in production. Much as a developer breaks the necessary steps into modules for maintainability and testability, Amazon ML breaks a predictive analytics project down into distinct entities: datasource, model, evaluation, and predictions. It's the simplicity of each of these steps that makes the service so effective for implementing successful predictive analytics projects.

Engineering data versus model variety

Having a large choice of algorithms for your predictions is always a good thing, but at the end of the day, domain knowledge and the ability to extract meaningful features from clean data is often what wins the game. Kaggle is a well-known platform for predictive analytics competitions, where the best data scientists across the world compete to make predictions on complex datasets. In these competitions, gaining a few decimals on your prediction score is what makes the difference between earning the prize and being just an extra line on the public leaderboard among thousands of other competitors. One thing Kagglers quickly learn is that choosing and tuning the model is only half the battle. Feature extraction, or how to extract relevant predictors from the dataset, is often the key to winning the competition.
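As a concrete, if entirely hypothetical, illustration of what such feature extraction can look like, the snippet below derives new predictors from a raw timestamp and a categorical column using pandas; the column names and data are invented.

```python
import pandas as pd

# Raw data as it might arrive: a timestamp and a categorical plan type
df = pd.DataFrame({
    "signup_time": pd.to_datetime(["2018-01-03 09:15", "2018-01-06 23:40"]),
    "plan": ["basic", "premium"],
    "churned": [0, 1],
})

# Extract predictors that the raw columns only hint at
df["signup_hour"] = df["signup_time"].dt.hour                       # time-of-day effect
df["signup_weekend"] = (df["signup_time"].dt.dayofweek >= 5).astype(int)

# One-hot encode the categorical column so a linear model can use it
features = pd.get_dummies(df.drop(columns=["signup_time", "churned"]),
                          columns=["plan"])
print(features)
```

A handful of transformations like these, grounded in domain knowledge, often moves a score more than swapping in a fancier algorithm.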
In real life, when working on business-related problems, the quality of the data processing phase and the ability to extract a meaningful signal out of raw data is the most important and time-consuming part of building an effective predictive model. It is well known that "data preparation accounts for about 80% of the work of data scientists" (http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/). Model selection and algorithm optimization remain an important part of the work, but they are often not the deciding factor where implementation is concerned. A solid, robust implementation that is easy to maintain and connects to your ecosystem seamlessly is often preferred to an overly complex model developed and coded in-house, especially when the scripted model only produces small gains compared to a service-based implementation.

Amazon's expertise and the gradient descent algorithm

Amazon has been using machine learning on the retail side of its business and has built serious expertise in predictive analytics. This expertise translates into the choice of algorithm powering the Amazon ML service. The Stochastic Gradient Descent (SGD) algorithm powers Amazon ML's linear models and is ultimately responsible for the accuracy of the predictions generated by the service. The SGD algorithm is one of the most robust, resilient, and optimized algorithms. It has been used since the 1960s with great success in many diverse environments, from signal processing to deep learning, and for a wide variety of problems. SGD has also given rise to many highly efficient variants adapted to a wide variety of data contexts. We will come back to this important algorithm in a later chapter; suffice it to say at this point that SGD is the Swiss Army knife of predictive analytics algorithms.

Several benchmarks and tests of the Amazon ML service can be found across the web (Amazon, Google, and Azure: https://blog.onliquid.com/machine-learning-services-2/ and Amazon versus scikit-learn: http://lenguyenthedat.com/minimal-data-science-2-avazu/). Overall, results show that Amazon ML's performance is on a par not only with other MLaaS platforms but also with scripted solutions based on popular machine learning libraries such as scikit-learn. For a given problem in a specific context, with an available dataset and a particular choice of scoring metric, it is probably possible to code a predictive model using an adequate library and obtain better performance than Amazon ML. But what Amazon ML offers is stability, an absence of coding, and a very solid benchmark record, as well as seamless integration with the Amazon Web Services ecosystem that already powers a large portion of the Internet.
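Amazon does not publish the service's exact implementation, but for a rough feel of what a linear model trained by SGD looks like in a scripted workflow, here is a short scikit-learn sketch on synthetic data; every name and number in it is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# A synthetic binary classification problem standing in for real business data
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear model fit by stochastic gradient descent, a few samples at a time
model = SGDClassifier(loss="log_loss", alpha=1e-4, max_iter=1000, random_state=0)
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
```

The same handful of knobs visible here - loss, regularization strength, iteration budget - mirrors the small set of tuning parameters Amazon ML exposes.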
Amazon ML service pricing strategy

As with other MLaaS providers and AWS services, Amazon ML only charges for what you consume. The cost is broken down into the following:

- An hourly rate for the computing time used to build predictive models
- A prediction fee per thousand prediction samples
- For real-time (streaming) predictions, a fee based on the memory allocated upfront for the model

The computational time increases as a function of the following:

- The complexity of the model
- The size of the input data
- The number of attributes
- The number and types of transformations applied

At the time of writing, these charges are as follows:

- $0.42 per hour for data analysis and model-building fees
- $0.10 per 1,000 predictions for batch predictions
- $0.0001 per prediction for real-time predictions
- $0.001 per hour for each 10 MB of memory provisioned for your model

These prices do not include fees related to data storage (S3, Redshift, or RDS), which are charged separately. During the creation of your model, Amazon ML gives you a cost estimate based on the data source you have selected. Note that the Amazon ML service is not part of the AWS free tier, a 12-month offer that makes certain AWS services free under certain conditions.

To summarize, we presented a simple introduction to the Amazon ML service. Amazon ML is built on solid ground, with a simple yet very efficient algorithm driving its predictions. If you found this post useful, be sure to check out the book 'Effective Amazon Machine Learning' to learn about predictive analytics and other concepts in AWS machine learning.

Integrate applications with AWS services: Amazon DynamoDB & Amazon Kinesis [Tutorial]
AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers
AWS Elastic Load Balancing: support added for Redirects and Fixed Responses in Application Load Balancer


How will AI impact job roles in Cybersecurity?

Melisha Dsouza
25 Sep 2018
7 min read
"If you want a job for the next few years, work in technology. If you want a job for life, work in cybersecurity." -Aaron Levie, chief executive of cloud storage vendor Box The field of cybersecurity will soon face some dire, but somewhat conflicting, views on the availability of qualified cybersecurity professionals over the next four or five years. Global Information Security Workforce Study from the Center for Cyber Safety and Education, predicts a shortfall of 1.8 million cybersecurity workers by 2022. The cybersecurity workforce gap will hit 1.8 million by 2022. On the flipside,  Cybersecurity Jobs Report, created by the editors of Cybersecurity Ventures highlight that there will be 3.5 million cybersecurity job openings by 2021. Cybercrime will feature more than triple the number of job openings over the next 5 years. Living in the midst of a digital revolution caused by AI- we can safely say that AI will be the solution to the dilemma of “what will become of human jobs in cybersecurity?”. Tech enthusiasts believe that we will see a new generation of robots that can work alongside humans and complement or, maybe replace, them in ways not envisioned previously. AI will not make jobs easier to accomplish, but also bring about new job roles for the masses. Let’s find out how. Will AI destroy or create jobs in Cybersecurity? AI-driven systems have started to replace humans in numerous industries. However, that doesn’t appear to be the case in cybersecurity. While automation can sometimes reduce operational errors and make it easier to scale tasks, using AI to spot cyberattacks isn’t completely practical because such systems yield a large number of false positives. It lacks the contextual awareness which can lead to attacks being wrongly identified or missed completely. As anyone who’s ever tried to automate something knows, automated machines aren’t great at dealing with exceptions that fall outside of the parameters to which they have been programmed. Eventually, human expertise is needed to analyze potential risks or breaches and make critical decisions. It’s also worth noting that completely relying on artificial intelligence to manage security only leads to more vulnerabilities - attacks could, for example, exploit the machine element in automation. Automation can support cybersecurity professionals - but shouldn’t replace them Supported by the right tools, humans can do more. They can focus on critical tasks where an automated machine or algorithm is inappropriate. In the context of cybersecurity, artificial intelligence can do much of the 'legwork' at scale in processing and analyzing data, to help inform human decision making. Ultimately, this isn’t a zero-sum game -  humans and AI can work hand in hand to great effects. AI2 Take, for instance, the project led by the experts at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Lab. AI2 (Artificial Intelligence + Analyst Intuition) is a system that combines the capabilities of AI with the intelligence of human analysts to create an adaptive cybersecurity solution that improves over time. The system uses the PatternEx machine learning platform, and combs through data looking for meaningful, predefined patterns. For instance, a sudden spike in postback events on a webpage might indicate an attempt at staging a SQL injection attack. The top results are then presented to a human analyst, who will separate any false positives and flags legitimate threats. 
The information is then fed into a virtual analyst that uses the human input to learn and improve the system's detection rates. On future iterations, a more refined dataset is presented to the human analyst, who goes through the results and once again "teaches" the system to make better decisions. AI2 is a perfect example of how man and machine can complement each other's strengths to create something even more effective. It's worth remembering that in any company that uses AI for cybersecurity, automated tools and techniques require significant algorithm training and data markup.

New cybersecurity job roles and the evolution of the job market

The bottom line of this discussion is that AI will not destroy cybersecurity jobs, but it will drastically change them. Today, the primary focus of many cybersecurity jobs is going through the hundreds of security tools available and determining which tools and techniques are most appropriate for the organization's needs. Of course, as systems move to the cloud, many of these decisions will already be made, because cloud providers will offer built-in security solutions. This means that the number of companies needing a full staff of cybersecurity experts will be drastically reduced. Instead, companies will need more individuals who understand issues like the potential business impact and risk of different projects and architectural decisions. This demands a very different set of skills and knowledge compared to the typical current cybersecurity role - it is less directly technical and will require more integration with other key business decision makers. AI can provide assistance, but it can't offer easy answers.

Humans and AI working together

Companies concerned with cybersecurity legal compliance and effective real-world solutions should note that cybersecurity and information technology professionals are best suited for tasks such as risk analysis, policy formulation, and cyberattack response. Human intervention can help AI systems learn and evolve. Take the example of the Spain-based antivirus company Panda Security, which used to have a number of people reverse-engineering malicious code and writing signatures. Today, to keep pace with the overflowing amounts of data, the company would need hundreds of thousands of engineers to deal with malicious code manually. Enter AI, and only a small team of engineers is required to look at more than 200,000 new malware samples per day.

Is AI going to steal cybersecurity engineers' jobs?

So what about the former employees who used to perform this job? Have they been laid off? The answer is a straight no - but they will need to upgrade their skill set. In the world of cybersecurity, AI is going to create new jobs, as it throws up new problems to be analyzed and solved. It's going to create what are being called "new collar" jobs - something IBM's hiring strategy has already taken into account. Once graduates enter the IBM workforce, AI enters the equation to help them get a fast start. Even junior analysts can investigate new malware infecting employees' mobile phones: AI quickly researches the malware, identifies the characteristics reported by others, and provides a recommended course of action. This relieves analysts of the manual work of going through reams of data and lines of code - in theory, it should make their job more interesting and more fun.
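The AI2-style loop described above - unsupervised scoring surfaces suspicious events, an analyst labels them, and a supervised model learns from that feedback - can be sketched in a few lines of scikit-learn. This is a toy illustration on synthetic data, not the actual PatternEx platform; the ground-truth array simply stands in for the analyst's judgment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 4))        # baseline traffic features
attacks = rng.normal(4, 1, size=(20, 4))         # rare anomalous events
events = np.vstack([normal, attacks])
truth = np.array([0] * 1000 + [1] * 20)          # plays the analyst's role here

# Unsupervised pass: score every event, surface the most anomalous ones
detector = IsolationForest(random_state=0).fit(events)
scores = detector.score_samples(events)          # lower score = more anomalous
top_k = np.argsort(scores)[:50]                  # "present top results to the analyst"

# "Analyst" labels the surfaced events, separating false positives from threats
labels = truth[top_k]

# Supervised pass: learn from the feedback to cut false positives next time
clf = LogisticRegression().fit(events[top_k], labels)
print("flagged as threats on next pass:", int(clf.predict(events).sum()))
```

Each iteration of the loop gives the supervised model a slightly better labeled dataset, which is exactly the improvement-over-time behavior AI2 reports.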
Artificial intelligence and the human workforce, then, aren't in conflict when it comes to cybersecurity. Instead, they can complement each other to create new job opportunities that will test the skills of the upcoming generation and lead experienced professionals in new, and maybe more interesting, directions. It will be interesting to see how the cybersecurity workforce makes use of AI in the future.

Intelligent Edge Analytics: 7 ways machine learning is driving edge computing adoption in 2018
15 million jobs in Britain at stake with AI robots set to replace humans at workforce
5 ways artificial intelligence is upgrading software engineering


Understanding security features in the Google Cloud Platform (GCP)

Vincy Davis
27 Jul 2019
10 min read
Google's long experience and success in protecting itself against cyberattacks plays to our advantage as customers of the Google Cloud Platform (GCP). From years of warding off security threats, Google is well aware of the security implications of the cloud model. Thus, it provides a well-secured structure for its operational activities, data centers, customer data, organizational structure, hiring process, and user support. Google uses a global-scale infrastructure to provide security to commercial services, such as Gmail, Google Search, and Google Photos, and to enterprise services, such as GCP and G Suite.

This article is an excerpt taken from the book Google Cloud Platform for Architects, written by Vitthal Srinivasan, Janani Ravi, et al. In this book, you will learn about Google Cloud Platform (GCP) and how to manage robust, highly available, and dynamic solutions to drive business objectives. This article gives an insight into the security features in the Google Cloud Platform, the tools that GCP provides for users' benefit, as well as some best practices and design choices for security.

Security features at Google and on the GCP

Let's start by discussing what we get directly by virtue of using the GCP. These are security protections that we would not be able to engineer for ourselves. Let's go through some of the many layers of security provided by the GCP.

Datacenter physical security: Only a small fraction of Google employees ever get to visit a GCP data center. Those data centers - the zones we have been talking so much about - would probably seem straight out of a Bond film to those who did: security lasers, biometric detectors, alarms, cameras, and all of that cloak-and-dagger stuff.

Custom hardware and trusted booting: A specific form of security attack, the privileged access attack, is on the rise. These attacks involve malicious code running from the least likely spots you'd expect: the OS image, hypervisor, or boot loader. The only way to really protect against these is to design and build every single element in-house. Google has done that, including hardware, a firmware stack, curated OS images, and a hardened hypervisor. Google data centers are populated with thousands of servers connected to a local network. Google selects and validates building components from vendors and designs custom secure server boards and networking devices for server machines. Google has cryptographic signatures on all low-level components, such as the BIOS, bootloader, kernel, and base OS, to validate that the correct software stack is booting up.

Data disposal: The detritus of the persistent disks and other storage devices that we use is also cleaned thoroughly by Google. This data destruction process involves several steps: an authorized individual wipes the disk clean using a logical wipe, then a different authorized individual inspects the wiped disk. The results of the erasure are stored and logged, and the erased drive is released into inventory for reuse. If a disk is damaged and cannot be wiped clean, it is stored securely and not reused, and such devices are periodically destroyed. Each facility where data disposal takes place is audited once a week.

Data encryption: By default, GCP always encrypts all customer data at rest as well as in motion. This encryption is automatic, and it requires no action on the user's part. Persistent disks, for instance, are already encrypted using AES-256, and the keys themselves are encrypted with master keys.
All of this key management and rotation is managed by Google. In addition to this default encryption, a couple of other encryption options exist as well; more on those later in this article.

Secure service deployment: Google's security documentation will often refer to secure service deployment, and it is important to understand that the term service has a specific meaning here: a service is the application binary that a developer writes and runs on Google infrastructure. This secure service deployment is based on three attributes:

Identity: Each service running on Google infrastructure has an associated service account identity. A service has to submit cryptographic credentials provided to it to prove its identity when making or receiving remote procedure calls (RPCs) to other services. Clients use these identities to make sure they are connecting to the intended server, and servers use them to restrict access to data and methods to specific clients.

Integrity: Google uses a cryptographic authentication and authorization technique at the application layer to provide strong access control at the abstraction level for interservice communication. Google has ingress and egress filtering facilities at various points in its network to avoid IP spoofing. With this approach, Google is able to maximize its network's performance and availability.

Isolation: Google has an effective sandbox technique to isolate services running on the same machine. This includes Linux user separation, language- and kernel-based sandboxes, and hardware virtualization. Google also secures the operation of sensitive services, such as cluster orchestration in GKE, on exclusively dedicated machines.

Secure interservice communication: The term interservice communication refers to GCP's resources and services talking to each other. To enable this, the owners of services have individual whitelists of services that can access them. Using these, the owner of a service can also allow specific IAM identities to connect to the services they manage. Apart from that, the Google engineers on the backend who are responsible for the smooth, downtime-free running of the services are also given special identities to access the services (to manage them, not to modify user-input data). Google encrypts interservice communication by encapsulating application layer protocols in RPC mechanisms, isolating the application layer and removing any kind of dependency on network security.

Using Google Front End: Whenever we want to expose a service using GCP, the TLS certificate management, service registration, and DNS are managed by Google itself. This facility is called the Google Front End (GFE) service. For example, a simple file of Python code can be hosted as an application on App Engine, and that application will have its own IP, DNS name, and so on.

In-built DDoS protections: Distributed Denial-of-Service attacks are very well studied, and precautions against such attacks are already built into many GCP services, notably in networking and load balancing. Load balancers can actually be thought of as hardened bastion hosts that serve as lightning rods to attract attacks, and so are suitably hardened by Google to ensure that they can withstand those attacks. HTTP(S) and SSL proxy load balancers, in particular, can protect your backend instances from several threats, including SYN floods, port exhaustion, and IP fragment floods.
Insider risk and intrusion detection: Google constantly monitors the activities of all devices in its infrastructure for suspicious activity. To secure employees' accounts, Google has replaced phishable OTP second factors with U2F-compatible security keys. Google also monitors the client devices that its employees use to operate the infrastructure, and periodically checks that the OS images on these devices carry up-to-date security patches. Google has a special mechanism for granting access privileges, named application-level access management control, which exposes internal applications only to specific users coming from correctly managed devices and from expected networks and geographic locations. Google has a very strict and secure way of managing its administrative access privileges, with a rigorous process for monitoring employee activities and predefined limits on administrative access for employees.

Google-provided tools and options for security

As we've just seen, the platform already does a lot for us, but we could still leave ourselves vulnerable to attack if we don't design our cloud infrastructure carefully. To begin with, let's understand a few facilities provided by the platform for our benefit.

Data encryption options: We have already discussed Google's default encryption; this encrypts pretty much everything and requires no user action. So, for instance, all persistent disks are encrypted with AES-256 keys that are automatically created, rotated, and themselves encrypted by Google. In addition to default encryption, there are a couple of other encryption options available to users.

Customer-managed encryption keys (CMEK) using Cloud KMS: This option involves the user taking control of the keys that are used, but still storing those keys securely on the GCP, using the Key Management Service. The user is now responsible for managing the keys - that is, for creating, rotating, and destroying them. At the time of writing, the only GCP service that supports CMEK is BigQuery, with support for Cloud Storage in beta.

Customer-supplied encryption keys (CSEK): Here, the user specifies which keys are to be used, but those keys never leave the user's premises. To be precise, the keys are sent to Google as part of API service calls, but Google only uses them in memory and never persists them on the cloud. CSEK is supported by two important GCP services: Cloud Storage buckets and persistent disks on GCE VMs. There is an important caveat here, though: if you lose your key after having encrypted some GCP data with it, you are entirely out of luck. There will be no way for Google to recover that data.
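For a feel of what these options look like in code, here is a minimal sketch using the google-cloud-storage Python client library; the bucket, project, key ring, and file names are all hypothetical, and in practice a CSEK key would be one you manage and persist yourself, not a throwaway value.

```python
import os
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-example-bucket")  # hypothetical bucket name

# CSEK: supply your own AES-256 key; Google uses it in memory, never stores it.
# Lose this key and the data is unrecoverable - exactly the caveat above.
csek = os.urandom(32)  # stand-in for a key you manage on-premises
blob = bucket.blob("secret-report.csv", encryption_key=csek)
blob.upload_from_filename("report.csv")

# Reading the object back requires presenting the same key
blob = bucket.blob("secret-report.csv", encryption_key=csek)
data = blob.download_as_bytes()

# CMEK: point the bucket at a key you manage in Cloud KMS instead
bucket.default_kms_key_name = (
    "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key"
)
bucket.patch()
```

The difference in responsibility is visible in the code: with CSEK you hold the key bytes yourself, while with CMEK you reference a key that lives in, and is rotated through, Cloud KMS.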
Cloud Security Scanner: Cloud Security Scanner is a GCP-provided security scanner for common vulnerabilities. It has long been available for App Engine applications, but is now also available in alpha for Compute Engine VMs. This handy utility will automatically scan for and detect the following four common vulnerabilities:

- Cross-site scripting (XSS)
- Flash injection
- Mixed content (HTTP in HTTPS)
- The use of outdated/insecure libraries

Like most security scanners, it automatically crawls an application, follows links, and tries out as many different types of user input and event handlers as possible.

Some security best practices

Here is a list of design choices that you could exercise to cope with security threats such as DDoS attacks:

- Use hardened bastion hosts such as load balancers (particularly HTTP(S) and SSL proxy load balancers).
- Make good use of the firewall rules in your VPC network. Ensure that incoming traffic from unknown sources, or on unexpected ports or protocols, is not allowed through.
- Use managed services such as Dataflow and Cloud Functions wherever possible; these are serverless and so present a smaller attack surface.
- If your application lends itself to App Engine, note that it has several security benefits over GCE or GKE, and that it can also autoscale quickly, damping the impact of a DDoS attack.
- If you are using GCE VMs, consider using API rate limits to ensure that the number of requests to a given VM does not increase in an uncontrolled fashion.
- Use NAT gateways and avoid public IPs wherever possible to ensure network isolation.
- Use Google CDN as a way to offload incoming requests for static content. In the event of a storm of incoming user requests, the CDN servers will be on the edge of the network, and traffic into the core infrastructure will be reduced.

Summary

In this article, you learned that the GCP benefits from Google's long experience countering cyber-threats and security attacks targeted at other Google services, such as Google Search, YouTube, and Gmail. There are several built-in security features that already protect users of the GCP from threats that might not even be recognized as existing in an on-premise world. In addition to these built-in protections, all GCP users have various tools at their disposal to scan for security threats and protect their data. To learn more about the Google Cloud Platform (GCP), head over to the book, Google Cloud Platform for Architects.

Ansible 2 for automating networking tasks on Google Cloud Platform [Tutorial]
Build Hadoop clusters using Google Cloud Platform [Tutorial]
Machine learning APIs for Google Cloud Platform


Looking at the different types of Lookup cache

Savia Lobo
20 Nov 2017
6 min read
[box type="note" align="" class="" width=""]The following is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. We walk through the various types of lookup cache based on how a cache is defined in this article.[/box] Cache is the temporary memory that is created when you execute a process. It is created automatically when a process starts and is deleted automatically once the process is complete. The amount of cache memory is decided based on the property you define at the transformation level or session level. You usually set the property as default, so as required, it can increase the size of the cache. If the size required for caching the data is more than the cache size defined, the process fails with the overflow error. There are different types of caches available. Building the Lookup Cache - Sequential or Concurrent You can define the session property to create the cache either sequentially or concurrently. Sequential cache When you select to create the cache sequentially, Integration Service caches the data in a row-wise manner as the records enter the lookup transformation. When the first record enters the lookup transformation, lookup cache gets created and stores the matching record from the lookup table or file in the cache. This way, the cache stores only the matching data. It helps in saving the cache space by not storing unnecessary data. Concurrent cache When you select to create cache concurrently, Integration service does not wait for the data to flow from the source; it first caches complete data. Once the caching is complete, it allows the data to flow from the source. When you select a concurrent cache, the performance enhances as compared to sequential cache since the scanning happens internally using the data stored in the cache. Persistent cache - the permanent one You can configure the cache to permanently save the data. By default, the cache is created as non-persistent, that is, the cache will be deleted once the session run is complete. If the lookup table or file does not change across the session runs, you can use the existing persistent cache. Suppose you have a process that is scheduled to run every day and you are using lookup transformation to lookup on the reference table that which is not supposed to change for six months. When you use non-persistent cache every day, the same data will be stored in the cache; this will waste time and space every day. If you select to create a persistent cache, the integration service makes the cache permanent in the form of a file in the $PMCacheDir location. So, you save the time every day, creating and deleting the cache memory. When the data in the lookup table changes, you need to rebuild the cache. You can define the condition in the session task to rebuild the cache by overwriting the existing cache. To rebuild the cache, you need to check the rebuild option on the session property. Sharing the cache - named or unnamed You can enhance the performance and save the cache memory by sharing the cache if there are multiple lookup transformations used in a mapping. If you have the same structure for both the lookup transformations, sharing the cache will help in enhancing the performance by creating the cache only once. This way, we avoid creating the cache multiple times, which in turn, enhances the performance. You can share the cache--either named or unnamed Sharing unnamed cache If you have multiple lookup transformations used in a single mapping, you can share the unnamed cache. 
Since the lookup transformations are present in the same mapping, naming the cache is not mandatory. The Integration Service creates the cache while processing the first record in the first lookup transformation and shares it with the other lookups in the mapping.

Sharing a named cache: You can share a named cache between multiple lookup transformations in the same mapping or in another mapping. Since the cache is named, you can assign the same cache by name in the other mapping. When you process the first mapping with a lookup transformation, the cache is saved in the defined cache directory under the defined cache file name. When you process the second mapping, it searches the same location for that cache file and uses the data. If the Integration Service does not find the mentioned cache file, it creates a new cache. If you run multiple sessions simultaneously that use the same cache file, the Integration Service processes both sessions successfully only if the lookup transformations are configured to read only from the cache. If both lookup transformations try to update the cache file, or one tries to read it while the other tries to update it, the session fails because of the processing conflict. Sharing the cache enhances performance by reusing a cache that has already been created; this saves processing time and repository space by not storing the same data multiple times for different lookup transformations.

Modifying the cache - static or dynamic

When you create a cache, you can configure it to be static or dynamic.

Static cache: A cache is said to be static if it does not change with the changes happening in the lookup table; the static cache is not synchronized with the lookup table. By default, the Integration Service creates a static cache. The lookup cache is created as soon as the first record enters the lookup transformation, and the Integration Service does not update the cache while processing the data.

Dynamic cache: A cache is said to be dynamic if it changes with the changes happening in the lookup table; the dynamic cache is synchronized with the lookup table. You can choose to make the cache dynamic from the lookup transformation properties. The lookup cache is created as soon as the first record enters the lookup transformation, and the Integration Service keeps updating the cache while processing the data. The Integration Service marks a new row inserted into the dynamic cache as an insert, marks a record that is updated as an update, and marks every record that doesn't change as unchanged. You use the dynamic cache when processing slowly changing dimension tables: for every record inserted into the target, the record is inserted into the cache, and for every record updated in the target, the record is updated in the cache. A similar process happens for deleted and rejected records.
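PowerCenter builds these caches for you, but the behavioral difference between static and dynamic is easy to see in a few lines of Python. This is purely a conceptual sketch with invented data, not how the Integration Service is implemented.

```python
# Conceptual sketch: static vs. dynamic lookup cache behavior.
# The "lookup table" is loaded once; rows then flow through the transformation.

lookup_table = {101: "Alice", 102: "Bob"}           # reference data
incoming_rows = [(101, "Alice"), (103, "Carol"), (103, "Carol")]

# Static cache: a snapshot taken up front, never updated during the run
static_cache = dict(lookup_table)
for key, value in incoming_rows:
    found = static_cache.get(key)                   # key 103 misses every time

# Dynamic cache: kept in sync as rows are processed (think SCD loads)
dynamic_cache = dict(lookup_table)
for key, value in incoming_rows:
    if key in dynamic_cache:
        row_flag = "update" if dynamic_cache[key] != value else "unchanged"
        dynamic_cache[key] = value
    else:
        row_flag = "insert"                         # first 103 row marked insert
        dynamic_cache[key] = value                  # second 103 row now hits
```

The insert/update/unchanged flags in the dynamic branch mirror the row markers the Integration Service attaches when it synchronizes the cache with the lookup table.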


What is security chaos engineering and why is it important?

Amrata Joshi
21 Nov 2018
6 min read
Chaos engineering is, at its root, all about stress testing software systems in order to minimize downtime and maximize resiliency. Security chaos engineering takes these principles forward into the domain of security. The central argument of security chaos engineering is that current security practices aren't fit for purpose.

"Despite spending more on security, data breaches are continuously getting bigger and more frequent across all industries," write Aaron Rinehart and Charles Nwatu in a post published on opensource.com in January 2018. "We hypothesize that a large portion of data breaches are caused not by sophisticated nation-state actors or hacktivists, but rather simple things rooted in human error and system glitches."

The rhetorical question they're asking is clear: should we wait for an incident to happen in order to work on it? Or should we be looking at ways to prevent incidents from happening at all?

Why do we need security chaos engineering today?

There are two problems that make security chaos engineering so important today. One is the way security breaches and failures are understood culturally across the industry. Security breaches tend to be seen as either isolated attacks or 'holes' within software - anomalies that should have been thought of but weren't. In turn, this leads to a spiral of failures. Rather than thinking about cybersecurity in a holistic and systematic manner, the focus is all too often on simply identifying weaknesses as they happen and putting changes in place to stop them from happening again. You can see this approach even in the way organizations communicate after high-profile attacks have taken place - 'we're taking steps to ensure nothing like this ever happens again.' While that sentiment is important for both customers and shareholders to hear, it also betrays exactly the problems Rinehart and Nwatu appear to be talking about.

The second problem is more about the nature of software today. As the world moves to distributed systems, built on a range of services and with an extensive set of software dependencies, vulnerabilities naturally begin to increase too. "Where systems are becoming more and more distributed, ephemeral, and immutable in how they operate… it is becoming difficult to comprehend the operational state and health of our systems' security," Rinehart and Nwatu explain. When you take the cultural issues and the evolution of software together, it becomes clear that the only way cybersecurity is going to properly tackle today's challenges is with an extensive rethink of how and why things happen.

What security chaos engineering looks like in practice

If you want to think about what the transition to security chaos engineering actually means in practice, a good way to think about it is as a shift in mindset - one that doesn't focus on isolated issues but instead on the overall health of the system. Essentially, you start with a different question: don't ask 'where are the potential vulnerabilities in our software?', ask 'where are the potential points of failure in the system?' Rinehart and Nwatu explain: "Failures consist not only of IT, business, and general human factors but also the way we design, build, implement, configure, operate, observe, and manage security controls.
People are the ones designing, building, monitoring, and managing the security controls we put in place to defend against malicious attackers."

By focusing on questions of system design and decision making, you can begin to capture security threats that you might otherwise miss. So, while malicious attacks might account for 47% of all security breaches, human error and system glitches combined account for 53%. This means that while we're all worrying about the hooded hacker who dominates stock imagery, someone may have made a simple mistake that just about any software-savvy criminal could take advantage of.

How is security chaos engineering different from penetration testing?

Security chaos engineering looks a lot like penetration testing, right? After all, the whole point of pentesting is, like chaos engineering, to determine weaknesses before they can have an impact. But there are some important differences that shouldn't be ignored. Again, the key difference is the mindset behind each. Penetration testing is, for the most part, an event. It's something you do when you've updated or changed something significant, and it has a very specific purpose. That's not a bad thing, but with such a well-defined testing context, you might miss security issues you hadn't even considered. And if you consider the complexity of a given software system, whose state changes according to the services and requests it is handling, it's incredibly difficult - not to mention expensive - to pentest an application in every single possible state. Security chaos engineering tackles that by actively experimenting on the software system to better understand it. The context in which it takes place is wide-reaching and ongoing, not isolated and particular.

ChaoSlingr, the security chaos engineering tool

ChaoSlingr is perhaps the most prominent tool out there to help you actually do security chaos engineering. Built for AWS, it allows you to perform a number of different 'security chaos experiments' in the cloud. Essentially, ChaoSlingr pushes failures into the system in a way that allows you to not only identify security issues but also better understand your infrastructure. This SlideShare deck, put together by Aaron Rinehart himself, is a good introduction to how it works in a little more detail. Security teams have typically focused on preventive security measures. ChaoSlingr empowers teams to dig deeper into their systems and improve them in ways that mitigate security risks. It allows you to be proactive rather than reactive.

The future is security chaos engineering

Chaos engineering has not quite taken off - yet. But it's clear that the principles behind it are having an impact across software engineering. In particular, at a time when ever-evolving software feels so vulnerable - fragile, even - applying it to cybersecurity feels incredibly pertinent and important. It's true that the shift in mindset is going to be tough. But if we can begin to distrust our assumptions, experiment on our systems, and try to better understand how and why they work the way they do, we are certainly moving towards a healthier and more secure software world.
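For a flavor of what a security chaos experiment can look like in code, here is a minimal sketch in the spirit of ChaoSlingr's port-misconfiguration experiment - not the tool's actual implementation - using boto3 and a hypothetical security group ID. The point is the loop: inject a plausible misconfiguration, check whether your detection pipeline noticed, then roll it back, always against non-critical infrastructure.

```python
import boto3

ec2 = boto3.client("ec2")
SG_ID = "sg-0123456789abcdef0"  # hypothetical, non-critical security group

# Inject the failure: open an unexpected port, as a human error might
rule = {
    "IpProtocol": "tcp",
    "FromPort": 2323,                      # a port nothing should be using
    "ToPort": 2323,
    "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
}
ec2.authorize_security_group_ingress(GroupId=SG_ID, IpPermissions=[rule])

# The experiment: did monitoring or alerting notice within the time you expect?
# (Check your SIEM, Config rules, or alert channel here.)

# Always roll the change back, whether or not it was detected
ec2.revoke_security_group_ingress(GroupId=SG_ID, IpPermissions=[rule])
```

Run on a schedule, an experiment like this tests the health of the whole detection system rather than hunting for one specific hole.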
Chaos Conf 2018 Recap: Chaos engineering hits maturity as community moves towards controlled experimentation
Chaos engineering platform Gremlin announces $18 million series B funding and new feature for "full-stack resiliency"
Gremlin makes chaos engineering with Docker easier with new container discovery feature


What are generative adversarial networks (GANs) and how do they work? [Video]

Richard Gall
11 Sep 2018
3 min read
Generative adversarial networks, or GANs, are a powerful type of neural network used for unsupervised machine learning. Made up of two models that run in competition with one another, GANs are able to capture and copy variations within a dataset. They're great for image manipulation and generation, but they can also be deployed for tasks like understanding risk and recovery in healthcare and pharmacology.

GANs are actually pretty new - they were first introduced by Ian Goodfellow in 2014. Goodfellow developed them to tackle some of the issues with similar neural networks, including the Boltzmann machine and autoencoders. Both of those approaches rely on Markov chains, which carry a pretty high computational cost. GANs avoid that cost, and this efficiency gives engineers significant gains - which you need if you're working at the cutting edge of artificial intelligence.

How do generative adversarial networks work?

Let's start with a simple analogy. You have a painting - say, the Mona Lisa - and a master forger who wants to create a duplicate. The forger does this by learning how the original painter - Leonardo da Vinci - produced the painting. Meanwhile, an investigator tries to catch the forger and 'second guess' the rules the forger is learning. To map this onto the architecture of a GAN: the forger is the generator network, which learns the distribution of classes, while the investigator is the discriminator network, which learns the boundaries between those classes - the formal 'shape' of the dataset.

Applications of GANs

Generative adversarial networks are used for a number of different applications. One of the best examples is a Google Brain project from 2016, in which researchers used GANs to develop a method of encryption. The project used three neural networks - Alice, Bob, and Eve. Alice's job was to send an encrypted message to Bob; Bob's job was to decode that message; and Eve's job was to intercept it. To begin with, Alice's messages were easily intercepted by Eve. However, thanks to Eve's adversarial work, Alice began to develop its own encryption strategy - it took 15,000 runs for Alice to successfully encrypt a message that Bob could decipher but Eve couldn't intercept. Elsewhere, GANs are also being used in fields such as drug research. The networks can be trained on existing drugs and suggest new synthetic chemical structures that improve on them.

Generative adversarial networks: the cutting edge of artificial intelligence

As we've seen, GANs offer some really exciting opportunities in artificial intelligence. There are two key advantages to remember: GANs solve the problem of generating data when you don't have enough to begin with, and they require no human supervision. This is crucial when you think about the cutting edge of artificial intelligence, both in terms of the efficiency of running the models and the real-world data we want to use - which could be of poor quality or have privacy and confidentiality issues, as much healthcare data does.
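To make the forger-and-investigator dance concrete, here is a minimal, hypothetical PyTorch sketch in which a generator learns to mimic a simple 1-D Gaussian; real image GANs differ mainly in scale and architecture, not in this core loop.

```python
import torch
import torch.nn as nn

# The real data distribution the "forger" must learn to imitate: N(4, 1.25)
def real_batch(n=64):
    return torch.randn(n, 1) * 1.25 + 4.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # forger
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # investigator
loss = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    # Train the discriminator: label real samples 1, generated samples 0
    real, fake = real_batch(), G(torch.randn(64, 8)).detach()
    d_loss = loss(D(real), torch.ones(64, 1)) + loss(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to fool the discriminator into outputting 1
    fake = G(torch.randn(64, 8))
    g_loss = loss(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print("generated mean/std:", fake.mean().item(), fake.std().item())  # approaches 4 and 1.25
```

Neither network ever sees a labeled example; the "supervision" comes entirely from the adversarial game, which is why GANs count as unsupervised learning.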

5 Ways Artificial Intelligence is Transforming the Gaming Industry

Amey Varangaonkar
01 Dec 2017
7 min read
Imagine yourself playing a strategy game - Age of Empires, perhaps. You are in a world that looks real, you are pitted against the computer, and your mission is to protect your empire while defeating the computer. What if you could create an army of soldiers who could explore the map and attack enemies on their own, based on a single command you give them? What if your soldiers could have real, unscripted conversations with you, their commander-in-chief, to seek instructions? And what if the game's scenes changed spontaneously based on your decisions and interactions with the game's elements, like a movie? Sounds too good to be true? It's not far-fetched at all - thanks to the rise of artificial intelligence!

The gaming industry today is a market worth over a hundred billion dollars. The Global Games Market Report says that about 2.2 billion gamers across the world are expected to generate an incredible $108.9 billion in game revenue by the end of 2017. As such, gaming industry giants are seeking newer and more innovative ways to attract more customers and expand their brands. While terms like Virtual Reality, Augmented Reality, and Mixed Reality come to mind immediately as the future of games, the rise of artificial intelligence is an equally important stepping stone in making games smarter and more interactive, and as close to reality as possible. In this article, we look at five ways AI is revolutionizing the gaming industry.

Making games smarter

While scripting is still commonly used to control NPCs (non-playable characters) in many games today, many heuristic algorithms and game AIs are also being incorporated to control these NPCs. Not only that, the characters also learn from the actions taken by the player and modify their behaviour accordingly. This concept can be seen implemented in Nintendogs, a real-time pet simulation video game by Nintendo. The ultimate aim for game creators in the future will be to design robust systems within games that understand speech, noise, and other sounds within the game and tweak the game scenario accordingly. This will also require modern AI techniques such as pattern recognition and reinforcement learning, where the characters within the games self-learn from their own actions and evolve accordingly. The game industry has identified this, and some studios have started implementing these ideas - games like F.E.A.R. and The Sims are a testament to this. Although the adoption of popular AI techniques in gaming is still quite limited, their possible applications in the near future have the entire gaming industry buzzing.

Making games more realistic

This is one area where the game industry has grown leaps and bounds over the last 10 years. There have been incredible advancements in 3D visualization techniques, physics-based simulations, and, more recently, the inclusion of Virtual Reality and Augmented Reality in games. These tools have empowered game developers to create interactive, visually appealing games that one could never have imagined a decade ago. Meanwhile, gamers have evolved too. They don't just want good graphics anymore; they want games to resemble reality. This is a massive challenge for game developers, and AI is playing a huge role in addressing it. Imagine a game that can interpret and respond to your in-game actions, anticipate your next move, and act accordingly.
Not the usual scripts where an action X gives a response Y, but an AI program that chooses the best possible response to your action in real time, making the game more realistic and enjoyable for you.

Improving the overall gaming experience

Let's take a real-world example here. If you've played EA Sports' FIFA 17, you may be well-versed in its Ultimate Team mode. For the uninitiated, it's a fantasy draft where you pick one of five player choices for each position in your team, and the AI automatically determines the team chemistry based on your choices. Team chemistry matters because the higher it is, the better the chances of your team playing well. The in-game AI also improves the playing experience by making it more interactive. Suppose you're losing a match against an opponent - the AI reacts by boosting your team's morale through increased fan chants, which in turn affects player performance positively. Gamers these days pay a lot of attention to detail - not only the visual appearance and high-end graphics, but also how immersive and interactive the game is, in all possible ways. Through real-time customization of scenarios, AI has the capability to take the gaming experience to the next level.

Transforming developer skills

The game developer community has always been an innovator in adopting cutting-edge technology to hone its technical skills and creativity. Reinforcement learning - a subset of machine learning and the approach behind AlphaGo, the AI program that beat the world's best human Go player - is a case in point. Even for traditional game developers, the rising adoption of AI in games will mean a change in the way games are developed. In an interview with Gamasutra, AiGameDev.com's Alex Champandard says something interesting: "Game design that hinges on more advanced AI techniques is slowly but surely becoming more commonplace. Developers are more willing to let go and embrace more complex systems." It's safe to say that the notion of game AI is changing drastically. Concepts such as smarter function-based movement, pathfinding, genetic algorithms, and rule-based AI such as fuzzy logic are being increasingly incorporated into games, although not yet at a very large scale. There are currently some implementation challenges around bringing academic AI techniques into games, but with time these AI algorithms and techniques are expected to blend more seamlessly with traditional game development skills. As such, in addition to knowledge of traditional game development tools and techniques, game developers will now have to skill up on these AI techniques to make smarter, more realistic, and more interactive games.

Making smarter mobile games

The rise of the mobile game industry is evident from the fact that close to 50% of game revenue in 2017 will come from mobile games, on smartphones or tablets. The increasingly high processing power of these devices has allowed developers to create more interactive and immersive mobile games. However, it is important to note that the processing power of mobile devices has yet to catch up with their desktop counterparts, not to mention dedicated gaming consoles, which remain beyond comparison at this stage.
To tackle this issue, mobile game developers are experimenting with different machine learning and AI algorithms to impart ‘smartness’ to mobile games while still adhering to the processing power limits. Compare today’s mobile games to the ones from 5 years back, and you’ll notice a tremendous shift in terms of visual appearance and how interactive they have become. New machine learning and deep learning frameworks and libraries are being developed to cater specifically to the mobile platform - Google’s TensorFlow Lite and Facebook’s Caffe2 are examples of this development. Soon, these tools will come to developers’ rescue to build smarter and more interactive mobile games.

In Conclusion

Gone are the days when games were just about entertainment and passing time. The gaming industry is now one of the most profitable industries of today. As it continues to grow, the demands of the gaming community and the games themselves keep evolving. The need for realism in games is higher than ever, and AI has an important role to play in making games more interactive, immersive and intelligent. With the rate at which new AI techniques and algorithms are developing, it’s an exciting time for game developers to showcase their full potential. Are you ready to start building AI for your own games? Here are some books to help you get started:

Practical Game AI Programming
Learning game AI programming with Lua

What software stack does Airbnb use?
Richard Gall
20 Aug 2017
4 min read
Airbnb is one of the most disruptive organizations of the last decade. Since its inception in 2008, the company has developed a platform that allows people to ‘belong anywhere’ (to quote its own mission statement). In doing so, the very nature of tourism has changed. But what software does Airbnb use? What tools are enabling this level of innovation?

How Airbnb develops a dynamic front end

Let’s start with the key challenge for Airbnb. Like many similar platforms, one of the central difficulties is handling data in a way that’s incredibly dynamic. That means you need to ensure your JavaScript is working hard for you without taking too much strain. That’s where a reactive approach comes in. As an asynchronous paradigm, it’s able to manage how data moves from its source to the components that react to it. But the paradigm can only do so much. By using ReactJS, Airbnb has a library capable of giving its UI the necessary dynamism. The Airbnb team have written a lot about their love for ReactJS, making it their canonical front-end framework in 2015. But they’ve also built a large number of other tools around React to make life easier for their engineers. In this post, for example, the team discuss React Sketch.app, which ‘allows you to write React components that render to Sketch documents.’ Elsewhere, Ruby also forms an important part of the development stack. However, as with React, the team are committed to innovating with the tools at their disposal. In this post, they discuss how they built blazing-fast Thrift bindings for Ruby with C extensions.

How Airbnb manages data

If managing data on the front end has been a crucial consideration, what about the tools that actually manage and store data? The company uses MySQL to manage core business data; this hasn’t been without challenges - not least because of scalability. However, the team have found ways of making MySQL work to their advantage. Redis is also worth a mention here - read here how Airbnb uses Redis to monitor customer issues at scale. But Airbnb has always been a big data company at heart - that’s why Hadoop is so important to its data infrastructure. A number of years ago, Airbnb ran Hadoop on Mesos, which allows you to deploy a single configuration on different servers; this worked for a while but, owing to a number of challenges (which you can read about here), the team moved away from Mesos to run a more straightforward Hadoop infrastructure. Spark is also an important tool for Airbnb. The team actually built something called Airstream, a computational framework that sits on top of Spark Streaming and Spark SQL, allowing engineers and the data team to get quick insights. Ultimately, for an organization that depends on predictions and machine learning, something like Spark - alongside other open source machine learning libraries - is crucial in the Airbnb stack.

Cloud - how Airbnb takes advantage of AWS

If you take a close look at how they work, the Airbnb team have a true hacker mentality: it’s about playing, building, and creating new tools to tackle new challenges. This has arguably been enabled by the way they use AWS. It’s perhaps no coincidence that around the time Airbnb was picking up speed and establishing itself, the Amazon cloud offering was reaching maturity. Airbnb adopted a number of AWS services, such as S3 and EC2, early on. But the reason Airbnb have stuck with AWS comes down to cultural fit.
“For us, an investment in AWS is really about making sure our engineers are focused on the things that are uniquely core to our business. Everything that we do in engineering is ultimately about creating great matches between people,” Kevin Rice, Director of Engineering, has said.

How Airbnb creates a DevOps culture

But there’s more to it than AWS; there’s a real DevOps culture inside Airbnb that further facilitates a mixture of agility and creativity. The tools used for DevOps are an interesting mix - some unsurprising, like GitHub and Nginx (which powers some of the busiest sites on the planet), and some slightly more surprising, such as Kibana, which the company uses to monitor data alongside Elasticsearch. When it comes to developing and provisioning environments, Airbnb use Vagrant and Chef. It’s easy to see the benefits here - they make setting up and configuring environments incredibly easy and fast. And if you’re going to live by the principles of DevOps, this is essential - it’s the foundation of everything you do.

5 best practices to perform data wrangling with Python
Savia Lobo
18 Oct 2018
5 min read
Data wrangling is the process of cleaning and structuring complex data sets for easy analysis and speedy decision-making. Thanks to the explosion of the internet and the huge trove of IoT devices, a massive amount of data is available today. However, this data is most often in its raw form and includes a lot of noise - unnecessary data, broken data, and so on. Cleaning up this data is essential before organizations can use it for analysis, and data wrangling plays a very important role here. Python also has built-in features for applying wrangling methods to various data sets to achieve your analytical goal. Here are 5 best practices that will help you in your data wrangling journey with Python. At the end of it, all you’ll have is clean, ready-to-use data for your business needs.

5 best practices for data wrangling with Python

Learn the data structures in Python really well

Designed to be a very high-level language, Python offers an array of amazing data structures with great built-in methods. Having a solid grasp of all their capabilities will be a potent weapon in your repertoire for handling data wrangling tasks. For example, a dictionary in Python can act almost like a mini in-memory database with key-value pairs; it supports extremely fast retrieval and search by utilizing a hash table underneath. Explore other built-in libraries related to these data structures, e.g. OrderedDict, or the string library for advanced functions. Build your own versions of essential data structures like stacks, queues, heaps, and trees using classes and basic structures, and keep them handy for quick data retrieval and traversal.

Learn and practice file and OS handling in Python

Learn how to open and manipulate files, and how to manipulate and navigate directory structures.

Have a solid understanding of the core data types and capabilities of Numpy and Pandas

Learn how to create, access, sort, and search a Numpy array. Always consider whether you can replace a conventional list traversal (a for loop) with a vectorized operation - this will increase the speed of your data operations. Explore special file types like .npy (Numpy’s native storage format) to read large data sets much faster than from a usual list. Know in detail all the file types you can read using built-in Pandas methods - this will greatly simplify your data scraping. Almost all of these methods have great data cleaning and other checks built in, so try to use such optimized routines instead of writing your own to speed up the process.

Build a good understanding of basic statistical tests and a panache for visualization

Running some standard statistical tests can quickly give you an idea of the quality of the data you need to wrangle. Plot data often, even if it is multi-dimensional. Do not try to create fancy 3D plots; learn to explore a simple set of pairwise scatter plots instead. Use boxplots often to see the spread and range of the data and to detect outliers. For time-series data, learn the basic concepts of ARIMA modeling to check the sanity of the data.

Apart from Python, if you want to master one language, go for SQL

As a data engineer, you will inevitably run across situations where you have to read from large, conventional database storage. Even if you use a Python interface to access such a database, it is always a good idea to know the basic concepts of database management and relational algebra. This knowledge will help you build on these foundations and move into the world of Big Data and massive data mining (technologies like Hadoop, Pig, Hive, and Impala) easily; your basic data wrangling knowledge will surely help you deal with such scenarios.
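As a quick, hedged illustration of the first and third practices above, here is a minimal Python sketch - the file name and column names are invented for the example:

import numpy as np
import pandas as pd

# A dict as a mini in-memory database: hash-table-backed, O(1) lookups.
price_lookup = {"SKU-001": 19.99, "SKU-002": 4.50}
print(price_lookup.get("SKU-001"))

# Prefer vectorized NumPy operations over explicit Python loops.
values = np.random.rand(1_000_000)
total = values.sum()                 # vectorized; far faster than a for loop

# .npy is NumPy's native binary format - much quicker to reload than text.
np.save("values.npy", values)
reloaded = np.load("values.npy")

# Pandas readers have cleaning checks built in: here we parse dates and
# treat "NA" strings as missing. "sales.csv" and its columns are hypothetical.
df = pd.read_csv("sales.csv", parse_dates=["order_date"], na_values=["NA"])
print(df.describe())                 # quick statistical sanity check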
Although data wrangling may be the most time-consuming process, it is the most important part of data management. The data that businesses collect every day can help them make decisions based on the latest information available. It also allows businesses to find hidden insights, use them in decision-making processes, and gain new analytic initiatives, improved reporting efficiency and much more.

About the authors

Dr. Tirthajyoti Sarkar works in the San Francisco Bay Area as a senior semiconductor technologist, where he designs state-of-the-art power management products and applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He has 15+ years of R&D experience and is a senior member of the IEEE.

Shubhadeep Roychowdhury works as a Sr. Software Engineer at a Paris-based cybersecurity startup, where he applies state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products.

Data cleaning is the worst part of data analysis, say data scientists
Python, Tensorflow, Excel and more – Data professionals reveal their top tools
Manipulating text data using Python Regular Expressions (regex)

How self-service analytics is changing modern-day businesses
Amey Varangaonkar
20 Nov 2017
6 min read
To stay competitive in today’s economic environment, organizations can no longer rely on just their IT team for all their data consumption needs. At the same time, the need for quick insights to make smarter and more accurate business decisions is now stronger than ever. As a result, there has been a sharp rise in a new kind of analytics in which the information seekers can themselves create and access a specific set of reports and dashboards - without IT intervention. This is popularly termed self-service analytics.

Gartner defines self-service analytics as: “A form of business intelligence (BI) in which line-of-business professionals are enabled and encouraged to perform queries and generate reports on their own, with nominal IT support.”

Expected to become a $10 billion market by 2022, self-service analytics is characterized by simple, intuitive and interactive BI tools that have basic analytic and reporting capabilities with a focus on easy data access. It empowers business users to access relevant data and extract insights from it without needing to be experts in statistical analysis or data mining. Today, many tools and platforms for self-service analytics are already on the market - Tableau, Microsoft Power BI, IBM Watson, Qlikview and Qlik Sense being some of the major ones. Not only have these empowered users to perform all kinds of analytics with accuracy, but their reasonable pricing, in-tool guidance and sheer ease of use have also made them very popular among business users.

Rise of the Citizen Data Scientist

The rise in popularity of self-service analytics has led to the coining of a media-favored term - ‘citizen data scientist’. But what does the term mean? Citizen data scientists are business users and other professionals who can perform less intensive data-related tasks such as data exploration, visualization and reporting on their own, using just the self-service BI tools. If Gartner’s predictions are to be believed, there will be more citizen data scientists in 2019 than traditional data scientists, performing a variety of analytics-related tasks.

How Self-service Analytics benefits businesses

Allowing the end users within a business to perform their own analysis has some important advantages over the traditional BI platforms:

The time taken to arrive at crucial business insights is drastically reduced, because teams don’t have to rely on the IT team to deliver specific reports and dashboards based on the organizational data.
Quicker insights from self-service BI tools mean businesses can make decisions faster, with higher confidence, and deploy appropriate strategies to maximize business goals.
Because of the relative ease of use, business users can get up to speed with a self-service BI tool or platform in no time and with very little training, compared to being trained on complex BI solutions. This means relatively lower training costs and a democratization of BI analytics, which in turn reduces the workload on the IT team and allows them to focus on their own core tasks.
Self-service analytics helps users manage data from disparate sources more efficiently, allowing organizations to be more agile in handling new business requirements.

Challenges in Self-service analytics

While self-service analytics platforms offer many benefits, they come with their own set of challenges too.
Let’s see some of them:

Defining a clear role for the IT team within the business, by addressing concerns such as: identifying the right BI tool for the business (among the many tools out there, identifying the best fit is very important); identifying which processes and business groups can make the best use of self-service BI, and who may require assistance from IT; setting up the right infrastructure and support system for data analysis and reporting; and answering questions like who will design complex models and perform high-level data analysis. Thus, rather than becoming secondary to the business, the role of the IT team becomes even more important when adopting a self-service business intelligence solution.

Defining a strict data governance policy. This is a critical task, as unauthorized access to organizational data can be detrimental to the business. Identifying the right ‘power users’ - the users who need access to the data and the tools - deciding the level of access that needs to be given to them, and ensuring the integrity and security of the data are some of the key factors that need to be kept in mind. The IT team plays a major role in establishing strict data governance policies and ensuring the data is safe, secure and shared with the right users for self-service analytics.

Asking the right kind of questions of the data. When users who aren’t analysts get access to data and the self-service tools, asking the right questions of the data becomes highly important in order to get useful, actionable insights from it. Incorrect analysis can result in wrong or insufficient findings, which might lead to poor decision-making. Regular training sessions and support systems can help a business overcome this challenge.

To read more about the limitations of self-service BI, check out this interesting article.

In Conclusion

IDC has predicted that spending on self-service BI tools will grow 2.5 times faster than spending on traditional IT-controlled BI tools by 2020. This is an indicator that many organizations worldwide, of all sizes, will increasingly see self-service analytics as a feasible and profitable way forward. Today, mainstream adoption of self-service analytics still appears to be in its early stages, due to a general lack of awareness among businesses. Many organizations still depend on the IT team or an internal analytics team for all their data-driven decision-making tasks. As we have already seen, this comes with a lot of limitations - limitations that can easily be overcome by the adoption of a self-service culture in analytics, boosting the speed, ease of use and quality of the analytics. By shifting most of the reporting work to the power users, and by establishing the right data governance policies, businesses with a self-service BI strategy can grow a culture that fuels agile thinking and innovation, and is thus ready for success in the marketplace. If you’re interested in learning more about popular self-service BI tools, these are some of our premium products to help you get started:

Learning Tableau 10
Tableau 10 Business Intelligence Cookbook
Learning IBM Watson Analytics
QlikView 11 for Developers
Microsoft Power BI Cookbook

How AI is going to transform the Data Center
Melisha Dsouza
13 Sep 2018
7 min read
According to Gartner analyst Dave Cappuccio, 80% of enterprises will have shut down their traditional data centers by 2025, compared to just 10% today. The figures are fitting, considering the host of problems faced by traditional data centers. The solution to these problems lies right in front of us: incorporating intelligence into traditional data centers. To support this claim, Gartner also predicts that by 2020, more than 30 percent of data centers that fail to implement AI and machine learning will cease to be operationally and economically viable. Across the globe, data science and AI are influencing the design and development of modern data centers. With the surge in the amount of data every day, traditional data centers will eventually become slow and produce inefficient output. By utilizing AI in ingenious ways, data center operators can drive efficiency up and costs down. A fitting example of this is the tier-two automated control system implemented at Google to cool its data centers autonomously. The system makes all the cooling-plant tweaks on its own, continuously and in real time, saving up to 30% of the plant’s energy annually. (Source: Data Center Knowledge)

AI has enabled data center operators to add more workloads on the same physical silicon architecture. They can aggregate and analyze data quickly and generate productive outputs, which is especially beneficial to companies that deal with immense amounts of data, like hospitals, genomic systems, airports, and media companies.

How is AI facilitating data centers?

Let’s look at some of the ways in which intelligent data centers solve the issues faced by traditionally operated data centers.

#1 Energy Efficiency

The Delta Air Lines data center outage in 2016, attributed to an electrical-system failure over a three-day period, cost the airline around $150 million and grounded about 2,000 flights. This situation could have been easily averted had the data center used machine learning in its operations. As data centers get ever bigger, more complex and increasingly connected to the cloud, artificial intelligence is becoming an essential tool for keeping things from overheating and saving power at the same time. According to the Energy Department’s U.S. Data Center Energy Usage Report, the power usage of data centers in the United States has grown at about a 4 percent rate annually since 2010 and is expected to hit 73 billion kilowatt-hours by 2020, more than 1.8 percent of the country’s total electricity use. Data centers also contribute about 2 percent of the world’s greenhouse gas emissions. AI techniques can do a lot to make processes more efficient, more secure and less expensive. One of the keys to better efficiency is keeping things cool, a necessity in any area of computing. Google and DeepMind’s (Alphabet’s AI division) use of AI to directly control data center cooling has reduced energy use by about 30 percent.

#2 Server optimization

Data centers have to maintain physical servers and storage equipment. AI-based predictive analysis can help data centers distribute workloads across their many servers, making data center loads more predictable and more easily manageable. The latest load balancing tools with built-in AI capabilities are able to learn from past data and run load distribution more efficiently. Companies will be able to better track server performance, disk utilization, and network congestion.
Optimizing server storage systems, finding possible fault points in the system, improving processing times and reducing risk factors will all become faster, which in turn facilitates the maximum possible server optimization.

#3 Failure prediction / troubleshooting

Unplanned downtime in a data center can lead to serious losses. Data center operators need to quickly identify the root cause of a failure, so they can prioritize troubleshooting and get the data center up and running before any data loss or business impact takes place. Self-managing data centers make use of AI-based deep learning (DL) applications to predict failures ahead of time, and ML-based recommendation systems can then suggest appropriate fixes in time. Take, for instance, the HPE artificial intelligence predictive engine that identifies and solves trouble in the data center. Signatures are built to identify other users that might be affected, and rules are then developed to instigate a solution, which can be automated. The AI/machine-learning solution can quickly intervene across the entire system and stop others from inheriting the same issue.
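As a minimal, hedged sketch of the idea behind such failure prediction - real systems train models on far richer telemetry, and the readings and threshold below are invented for illustration:

import numpy as np

# Flag a server whose temperature drifts far above its recent rolling
# baseline - a toy stand-in for learned failure prediction.
readings = np.array([61, 62, 60, 63, 61, 62, 79, 81, 84])  # degrees C, hypothetical

window = 5
for i in range(window, len(readings)):
    baseline = readings[i - window:i].mean()
    deviation = readings[i] - baseline
    if deviation > 10:  # alert threshold, chosen arbitrarily here
        print(f"reading {i}: {readings[i]}C is {deviation:.1f}C above baseline -> alert")

The same pattern - learn a baseline, flag deviations, trigger an automated fix - is what the production systems described above apply across thousands of metrics at once.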
#4 Intelligent Monitoring and storing of Data

By incorporating machine learning, AI can take over the mundane job of monitoring huge amounts of data and make IT professionals more efficient in terms of the quality of tasks they handle. Litbit has developed the first AI-powered data center operator, Dac. It uses a human-to-machine learning interface that combines existing human employee knowledge with real-time data. Incorporating over 11,000 pieces of innate knowledge, Dac has the potential to hear when a machine is close to failing, feel vibration patterns that are bad for HDD I/O, and spot intruders. Dac is proof of how AI can help monitor networks efficiently. Along with monitoring data, it is also necessary to store vast amounts of data securely. AI holds the potential to make more intelligent decisions on storage optimization and tiering, helping transform storage management by learning I/O patterns and data lifecycles.

Mixed views on the future of AI in data centers?

Let’s face the truth: the complexity that comes with huge masses of data is often difficult to handle. Humans are not as scalable as an automated solution when it comes to handling data with precision and efficiency. Take Cisco’s M5 Unified Computing or HPE’s InfoSight as examples; both try to alleviate the fact that humans are increasingly unable to deal with the complexity of a modern data center. One consequence of using automated systems is the possibility of humans losing their jobs and being replaced by machines, to varying degrees depending on the nature of the job role. AI is predicted to open the doors to robots and automated machines that will soon perform repetitive tasks in data centers. On the bright side, organizations could allow employees, freed from repetitive and mundane tasks, to invest their time in the more productive and creative aspects of running a data center. In addition to new job profiles, the capital involved in setting up and maintaining a data center is huge; add AI to the data center and you may have to invest double or triple that amount to keep everything running smoothly. Managing and storing all of the operational log data for analysis also comes with its own set of issues, as the log data that acts as the input to these ML systems becomes a larger data set than the application data itself. Hence, firms need a proper plan in place to manage all of this data.

Embracing AI in data centers promises greater financial benefits while attracting more customers. It will be interesting to see which tech companies follow in Google’s footsteps and implement AI in their data centers. Tech companies should definitely watch this space to take their data center operations up a notch.

5 ways artificial intelligence is upgrading software engineering
Intelligent Edge Analytics: 7 ways machine learning is driving edge computing adoption in 2018
15 millions jobs in Britain at stake with Artificial Intelligence robots set to replace humans at workforce

2018 is the year of graph databases. Here's why.
Amey Varangaonkar
04 May 2018
5 min read
With the explosion of data, businesses are looking to innovate as they connect their operations to a whole host of different technologies. The need for consistency across all data elements is now stronger than ever, and that’s where graph databases come in handy. Because they allow for a high level of flexibility in representing your data and in handling complex interactions between its elements, graph databases are considered by many to be the next big trend in databases. In this article, we dive deep into the current graph database scene and list the top 3 reasons why graph databases will continue to soar in popularity in 2018.

What are graph databases, anyway?

Simply put, graph databases are databases that follow the graph model. What is a graph model, then? In mathematical terms, a graph is simply a collection of nodes, with different nodes connected by edges. Each node contains some information about the graph, while edges denote the connections between the nodes. How are graph databases different from relational databases, you might ask? The key difference between the two is that graph data models allow for more flexible and fine-grained relationships between data objects than relational models do. There are some more differences between the graph data model and the relational data model, which you should read through for more information. Often, you will see that graph databases are schema-less. This allows for a very flexible data model, much like the document or key/value store database models. A unique feature of graph databases, however, is that they also support relationships between data objects, like a relational database. This is useful because it allows for a more flexible and faster database, which can be invaluable when your project demands quick response times.

(Image courtesy DB-Engines)

The rise in popularity of graph database models over the last 5 years has been stunning, but not exactly surprising. If we were to drill down to the 3 key factors that have propelled the popularity of graph databases to a whole new level, what would they be? Let’s find out.

Major players entering the graph database market

About a decade ago, the graph database family included just Neo4j and a couple of other less popular graph databases. More recently, however, all the major players in the industry, such as Oracle (Oracle Spatial and Graph), Microsoft (Graph Engine), SAP (SAP Hana as a graph store) and IBM (Compose for JanusGraph), have come up with graph offerings of their own. The most recent entrant to the graph database market is Amazon, with Amazon Neptune announced just last year. According to Andy Jassy, CEO of Amazon Web Services, graph databases are becoming part of the growing trend of multi-model databases. Per Jassy, these databases are finding increased adoption in the cloud as they support a myriad of useful data processing methods, and the traditional over-reliance on relational databases is slowly breaking down.

Rise of the Cypher Query Language

With graph databases slowly gaining mainstream recognition and adoption, the major companies have identified the need for a standard query language for all graph databases. Similar to SQL, Cypher has emerged as that standard: a widely adopted way to write efficient, easy-to-understand graph queries. As of today, the Cypher Query Language is used in popular graph databases such as Neo4j, SAP Hana, Redis graph and so on. The OpenCypher project, which develops and maintains Cypher, has also released Cypher for popular Big Data frameworks like Apache Spark. Cypher’s popularity has risen tremendously over the last few years, primarily because, like SQL, its declarative nature allows users to state what they want from their graph data without specifying how it should be computed.
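To give a flavour of that declarative style, here is a minimal sketch of a Cypher query run through Neo4j’s official Python driver - the node labels, properties, connection details and credentials are all invented for illustration:

from neo4j import GraphDatabase

# Find customers who bought the same products as a given customer.
# The query states WHAT we want; the database decides HOW to traverse.
query = """
MATCH (c:Customer)-[:BOUGHT]->(p:Product)<-[:BOUGHT]-(other:Customer)
WHERE c.name = $name
RETURN DISTINCT other.name AS similar_customer
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["similar_customer"])
driver.close()

Expressing the same "customers like you" question in SQL would take several self-joins; in Cypher, the relationship pattern is the query.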
Finding critical real-world applications

Graph databases were in the news as early as 2016, when the Panama Papers leaks were unraveled with the help of Neo4j and Linkurious, a data visualization tool. In more recent times, graph databases have found increased application in online recommendation engines, as well as in tasks such as fraud detection and managing social media. Facebook’s search app also uses graph technology to map social relationships. Graph databases are also finding applications in virtual assistants to drive conversations - eBay’s virtual shopping assistant is an example. Even NASA uses a knowledge graph architecture to find critical data.

What next for graph databases?

With the growing adoption of graph databases, we expect graph-based platforms to soon become foundational elements of many corporate tech stacks. The next focus area for these databases will be practical implementations such as graph analytics and building graph-based applications. The rising number of graph databases will also mean more competition, and that is a good thing: competition brings more innovation and enables the incorporation of more cutting-edge features. With a healthy and steadily growing community of developers, data scientists and even business analysts, this evolution may be on the cards sooner than we might expect.

Amazon Neptune: A graph database service for your applications
When, why and how to use Graph analytics for your big data

Why do IT teams need to transition from DevOps to DevSecOps?
Guest Contributor
13 Jul 2019
8 min read
Does your team perform security testing during development? If not, why not? Cybercrime is on the rise, and formjacking, ransomware, and IoT attacks have increased alarmingly in the last year. This makes security a priority at every stage of development. In this kind of ominous environment, development teams around the globe should take a more proactive approach to threat detection. This can be done in a number of ways. There are some basic techniques that development teams can use to protect their development environments. But ultimately, what is needed is an integration of threat identification and management into the development process itself. Integrated processes like this are referred to as DevSecOps, and in this guide, we’ll take you through some of the advantages of transitioning to DevSecOps.

Protect Your Development Environment

First, though, let’s look at some basic measures that can help to protect your development environment. For both individuals and enterprises, online privacy is perhaps the most valuable currency of all. Proxy servers, Tor, and virtual private networks (VPNs) have slowly crept into the lexicon of internet users as cost-effective privacy tools to consider if you want to avoid drawing the attention of hackers. But what about enterprises? Should they use the same tools? They would prefer to avoid hackers as well. The answer here is more complicated. Encryption and authentication should be addressed early in the development process, especially given the common practice of using open source libraries for app coding. The advanced security protocols that power many popular consumer VPN services make them a good first step toward protecting code and any proprietary technology. Additional controls, like using two-factor authentication and limiting who has access, will further protect the development environment and procedures. Beyond these basic measures, though, it is also worth looking in detail at your entire development process and integrating security management at every stage. This is sometimes referred to as integrating DevOps and DevSecOps.

DevOps vs. DevSecOps: What's the Difference?

DevOps and DevSecOps are not separate entities, but different facets of the development process. Traditionally, DevOps teams work to integrate software development and implementation in order to facilitate the rapid delivery of new business applications. Since this process omits security testing and solutions, many security flaws and vulnerabilities aren’t addressed early enough in the development process. The new approach, DevSecOps, addresses this omission by automating security-related tasks and integrating controls and functions like composition analysis and configuration management into the development process. Previously, DevSec focused only on automating security code testing, but it is gradually transitioning to incorporate an operations-centric approach. This helps reconcile two environments that are opposite by nature: DevOps is forward-looking because it is geared toward rapid deployment, while development security looks backward to analyze and predict future issues. By prioritizing security analysis and automation, teams can still improve delivery speed without the need to retroactively find and deal with threats.

Best Practices: How DevSecOps Should Work

The goal of current DevSecOps best practices is to implement a shift towards real-time threat detection rather than relying on historical analysis.
This enables more efficient application development that recognizes and deals with issues as they happen, rather than waiting until there’s a problem. This can be done by developing a more effective strategy while adopting DevSecOps practices. When all areas of concern are addressed, it results in:

Automatic code procurement: this eliminates the problem of human error and of incorporating weak or flawed code. It benefits developers by allowing vulnerabilities and flaws to be discovered and corrected earlier in the process.
Uninterrupted security deployment: through the use of automation tools that work in real time, achieved by creating closed-loop testing, reporting, and real-time threat resolution.
Leveraged security resources: automated DevSecOps typically addresses areas related to threat assessment, event monitoring, and code security. This frees your IT or security team to focus on other areas, like threat remediation and elimination.

There are five areas that need to be addressed in order for DevSecOps to be effective:

Code analysis: by delivering code in smaller modules, teams are able to identify and address vulnerabilities faster.

Management changes: adapting the protocol for changes in management or admins allows users to improve on changes faster, and enables security teams to analyze their impact in real time. This eliminates the problem of getting calls about problems with system access after the application is deployed.

Compliance: addressing compliance with the Payment Card Industry Data Security Standard (PCI DSS) and the new General Data Protection Regulation (GDPR) earlier helps prevent failed audits and heavy fines. It also ensures that you have all of your reporting ready to go in the event of a compliance audit.

Automating threat and vulnerability detection: threats evolve and proliferate fast, so security should be agile enough to deal with emerging threats each time code is updated or altered. Automating threat detection earlier in the development process improves response times considerably; a minimal sketch of one such automated check follows this list.

Training programs: comprehensive security response begins with proper IT security training. Developers should craft a training protocol that ensures all personnel responsible for security are up to date and on the same page. Organizations should bring security and IT staff into the process sooner. That means advising current team members of current procedures and ensuring that all new staff are thoroughly trained.
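As a hedged illustration of what automating detection earlier in the process can look like, here is a minimal Python sketch that scans source files for hardcoded credentials before a merge. The patterns are deliberately simplistic and purely illustrative; a real pipeline would use a dedicated scanner:

import re
import sys
from pathlib import Path

# Simplistic, illustrative patterns for hardcoded secrets.
PATTERNS = [
    re.compile(r"(?i)(password|secret|api_key)\s*=\s*['\"][^'\"]+['\"]"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # the shape of an AWS access key ID
]

def scan(root="."):
    findings = 0
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(p.search(line) for p in PATTERNS):
                print(f"{path}:{lineno}: possible hardcoded secret")
                findings += 1
    return findings

if __name__ == "__main__":
    sys.exit(1 if scan() else 0)  # a non-zero exit fails the CI job

Wired into a CI job, a check like this fails the build the moment a credential is committed, instead of the problem surfacing in an audit months later.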
Finding the Right Tools for DevSecOps Success

Does a doctor operate with a chainsaw? Hopefully not. Likewise, all of the above points are nearly impossible to achieve without the right tools to get the job done with precision. What should your DevSec team keep in their toolbox?

Automation tools provide scripted remediation recommendations for detected security threats. One such tool is Automate DAST, which scans new or modified code against security vulnerabilities listed on the Open Web Application Security Project’s (OWASP) list of the most common flaws, such as SQL injection errors - flaws you might have missed during static analysis of your application code.

Attack modeling tools create models of possible attack matrices and map their implications. There are plenty of attack modeling tools available, but a good one for identifying cloud vulnerabilities is Infection Monkey, which simulates attacks against the parts of your infrastructure that run on major public cloud hosts like Google Cloud, AWS, and Azure, as well as most cloud storage providers like Dropbox and pCloud.

Visualization tools are used for evolving, identifying, and sharing findings with the operations team. An example of this type of tool is PortVis, developed by a team led by professor Kwan-Liu Ma at the University of California, Davis. PortVis is designed to display activity by host or port in three different modes: a grid visualization, in which all network activity is displayed on a single grid; a volume visualization, which extends the grid to a three-dimensional volume; and a port visualization, which allows devs to visualize the activity on specific ports over time. Using this tool, different types of attack can be easily distinguished from each other.

Alerting tools prioritize threats and send alerts so that the most hazardous vulnerabilities can be addressed immediately. WhiteSource Bolt, for instance, is a useful tool of this type, designed to improve the security of open source components. It does this by checking components against known security threats and providing security alerts to devs. These alerts also auto-generate issues within GitHub, where devs can see details such as references for the CVE, its CVSS rating, and a suggested fix; there is even an option to assign the vulnerability to another team member using the milestones feature.

The Bottom Line

Combining DevOps and DevSec is not a meshing of two separate disciplines, but rather the natural transition of development to a more comprehensive approach that takes security into account earlier in the process, and does so in a more meaningful way. This saves a lot of time and hassle by addressing enterprise security requirements before deployment rather than probing for flaws later. The sooner your team hops on board with DevSecOps, the better.

Author Bio

Gary Stevens is a front-end developer. He’s a full-time blockchain geek and a volunteer working for the Ethereum foundation, as well as an active Github contributor.

Is DevOps really that different from Agile? No, says Viktor Farcic [Podcast]
Does it make sense to talk about DevOps engineers or DevOps tools?
How Visual Studio Code can help bridge the gap between full-stack development and DevOps

Polyglot persistence: what is it and why does it matter?
Richard Gall
21 Jul 2018
3 min read
Polyglot persistence is a way of storing data. It’s an approach that acknowledges that there is often no one-size-fits-all solution to data storage. From the types of data you’re trying to store to your application architecture, polyglot persistence is a hybrid solution to data management. Think of polyglot programming: if polyglot programming is about using a variety of languages according to the context in which you’re working, polyglot persistence is about applying that principle to database architecture. For example, storing transactional data in Hadoop files is possible, but makes little sense. On the other hand, processing petabytes of internet logs using a Relational Database Management System (RDBMS) would also be ill-advised. These tools were designed to tackle specific types of tasks; even though they can be co-opted to solve other problems, the cost of adapting them to do so would be enormous. It is the virtual equivalent of trying to fit a square peg in a round hole.

Polyglot persistence: an example

Consider a company that sells musical instruments and accessories online (and in a network of shops). At a high level, there are a number of problems the company needs to solve to be successful:

Attract customers to its stores (both virtual and physical).
Present them with relevant products (you would not try to sell a drum kit to a pianist, would you?!).
Once they decide to buy, process the payment and organize shipping.

To solve these problems, the company might choose from a number of technologies, each designed for one of these tasks:

Store all the products in a document-based database such as MongoDB, Cassandra, DynamoDB, or DocumentDB. Document databases offer multiple advantages: flexible schema, sharding (breaking bigger databases into sets of smaller, more manageable ones), high availability, and replication, among others.
Model the recommendations using a graph-based database (such as Neo4j, Tinkerpop/Gremlin, or GraphFrames for Spark): such databases reflect the factual and abstract relationships between customers and their preferences. Mining such a graph is invaluable and can produce a more tailored offering for a customer.
For searching, use a search-tailored solution such as Apache Solr or ElasticSearch, which provides fast, indexed text searching capabilities.
Once a product is sold, the transaction normally has a well-structured schema (product name, price, and so on). To store such data (and later process and report on it), relational databases are best suited.

With polyglot persistence, a company always chooses the right tool for the right job instead of trying to coerce a single technology into solving all of its problems.
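To make the routing idea concrete, here is a minimal, self-contained Python sketch. SQLite (from the standard library) stands in for the relational store, and plain dictionaries stand in for the document and graph stores, since the point is the routing rather than any particular driver; all names are illustrative:

import sqlite3

# Relational store: well-structured transactional data.
relational = sqlite3.connect(":memory:")
relational.execute("CREATE TABLE sales (product TEXT, price REAL)")

document_store = {}                   # flexible-schema product documents
graph_store = {"edges": []}           # customer-preference relationships

def save_product(product_id, doc):
    document_store[product_id] = doc  # no fixed schema required

def link_preference(customer, product_id):
    graph_store["edges"].append((customer, "LIKES", product_id))

def record_sale(product, price):
    relational.execute("INSERT INTO sales VALUES (?, ?)", (product, price))

save_product("p1", {"name": "Stratocaster", "strings": 6, "colour": "sunburst"})
link_preference("alice", "p1")
record_sale("Stratocaster", 1499.00)

Each kind of data lands in the store suited to it; swapping a stand-in for MongoDB or Neo4j changes the driver calls, not the design.

Read next:

How to optimize Hbase for the Cloud [Tutorial]
The trouble with Smart Contracts
Indexing, Replicating, and Sharding in MongoDB [Tutorial]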

Julia for machine learning. Will the new language pick up pace?
Prasad Ramesh
20 Oct 2018
4 min read
Machine learning can be done using many languages, with Python and R being the most popular. But one language has been overlooked for some time: Julia. Why isn’t Julia machine learning a thing?

Julia isn’t an obvious choice for machine learning simply because it’s a new language that has only recently hit version 1.0. While Python is well-established, with a large community and many libraries, Julia simply doesn’t have the community to shout about it. And that’s a shame. Right now, Julia is used in various fields, from optimizing milk production in dairy farms to parallel supercomputing for astronomy. A common theme here is that these applications all require numerical, scientific, and sometimes parallel computation; Julia is well-suited to tasks where intensive computation is essential. Viral Shah, CEO of Julia Computing, said to Forbes: “Amazon, Apple, Disney, Facebook, Ford, Google, Grindr, IBM, Microsoft, NASA, Oracle and Uber are other Julia users, partners and organizations hiring Julia programmers.” Clearly, Julia is powering the analytical nous of some of the most high-profile organizations on the planet. Perhaps it just needs more cheerleading to go truly mainstream.

Why Julia is a great language for machine learning

Julia was originally designed for high-performance numerical analysis. This means that everything that has gone into its design is built for the very things you need to do to build effective machine learning systems.

Speed and functionality

Julia combines functionality from popular languages like Python, R, Matlab, SAS and Stata with the speed of C++ and Java. A lot of standard LaTeX symbols can be used in Julia, with the syntax usually being the same as in LaTeX. This mathematical syntax makes it easy to implement mathematical formulae in code, and makes Julia a natural fit for machine learning. It also has built-in support for parallelism, which allows it to utilize multiple cores at once and makes computation fast. Julia’s loops and functions are fast - fast enough that you would probably notice significant performance differences compared with other languages. Performance can be almost comparable to C, with very little code actually used. With packages like ArrayFire, generic code can be run on GPUs. Julia’s multiple dispatch feature is very useful for defining number and array-like datatypes, and matrices and data tables work with good compatibility and performance. Julia also has automatic garbage collection and a collection of libraries for mathematical calculations, linear algebra, random number generation, and regular expression matching.

Libraries and scalability

Julia machine learning can be done with powerful tools like MLBase.jl, Flux.jl, and Knet.jl, which can be used for machine learning and artificial intelligence systems. It also has a scikit-learn implementation called ScikitLearn.jl. Although ScikitLearn.jl is not an official port, it is a useful additional tool for building machine learning systems with Julia. As if all those weren’t enough, Julia also has TensorFlow.jl and MXNet.jl; so, if you already have experience with these tools in other implementations, the transition is a little easier than learning everything from scratch. Julia is also incredibly scalable: it can be deployed on large clusters quickly, which is vital if you’re working with big data across a distributed system.

Should you consider Julia for machine learning?
Because it’s fast and possesses a great range of features, Julia could potentially overtake both Python and R to become the language of choice for machine learning in the future. Okay, maybe we shouldn’t get ahead of ourselves. But with Julia reaching the 1.0 milestone, and the language rising on the TIOBE index, you certainly shouldn’t rule out Julia when it comes to machine learning. Julia is also available in the popular Jupyter Notebook tool, paving a path for wider adoption.

A note of caution, however, is important. Rather than simply dropping everything for Julia, it is worth monitoring the growth of the language. Over the next 12 to 24 months, we’ll likely see new projects and libraries emerge and the Julia machine learning community expand. If you start hearing more noise about the language, it becomes a much safer option to invest your time and energy in learning it. If you are just starting off with machine learning, you should stick with the more popular languages for now. An experienced engineer who already has a good grip on other languages, however, shouldn’t be scared of experimenting with Julia - it gives you another option, and might just help you uncover new ways of working and solving problems.

Julia 1.0 has just been released
What makes functional programming a viable choice for artificial intelligence projects?
Best Machine Learning Datasets for beginners