
Tech Guides - Data


5 examples of Artificial Intelligence in Web apps

Sugandha Lahoti
20 Aug 2018
7 min read
Modern-day web app development is increasingly focused on building a customer-facing front-end presence with the use of Artificial Intelligence. Web apps use Artificial Intelligence not just for intelligent automation, but also for building recommendation engines, website creation, and image recognition, among other application areas. In this post, we look at five key areas, illustrated by real-world examples, where web apps employ Artificial Intelligence to automate some part of their system.

Recommendation Engines of Amazon and Netflix

Curating content based on the user's context is one of the most widely used AI features in web apps. Amazon, for instance, uses item-based collaborative filtering for product classification. Amazon's recommendation system uses a combination of goods-based recommendation (users are recommended items similar to those they liked in the past) and buddy-based recommendation (users are recommended things their Facebook friends like). Amazon has been using AI for more than just its recommendation system: its AI management strategy is called the Flywheel, where one part of Amazon acts as a catalyst for AI and machine learning growth in other areas.

Read more: Four interesting Amazon patents in 2018 that use machine learning, AR, and robotics

Another popular example is Netflix, which revamped its recommendation algorithm based on visual impressions. One of its research projects indicated that the artwork was not only the biggest influence on a viewer's decision to watch content, but also drew over 82% of their focus while browsing Netflix. This led Netflix to develop a new image recommendation algorithm which works in real time to project the image it thinks the user will respond to. They use implicit data (user behavior) and explicit data (user activity), and then feed this data to machine learning algorithms to figure out the relevant content for each user. For each title, users get the image with the highest rank based on their profile. Alongside this, Netflix continues collecting data from its 100 million other subscribers to improve its engine's performance.

Read more: What software stack does Netflix use?

Google and Microsoft using image recognition

Image recognition can serve multiple uses for web apps, including object and pattern recognition, locating duplicates (exact or partial), image search by fragments, and more. Two such unique applications of image recognition are Google's Quick Draw and Microsoft's Captionbot.ai. Quick Draw is Google's AI-powered web app game, where users have to draw an everyday object that a neural network tries to recognize. Players are given 20 seconds to draw a random item, and Google's neural network tries to match it against the 50 million hand-drawn sketches contributed by other players to identify the correct one. Quick Draw aims to generate the world's largest doodling dataset, which is shared publicly to help further machine learning research. The data preserves user privacy by collecting only anonymous metadata, including timestamp, country code, whether or not the drawing was recognized, and which word the drawing corresponded to. This dataset was used in SketchRNN, a neural network that can draw words and interpolate between drawings. Another image recognition web app is Microsoft's Captionbot.ai. The system can automatically generate a caption for an uploaded photograph, and users can rate how accurately it has detected what was on display.
The algorithm learns from the rating to make the captions more accurate. It uses three separate services to process the images: the Computer Vision API identifies the components of the photo, mixes this with data from the Bing Image API, and runs any faces it spots through the Emotion API. The Emotion API analyses facial expressions to detect anger, contempt, disgust, fear, and other traits. Based on the results from these APIs, the caption is generated.

Google Docs powered by Natural Language Processing

Modern web apps can also be fueled with cognitive capabilities to make them stand apart from other apps. Instances of this include transforming human speech to text or conversing with people in natural language. One example of a web app which includes natural language processing is Google Docs. Google Docs and Slides have an Explore feature to show text, images, and other features relevant to the document that a user is working on at any given point. Docs can also use natural language to search through data and reports, and automatically generate formulas in Sheets. Google Docs recently incorporated an AI grammar checker, announced at Google Cloud Next. It uses a machine translation algorithm to recognize errors and suggest corrections as users type. Google Docs can also be integrated with the Natural Language API to recognize the sentiment of selected text in a Google Doc and highlight it based on that sentiment.

Web-based artificial intelligence chatbots

Web-based chatbots are just like app-based chatbots, except that they interact with users in the website browser. They use AI techniques such as natural language understanding and pattern recognition to store and distinguish between the contexts of the information provided, and to elicit suitable responses for future replies. One example of web-based chatbots is live chat bots, where the conversation with a visitor on a website is automated using a chatbot. Many live chat software companies are already experimenting with chatbots. Examples include the Operator bot used by Intercom, a company building a customer messaging platform, and Driftbot by Drift, which gives your website a personal assistant.

Read More: Top 4 chatbot development frameworks for developers

Another example is AI-based chatbots which help create full websites. Right Click is a startup that introduced an AI-powered chatbot which uses a conversational interface to create websites. It asks general questions during the conversation, like "What industry do you belong to?" and "Why do you want to make a website?", and creates customized templates based on the given answers. Similarly, Wix's Artificial Intelligence Design bot can tailor websites by learning about each person's or business's own needs.

Web-based code helpers using AI

Intelligent coding assistants are gaining popularity with their ability to understand code and provide the right suggestions at the right time. They can analyze code on the web and give fast and smart completions. Codota for Chrome is a smart web-based IDE which can build predictive models of code and suggest code completions and related content based on the current context present in the code. It combines program analysis, natural language processing, and machine learning to learn from the code. Users can look for Codota's icon on code snippets in their browser, on GitHub, Stack Overflow, and other sites. Another example is Deep Cognition's Deep Learning Studio – Cloud.
It is not exactly an IDE, but it features an AI-powered drag-and-drop interface to help design deep learning models with ease. It offers assisted modeling, with automated tensor size calculations and real-time validation, and it also has an AutoML feature to automatically build a neural network.

Even though AI is a great choice to enhance your web apps, an important facet to keep in mind is ensuring the fairness, accuracy, and transparency of your web apps. For instance, web apps powered by natural language should not discriminate against people based on caste, color, or creed, or hurt user sentiments. Similarly, those using neural networks for recognizing images should ensure the filtering of obscene images. Creating such artificial intelligence systems requires a hybrid team of designers, programmers, ML engineers, and researchers. This collective group will have a good grasp of user experience, be comfortable thinking in abstractions and algorithms, and be equally well versed in the social impact of artificial intelligence.

Read More: 20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017
Uber introduces Fusion.js, a plugin-based web development framework for high-performance apps
Electron Fiddle: A 'code playground' for experimenting with cross-platform native apps
Warp: Rust's new web framework for implementing WAI (Web Application Interface)


Top 10 Tools for Computer Vision

Aaron Lazar
05 Apr 2018
7 min read
The adoption of Computer Vision has been steadily picking up pace over the past decade, but there's been a spike in the adoption of various computer vision tools in recent times, thanks to its implementation in fields like IoT, manufacturing, healthcare, security, etc. Computer vision tools have evolved over the years, so much so that computer vision is now also being offered as a service. Moreover, advancements in hardware like GPUs, as well as machine learning tools and frameworks, make computer vision much more powerful in the present day. Major cloud service providers like Google, Microsoft, and AWS have all joined the race towards being the developers' choice. But which tool should you choose? Today I'll take you through a list of the top tools and will help you understand which one to pick, based on your needs.

Computer Vision Tools/Libraries

OpenCV: Any post on computer vision is incomplete without a mention of OpenCV. OpenCV is a great-performing computer vision tool and it works well with C++ as well as Python. OpenCV comes prebuilt with all the necessary techniques and algorithms to perform several image and video processing tasks. It's quite easy to use, and this makes it clearly the most popular computer vision library on the planet! It is multi-platform, allowing you to build applications for Linux, Windows, and Android. At the same time, it does have some drawbacks. It gets a bit slow when working through massive data sets or very large images. Moreover, on its own, it doesn't have GPU support and relies on CUDA for GPU processing.

Matlab: Matlab is a great tool for creating image processing applications and is widely used in research, the reason being that Matlab allows quick prototyping. Another interesting aspect is that Matlab code is quite concise compared to C++, making it easier to read and debug. It tackles errors before execution by proposing some ways to make the code faster. On the downside, Matlab is a paid tool. Also, it can get quite slow at execution time, if that's something that concerns you. Matlab is not your go-to tool in an actual production environment, as it was basically built for prototyping and research.

AForge.NET/Accord.NET: You'll be excited to know that image processing is possible even if you're a C# and .NET developer, thanks to AForge/Accord. It's a great tool that has a lot of filters and is great for image manipulation and different transforms. The Image Processing Lab allows for filtering capabilities like edge detection and more. AForge is extremely simple to use, as all you need to do is adjust parameters from a user interface. Moreover, its processing speeds are quite good. However, AForge doesn't possess the power and capabilities of tools like OpenCV, such as advanced motion picture analysis or advanced processing on images.

TensorFlow: TensorFlow has been gaining popularity over the past couple of years, owing to its power and ease of use. It lets you bring the power of Deep Learning to computer vision and has some great tools to perform image processing/classification, such as its graph-based tensor API. Moreover, you can make use of the Python API to perform face and expression detection. You can also perform classification using techniques like regression. TensorFlow also allows you to perform computer vision at tremendous scale. One of the main drawbacks of TensorFlow is that it's extremely resource-hungry and can devour a GPU's capabilities in no time. Moreover, if you wanted to learn how to perform image processing with TensorFlow, you'd have to understand what machine and deep learning are, write your own algorithms, and then go forward from there.
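To give a feel for what the TensorFlow route looks like in practice, here is a minimal, hedged sketch of classifying an image with a pretrained network through the tf.keras API. It assumes a recent TensorFlow 2.x install (which postdates this article) and a local image file, cat.jpg, used purely as a placeholder; it is not taken from the original post.

    import numpy as np
    import tensorflow as tf

    # Load a network pretrained on ImageNet (weights are downloaded on first use).
    model = tf.keras.applications.MobileNetV2(weights="imagenet")

    # Load and preprocess a local image; "cat.jpg" is a placeholder path.
    img = tf.keras.preprocessing.image.load_img("cat.jpg", target_size=(224, 224))
    x = tf.keras.preprocessing.image.img_to_array(img)
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x[np.newaxis, ...])

    # Predict and decode the top three ImageNet labels.
    preds = model.predict(x)
    for _, label, score in tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=3)[0]:
        print(label, round(float(score), 3))

A dozen lines like these replace a hand-written classifier, which is exactly the trade-off described above: great power, but you still need to understand the machine learning underneath.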
CUDA: CUDA is a platform for parallel computing, invented by NVIDIA. It enables great boosts in computing performance by leveraging the power of GPUs. The CUDA Toolkit includes the NVIDIA Performance Primitives library, which is a collection of signal, image, and video processing functions. If you have large, GPU-intensive images to process, you can choose to use CUDA. CUDA is easy to program and is quite efficient and fast. On the downside, it is extremely high on power consumption, and you will find yourself reformulating for memory distribution in parallel tasks.

SimpleCV: SimpleCV is a framework for building computer vision applications. It gives you access to a multitude of computer vision tools, such as OpenCV, pygame, etc. If you don't want to get into the depths of image processing and just want to get your work done, this is the tool to get your hands on. If you want to do some quick prototyping, SimpleCV will serve you best. However, if your intention is to use it in heavy production environments, you cannot expect it to perform on the level of OpenCV. Moreover, the community forum is not very active and you might find yourself running into walls, especially with the installation.

GPUImage: GPUImage is a framework, or rather an iOS library, that allows you to apply GPU-accelerated effects and filters to images, live motion video, and movies. It is built on OpenGL ES 2.0. Running custom filters on a GPU calls for a lot of code to set up and maintain; GPUImage cuts down on all of that boilerplate and gets the job done for you.

Computer Vision as a Service

Google Cloud and Mobile Vision APIs: Google Cloud Vision API enables developers to perform image processing by encapsulating powerful machine learning models in a simple REST API that can be called in an application. Also, its Optical Character Recognition (OCR) functionality enables you to detect text in your images. The Mobile Vision API lets you detect objects in photos and video, using real-time on-device vision technology. It also lets you scan and recognise barcodes and text.

Amazon Rekognition: Amazon Rekognition is a deep learning-based image and video analysis service that makes adding image and video analysis to your applications a piece of cake. The service can identify objects, text, people, scenes, and activities, and it can also detect inappropriate content, apart from providing highly accurate facial analysis and facial recognition for sentiment analysis.

Microsoft Azure Computer Vision API: Microsoft's API is quite similar to its peers and allows you to analyse images, read text in them, and analyse video in near-real time. You can also flag adult content, generate thumbnails of images, and recognise handwriting.

Bonus: SciPy and NumPy: I thought I'd add these in as well, since I've seen quite a few developers use Python to build computer vision applications (without OpenCV, that is). SciPy and NumPy are powerful enough to perform image processing. scikit-image is a Python package dedicated to image processing, which uses native NumPy and SciPy arrays as image objects. Moreover, you get to use the cool IPython interactive computing environment, and you can also choose to include OpenCV if you want to do some more hardcore image processing.
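As a small illustration of the SciPy/NumPy/scikit-image route just mentioned, here is a hedged sketch that runs on the sample image bundled with scikit-image, so no external file is needed (it assumes scikit-image is installed and is not from the original post):

    from skimage import data, filters

    # Load a sample grayscale image bundled with scikit-image.
    camera = data.camera()

    # Edge detection with a Sobel filter, then a simple Otsu threshold.
    edges = filters.sobel(camera)
    threshold = filters.threshold_otsu(camera)
    binary = camera > threshold

    print(camera.shape, float(edges.max()), threshold)

Because the image is just a NumPy array, everything else in the scientific Python stack (plotting, slicing, statistics) works on it directly.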
Well, there you have it: these were the top tools for computer vision and image processing. Head on over and check out these resources to get working with some of the top tools used in the industry.


What is PyTorch and how does it work?

Sunith Shetty
18 Sep 2018
7 min read
PyTorch is a Python-based scientific computing package that uses the power of graphics processing units. It is also one of the preferred deep learning research platforms, built to provide maximum flexibility and speed. It is known for providing two high-level features: tensor computation with strong GPU acceleration, and the ability to build deep neural networks on a tape-based autograd system. There are many existing Python libraries which have the potential to change how deep learning and artificial intelligence are performed, and this is one such library. One of the key reasons behind PyTorch's success is that it is completely Pythonic and one can build neural network models effortlessly. It is still a young player compared to its competitors; however, it is gaining momentum fast.

A brief history of PyTorch

Since its release in January 2016, many researchers have continued to increasingly adopt PyTorch. It has quickly become a go-to library because of its ease in building extremely complex neural networks. It is giving tough competition to TensorFlow, especially when used for research work. However, there is still some time before it is adopted by the masses, due to its still "new" and "under construction" tags. PyTorch's creators envisioned this library to be highly imperative, allowing all the numerical computations to run quickly. This is an ideal methodology which fits perfectly with the Python programming style. It has allowed deep learning scientists, machine learning developers, and neural network debuggers to run and test parts of their code in real time, so they don't have to wait for the entire code to be executed to check whether it works or not. You can always use your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch functionalities and services when required.

Now you might ask, why PyTorch? What's so special about using it to build deep learning models? The answer is quite simple: PyTorch is a dynamic library (very flexible, usable as per your requirements and changes) which is currently adopted by many researchers, students, and artificial intelligence developers. In a recent Kaggle competition, the PyTorch library was used by nearly all of the top 10 finishers. Some of the key highlights of PyTorch include:

Simple interface: It offers an easy-to-use API, so it is very simple to operate and run, like Python.
Pythonic in nature: This library, being Pythonic, smoothly integrates with the Python data science stack. Thus it can leverage all the services and functionalities offered by the Python environment.
Computational graphs: In addition to this, PyTorch provides an excellent platform which offers dynamic computational graphs, so you can change them during runtime. This is highly useful when you have no idea how much memory will be required for creating a neural network model.

PyTorch community

The PyTorch community is growing in numbers on a daily basis. In just a short year and a half, it has shown a great amount of development that has led to its citation in many research papers and groups. More and more people are bringing PyTorch into their artificial intelligence research labs to build quality-driven deep learning models. The interesting fact is that PyTorch is still in early-release beta, but the way everyone is adopting this deep learning framework at a brisk pace shows its real potential and power in the community. Even though it is in beta release, there are 741 contributors on the official GitHub repository working on enhancing and improving the existing PyTorch functionalities. PyTorch isn't limited to specific applications because of its flexibility and modular design. It has seen heavy use by leading tech giants such as Facebook, Twitter, NVIDIA, and Uber in multiple research domains such as NLP, machine translation, image recognition, neural networks, and other key areas.
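The two headline features mentioned above, GPU-ready tensors and the tape-based autograd system, look roughly like this in practice. This is a minimal sketch, not from the original article, assuming a recent PyTorch install; the CUDA branch is optional:

    import torch

    # Tensors behave much like NumPy arrays and can be moved to a GPU if one is available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(3, 3, device=device)
    b = torch.rand(3, 3, device=device)
    c = a @ b  # executed immediately -- the graph is built as the code runs

    # Tape-based autograd: operations on tensors with requires_grad=True are recorded,
    # and backward() replays the tape to compute gradients.
    x = torch.tensor([2.0, 3.0], requires_grad=True)
    y = (x ** 2).sum()
    y.backward()
    print(c.shape, x.grad)  # x.grad is tensor([4., 6.])

Nothing here is deferred to a separate compilation step, which is what the "dynamic computational graphs" highlight refers to.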
Why use PyTorch in research?

Anyone working in the field of deep learning and artificial intelligence has likely worked with TensorFlow, Google's popular open source library, before. However, the newer deep learning framework, PyTorch, solves major problems for research work. Arguably, PyTorch is TensorFlow's biggest competitor to date, and it is currently a much-favored deep learning and artificial intelligence library in the research community.

Dynamic computational graphs: PyTorch avoids the static graphs used in frameworks such as TensorFlow, allowing developers and researchers to change how the network behaves on the fly. Early adopters prefer PyTorch because it is more intuitive to learn when compared to TensorFlow.

Different back-end support: PyTorch uses different backends for CPU, GPU, and various functional features rather than a single back-end. It uses the tensor backend TH for CPU and THC for GPU, while neural network backends such as THNN and THCUNN serve CPU and GPU respectively. Using separate backends makes it very easy to deploy PyTorch on constrained systems.

Imperative style: The PyTorch library is specially designed to be intuitive and easy to use. When you execute a line of code, it runs immediately, allowing you to track in real time how your neural network models are built. Because of its excellent imperative architecture and fast, lean approach, overall PyTorch adoption in the community has increased.

Highly extensible: PyTorch is deeply integrated with C++ code, and it shares some of its C++ backend with the deep learning framework Torch. Users can thus program in C/C++ by using an extension API based on cFFI for Python, compiled for CPU or GPU operation. This has extended PyTorch usage to new and experimental use cases, making it a preferable choice for research use.

Python approach: PyTorch is a native Python package by design. Its functionalities are built as Python classes, hence all its code can seamlessly integrate with Python packages and modules. Similar to NumPy, this Python-based library enables GPU-accelerated tensor computations and provides a rich set of APIs for neural network applications. PyTorch provides a complete end-to-end research framework which comes with the most common building blocks for carrying out everyday deep learning research. It allows chaining of high-level neural network modules because it supports a Keras-like API in its torch.nn package.
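A minimal sketch of the imperative style and the torch.nn building blocks just described: a toy regression model trained for a few steps. The random data stands in for a real dataset, and the layer sizes and learning rate are illustrative assumptions, not from the original article:

    import torch
    import torch.nn as nn

    # Random data standing in for a real dataset: 64 samples, 10 features, 1 target.
    inputs = torch.randn(64, 10)
    targets = torch.randn(64, 1)

    # A small feedforward network built from torch.nn modules.
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        optimizer.zero_grad()                      # clear gradients from the previous step
        loss = loss_fn(model(inputs), targets)     # forward pass runs eagerly
        loss.backward()                            # autograd computes gradients
        optimizer.step()                           # update the weights
        if step % 25 == 0:
            print(step, loss.item())               # inspect the loss as the loop runs

Because every line runs as it is reached, you can drop a print or a debugger breakpoint anywhere in the loop and inspect intermediate tensors, which is the real-time tracking the article refers to.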
PyTorch 1.0: the path from research to production

We have been discussing all the strengths PyTorch offers, and how these make it a go-to library for research work. However, one of its biggest downsides has been its poor production support, and this is expected to change soon. PyTorch 1.0 is expected to be a major release which will overcome the challenges developers face in production. This new iteration of the framework will merge Python-based PyTorch with Caffe2, allowing machine learning developers and deep learning researchers to move from research to production in a hassle-free way, without the need to deal with any migration challenges. The new version 1.0 will unify research and production capabilities in one framework, thus providing the required flexibility and performance optimization for research and production. This new version promises to handle the tasks one has to deal with while running deep learning models efficiently at massive scale. Along with the production support, PyTorch 1.0 will bring more usability and optimization improvements. With PyTorch 1.0, your existing code will continue to work as-is; there won't be any changes to the existing API. If you want to stay updated with all the progress on the PyTorch library, you can visit the Pull Requests page. The beta release of this long-awaited version is expected later this year, and major vendors like Microsoft and Amazon are expected to provide complete support for the framework across their cloud products.

Summing up, PyTorch is a compelling player in the field of deep learning and artificial intelligence libraries, exploiting its unique niche as a research-first library. It overcomes the challenges and provides the necessary performance to get the job done. If you're a mathematician, researcher, or student who is inclined to learn how deep learning is performed, PyTorch is an excellent choice as your first deep learning framework to learn.

Read more:
Can a production-ready PyTorch 1.0 give TensorFlow a tough time?
A new geometric deep learning extension library for PyTorch releases!
Top 5 tools for reinforcement learning


Do you write Python Code or Pythonic Code?

Aaron Lazar
08 Aug 2018
5 min read
If you're new to programming, and Python in particular, you might have heard the term Pythonic being brought up at tech conferences, meetups, and even at your own office. You might also have wondered what the term means and whether people are just talking about writing Python code. Here we're going to understand what the term Pythonic means and why you should be interested in learning how to not just write Python code, but write Pythonic code.

What does Pythonic mean?

When people talk about pythonic code, they mean that the code uses Python idioms well, that it's natural or displays fluency in the language. In other words, it means following the idioms most widely adopted by the Python community. If someone said you are writing un-pythonic code, they might actually mean that you are attempting to write Java/C++ code in Python, disregarding the Python idioms and performing a rough transcription rather than an idiomatic translation from the other language. Okay, now that you have a theoretical idea of what Pythonic (and unpythonic) means, let's have a look at some Pythonic code in practice.

Writing Pythonic code

Before we get into some examples, you might be wondering if there's a defined way/method of writing Pythonic code. Well, there is, and it's called PEP 8. It's the official style guide for Python.

Example #1

    x = [1, 2, 3, 4, 5, 6]
    result = []
    for idx in range(len(x)):
        result.append(x[idx] * 2)
    result

Output: [2, 4, 6, 8, 10, 12]

Consider the above code, where you're trying to multiply the elements of "x" by 2. What we did here was create an empty list to store the results, and then append the result of each computation to it. The result now contains each of the elements multiplied by 2. Now, if you were to write the same code in a Pythonic way, you might want to simply use a list comprehension. Here's how:

    x = [1, 2, 3, 4, 5, 6]
    [(element * 2) for element in x]

Output: [2, 4, 6, 8, 10, 12]

You might have noticed, we skipped the entire for loop!

Example #2

Let's make the previous example a bit more complex, and place a condition that the elements should be multiplied by 2 only if they are even.

    x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    result = []
    for idx in range(len(x)):
        if x[idx] % 2 == 0:
            result.append(x[idx] * 2)
        else:
            result.append(x[idx])
    result

Output: [1, 4, 3, 8, 5, 12, 7, 16, 9, 20]

We've created an if-else statement to solve this problem, but there is a simpler, more Pythonic way of doing it:

    [(element * 2 if element % 2 == 0 else element) for element in x]

Output: [1, 4, 3, 8, 5, 12, 7, 16, 9, 20]

If you notice what we've done here, apart from skipping multiple lines of code, we've used the if-else expression inline. Now, if you wanted to perform filtering, you could do this:

    x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    [element * 2 for element in x if element % 2 == 0]

Output: [4, 8, 12, 16, 20]

What we've done here is put the if condition after the for declaration, and voila! We've achieved filtering. If you're using a nice IDE like Jupyter Notebooks or PyCharm, it will help you format your code as per the PEP 8 suggestions.

Why should you write Pythonic code?

Well, firstly, you're saving loads of time by not writing humongous piles of cowdung code, so you're obviously becoming a smarter and more productive programmer. Python is a pretty slow language, and when you write Python the way you would write Java or C++, you're going to make things worse. With idiomatic, Pythonic code, you're improving the speed of your programs. Moreover, idiomatic code is far easier to comprehend and understand for other developers who are working on the same code. It helps a great deal when you're trying to refactor someone else's code.
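One more illustration in the same spirit (not from the original article): replacing index-based loops with zip and enumerate is another idiom the Python community favours, for exactly the readability reasons described above.

    names = ["ada", "grace", "linus"]
    scores = [95, 98, 91]

    # Un-Pythonic: looping over indices.
    for i in range(len(names)):
        print(names[i], scores[i])

    # Pythonic: let the iteration protocol do the work.
    for name, score in zip(names, scores):
        print(name, score)

    # Pythonic: enumerate when you genuinely need the index.
    for position, name in enumerate(names, start=1):
        print(position, name)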
Fearing Pythonic idioms

Well, I don't mean the idioms themselves are scary. Rather, quite a few developers and organisations have begun discriminating on the basis of whether someone can or cannot write Pythonic code. This is wrong because, at the end of the day, even though PEP 8 exists, the idea of what is Pythonic differs from person to person. To some it might mean picking up a new style guide and improving the way you code. To others, it might mean being succinct and not repeating themselves. It's time we stopped judging people on whether they can or can't write Pythonic code, and instead appreciated when someone is able to present readable, easily maintainable, and succinct code. If you find them writing a bit of clumsy code, you can choose to talk to them about improving their design considerations. And the world will be a better place! If you're interested in learning how to write more succinct and concise Python code, check out these resources:
Learning Python Design Patterns - Second Edition
Python Design Patterns [Video]
Python Tips, Tricks and Techniques [Video]


6 most commonly used Java Machine learning libraries

Fatema Patrawala
10 Sep 2018
15 min read
There are over 70 Java-based open source machine learning projects listed on the MLOSS.org website, and probably many more unlisted projects live on university servers, GitHub, or Bitbucket. In this article, we will review the major machine learning libraries and platforms in Java, the kinds of problems they can solve, the algorithms they support, and the kinds of data they can work with. This article is an excerpt taken from Machine Learning in Java, written by Bostjan Kaluza and published by Packt Publishing Ltd.

Weka

Weka, which is short for Waikato Environment for Knowledge Analysis, is a machine learning library developed at the University of Waikato, New Zealand, and is probably the most well-known Java library. It is a general-purpose library that is able to solve a wide variety of machine learning tasks, such as classification, regression, and clustering. It features a rich graphical user interface, a command-line interface, and a Java API. You can check out Weka at http://www.cs.waikato.ac.nz/ml/weka/. At the time of writing this book, Weka contains 267 algorithms in total: data pre-processing (82), attribute selection (33), classification and regression (133), clustering (12), and association rules mining (7). Graphical interfaces are well-suited for exploring your data, while the Java API allows you to develop new machine learning schemes and use the algorithms in your applications. Weka is distributed under the GNU General Public License (GNU GPL), which means that you can copy, distribute, and modify it as long as you track changes in source files and keep it under GNU GPL. You can even distribute it commercially, but you must disclose the source code or obtain a commercial license.

In addition to several supported file formats, Weka features its own default data format, ARFF, to describe data by attribute-data pairs. It consists of two parts. The first part contains the header, which specifies all the attributes (that is, features) and their types, for instance, nominal, numeric, date, and string. The second part contains the data, where each line corresponds to an instance. The last attribute in the header is implicitly considered the target variable, and missing data are marked with a question mark. For example, the Bob instance written in the ARFF file format would be as follows:

    @RELATION person_dataset
    @ATTRIBUTE `Name` STRING
    @ATTRIBUTE `Height` NUMERIC
    @ATTRIBUTE `Eye color` {blue, brown, green}
    @ATTRIBUTE `Hobbies` STRING
    @DATA
    'Bob', 185.0, blue, 'climbing, sky diving'
    'Anna', 163.0, brown, 'reading'
    'Jane', 168.0, ?, ?

The file consists of three sections. The first section starts with the @RELATION <String> keyword, specifying the dataset name. The next section starts with the @ATTRIBUTE keyword, followed by the attribute name and type. The available types are STRING, NUMERIC, DATE, and a set of categorical values. The last attribute is implicitly assumed to be the target variable that we want to predict. The last section starts with the @DATA keyword, followed by one instance per line. Instance values are separated by commas and must follow the same order as the attributes in the second section.

Weka's Java API is organized in the following top-level packages:

weka.associations: These are data structures and algorithms for association rules learning, including Apriori, predictive apriori, FilteredAssociator, FP-Growth, Generalized Sequential Patterns (GSP), Hotspot, and Tertius.
weka.classifiers: These are supervised learning algorithms, evaluators, and data structures.
The package is further split into the following components:
weka.classifiers.bayes: This implements Bayesian methods, including naive Bayes, Bayes net, Bayesian logistic regression, and so on.
weka.classifiers.evaluation: These are supervised evaluation algorithms for nominal and numerical prediction, such as evaluation statistics, confusion matrix, ROC curve, and so on.
weka.classifiers.functions: These are regression algorithms, including linear regression, isotonic regression, Gaussian processes, support vector machine, multilayer perceptron, voted perceptron, and others.
weka.classifiers.lazy: These are instance-based algorithms such as k-nearest neighbors, K*, and lazy Bayesian rules.
weka.classifiers.meta: These are supervised learning meta-algorithms, including AdaBoost, bagging, additive regression, random committee, and so on.
weka.classifiers.mi: These are multiple-instance learning algorithms, such as citation k-nn, diverse density, MI AdaBoost, and others.
weka.classifiers.rules: These are decision tables and decision rules based on the separate-and-conquer approach, Ripper, Part, Prism, and so on.
weka.classifiers.trees: These are various decision tree algorithms, including ID3, C4.5, M5, functional tree, logistic tree, random forest, and so on.
weka.clusterers: These are clustering algorithms, including k-means, Clope, Cobweb, DBSCAN, hierarchical clustering, and farthest-first.
weka.core: These are various utility classes, data presentations, configuration files, and so on.
weka.datagenerators: These are data generators for classification, regression, and clustering algorithms.
weka.estimators: These are various data distribution estimators for discrete/nominal domains, conditional probability estimations, and so on.
weka.experiment: These are a set of classes supporting the configuration, datasets, model setups, and statistics necessary to run experiments.
weka.filters: These are attribute-based and instance-based selection algorithms for both supervised and unsupervised data preprocessing.
weka.gui: These are graphical interfaces implementing the explorer, experimenter, and knowledge flow applications. Explorer allows you to investigate datasets and algorithms, as well as their parameters, and visualize datasets with scatter plots and other visualizations. Experimenter is used to design batches of experiments, but it can only be used for classification and regression problems. Knowledge flow implements a visual drag-and-drop user interface to build data flows, for example: load data, apply filter, build classifier, and evaluate.

Java-ML for machine learning

Java machine learning library, or Java-ML, is a collection of machine learning algorithms with a common interface for algorithms of the same type. It only features a Java API, and is therefore primarily aimed at software engineers and programmers. Java-ML contains algorithms for data preprocessing, feature selection, classification, and clustering. In addition, it features several Weka bridges to access Weka's algorithms directly through the Java-ML API. It can be downloaded from http://java-ml.sourceforge.net, where the latest release was in 2012 (at the time of writing this book). Java-ML is also a general-purpose machine learning library.
Compared to Weka, it offers more consistent interfaces and implementations of recent algorithms that are not present in other packages, such as an extensive set of state-of-the-art similarity measures and feature-selection techniques, for example, dynamic time warping, random forest attribute evaluation, and so on. Java-ML is also available under the GNU GPL license. Java-ML supports any type of file as long as it contains one data sample per line and the features are separated by a symbol such as a comma, semi-colon, or tab.

The library is organized around the following top-level packages:

net.sf.javaml.classification: These are classification algorithms, including naive Bayes, random forests, bagging, self-organizing maps, k-nearest neighbors, and so on.
net.sf.javaml.clustering: These are clustering algorithms such as k-means, self-organizing maps, spatial clustering, Cobweb, AQBC, and others.
net.sf.javaml.core: These are classes representing instances and datasets.
net.sf.javaml.distance: These are algorithms that measure instance distance and similarity, for example, Chebyshev distance, cosine distance/similarity, Euclidean distance, Jaccard distance/similarity, Mahalanobis distance, Manhattan distance, Minkowski distance, Pearson correlation coefficient, Spearman's footrule distance, dynamic time warping (DTW), and so on.
net.sf.javaml.featureselection: These are algorithms for feature evaluation, scoring, selection, and ranking, for instance, gain ratio, ReliefF, Kullback-Leibler divergence, symmetrical uncertainty, and so on.
net.sf.javaml.filter: These are methods for manipulating instances by filtering, removing attributes, setting classes or attribute values, and so on.
net.sf.javaml.matrix: This implements in-memory or file-based arrays.
net.sf.javaml.sampling: This implements sampling algorithms to select a subset of a dataset.
net.sf.javaml.tools: These are utility methods for dataset and instance manipulation, serialization, the Weka API interface, and so on.
net.sf.javaml.utils: These are utility methods for algorithms, for example, statistics, math methods, contingency tables, and others.

Apache Mahout

The Apache Mahout project aims to build a scalable machine learning library. It is built atop scalable, distributed architectures such as Hadoop, using the MapReduce paradigm, which is an approach for processing and generating large datasets with a parallel, distributed algorithm using a cluster of servers. Mahout features a console interface and a Java API with scalable algorithms for clustering, classification, and collaborative filtering. It is able to solve three business problems: item recommendation, for example, recommending items such as "people who liked this movie also liked…"; clustering, for example, of text documents into groups of topically related documents; and classification, for example, learning which topic to assign to an unlabeled document. Mahout is distributed under a commercially friendly Apache License, which means that you can use it as long as you keep the Apache license included and display it in your program's copyright notice.
Mahout features the following libraries:

org.apache.mahout.cf.taste: These are collaborative filtering algorithms based on user-based and item-based collaborative filtering and matrix factorization with ALS.
org.apache.mahout.classifier: These are in-memory and distributed implementations, including logistic regression, naive Bayes, random forest, hidden Markov models (HMM), and multilayer perceptron.
org.apache.mahout.clustering: These are clustering algorithms such as canopy clustering, k-means, fuzzy k-means, streaming k-means, and spectral clustering.
org.apache.mahout.common: These are utility methods for algorithms, including distances, MapReduce operations, iterators, and so on.
org.apache.mahout.driver: This implements a general-purpose driver to run the main methods of other classes.
org.apache.mahout.ep: This is evolutionary optimization using the recorded-step mutation.
org.apache.mahout.math: These are various math utility methods and implementations in Hadoop.
org.apache.mahout.vectorizer: These are classes for data presentation, manipulation, and MapReduce jobs.

Apache Spark

Apache Spark, or simply Spark, is a platform for large-scale data processing built atop Hadoop, but, in contrast to Mahout, it is not tied to the MapReduce paradigm. Instead, it uses in-memory caches to extract a working set of data, process it, and repeat the query. This is reported to be up to ten times as fast as a Mahout implementation that works directly with disk-stored data. It can be grabbed from https://spark.apache.org. There are many modules built atop Spark, for instance, GraphX for graph processing, Spark Streaming for processing real-time data streams, and MLlib, a machine learning library featuring classification, regression, collaborative filtering, clustering, dimensionality reduction, and optimization.

Spark's MLlib can use a Hadoop-based data source, for example, Hadoop Distributed File System (HDFS) or HBase, as well as local files. The supported data types include the following:

Local vector: This is stored on a single machine. Dense vectors are presented as an array of double-typed values, for example, (2.0, 0.0, 1.0, 0.0); a sparse vector is presented by the size of the vector, an array of indices, and an array of values, for example, [4, (0, 2), (2.0, 1.0)].
Labeled point: This is used for supervised learning algorithms and consists of a local vector labeled with double-typed class values. The label can be a class index, a binary outcome, or a list of multiple class indices (multiclass classification). For example, a labeled dense vector is presented as [1.0, (2.0, 0.0, 1.0, 0.0)].
Local matrix: This stores a dense matrix on a single machine. It is defined by the matrix dimensions and a single double array arranged in column-major order.
Distributed matrix: This operates on data stored in Spark's Resilient Distributed Dataset (RDD), which represents a collection of elements that can be operated on in parallel. There are three presentations: row matrix, where each row is a local vector that can be stored on a single machine and row indices are meaningless; indexed row matrix, which is similar to row matrix, but the row indices are meaningful, that is, rows can be identified and joins can be executed; and coordinate matrix, which is used when a row cannot be stored on a single machine and the matrix is very sparse.
Spark's MLlib API library provides interfaces to various learning algorithms and utilities as outlined in the following list: org.apache.spark.mllib.classification: These are binary and multiclass classification algorithms, including linear SVMs, logistic regression, decision trees, and naive Bayes org.apache.spark.mllib.clustering: These are k-means clustering org.apache.spark.mllib.linalg: These are data presentations, including dense vectors, sparse vectors, and matrices org.apache.spark.mllib.optimization: These are the various optimization algorithms used as low-level primitives in MLlib, including gradient descent, stochastic gradient descent, update schemes for distributed SGD, and limited-memory BFGS org.apache.spark.mllib.recommendation: These are model-based collaborative filtering implemented with alternating least squares matrix factorization org.apache.spark.mllib.regression: These are regression learning algorithms, such as linear least squares, decision trees, Lasso, and Ridge regression org.apache.spark.mllib.stat: These are statistical functions for samples in sparse or dense vector format to compute the mean, variance, minimum, maximum, counts, and nonzero counts org.apache.spark.mllib.tree: This implements classification and regression decision tree-learning algorithms org.apache.spark.mllib.util: These are a collection of methods to load, save, preprocess, generate, and validate the data Deeplearning4j Deeplearning4j, or DL4J, is a deep-learning library written in Java. It features a distributed as well as a single-machinedeep-learning framework that includes and supports various neural network structures such as feedforward neural networks, RBM, convolutional neural nets, deep belief networks, autoencoders, and others. DL4J can solve distinct problems, such as identifying faces, voices, spam or e-commerce fraud. Deeplearning4j is also distributed under Apache 2.0 license and can be downloaded from http://deeplearning4j.org. The library is organized as follows: org.deeplearning4j.base: These are loading classes org.deeplearning4j.berkeley: These are math utility methods org.deeplearning4j.clustering: This is the implementation of k-means clustering org.deeplearning4j.datasets: This is dataset manipulation, including import, creation, iterating, and so on org.deeplearning4j.distributions: These are utility methods for distributions org.deeplearning4j.eval: These are evaluation classes, including the confusion matrix org.deeplearning4j.exceptions: This implements exception handlers org.deeplearning4j.models: These are supervised learning algorithms, including deep belief network, stacked autoencoder, stacked denoising autoencoder, and RBM org.deeplearning4j.nn: These are the implementation of components and algorithms based on neural networks, such as neural network, multi-layer network, convolutional multi-layer network, and so on org.deeplearning4j.optimize: These are neural net optimization algorithms, including back propagation, multi-layer optimization, output layer optimization, and so on org.deeplearning4j.plot: These are various methods for rendering data org.deeplearning4j.rng: This is a random data generator org.deeplearning4j.util: These are helper and utility methods MALLET Machine Learning for Language Toolkit (MALLET), is a large library of natural language processing algorithms and utilities. It can be used in a variety of tasks such as document classification, document clustering, information extraction, and topic modeling. 
MALLET features a command-line interface as well as a Java API for several algorithms such as naive Bayes, HMM, latent Dirichlet topic models, logistic regression, and conditional random fields. MALLET is available under the Common Public License 1.0, which means that you can even use it in commercial applications. It can be downloaded from http://mallet.cs.umass.edu. A MALLET instance is represented by a name, label, data, and source. There are two methods to import data into the MALLET format, as shown in the following list:

Instance per file: Each file, that is, each document, corresponds to an instance, and MALLET accepts the directory name for the input.
Instance per line: Each line corresponds to an instance, where the following format is assumed: instance_name label token. The data will be a feature vector, consisting of distinct words that appear as tokens and their occurrence counts.

The library comprises the following packages:

cc.mallet.classify: These are algorithms for training and classifying instances, including AdaBoost, bagging, C4.5, as well as other decision tree models, multivariate logistic regression, naive Bayes, and Winnow2.
cc.mallet.cluster: These are unsupervised clustering algorithms, including greedy agglomerative, hill climbing, k-best, and k-means clustering.
cc.mallet.extract: This implements tokenizers, document extractors, document viewers, cleaners, and so on.
cc.mallet.fst: This implements sequence models, including conditional random fields, HMM, maximum entropy Markov models, and corresponding algorithms and evaluators.
cc.mallet.grmm: This implements graphical models and factor graphs, such as inference algorithms, learning, and testing, for example, loopy belief propagation, Gibbs sampling, and so on.
cc.mallet.optimize: These are optimization algorithms for finding the maximum of a function, such as gradient ascent, limited-memory BFGS, stochastic meta ascent, and so on.
cc.mallet.pipe: These are methods used as pipelines to process data into MALLET instances.
cc.mallet.topics: These are topic modeling algorithms, such as latent Dirichlet allocation, four-level pachinko allocation, hierarchical PAM, DMRT, and so on.
cc.mallet.types: This implements fundamental data types such as dataset, feature vector, instance, and label.
cc.mallet.util: These are miscellaneous utility functions such as command-line processing, search, math, test, and so on.

To design, build, and deploy your own machine learning applications by leveraging key Java machine learning libraries, check out the book Machine Learning in Java, published by Packt Publishing.

Read more:
5 JavaScript machine learning libraries you need to know
A non-programmer's guide to learning Machine learning
Why use JavaScript for machine learning?


Top 7 libraries for geospatial analysis

Aarthi Kumaraswamy
22 May 2018
12 min read
The term geospatial refers to information that is located on the earth's surface. This can include, for example, the position of a cellphone tower, the shape of a road, or the outline of a country. Geospatial data often associates some piece of information with a particular location. Geospatial development is the process of writing computer programs that can access, manipulate, and display this type of information. Internally, geospatial data is represented as a series of coordinates, often in the form of latitude and longitude values. Additional attributes, such as temperature, soil type, height, or the name of a landmark, are also often present. There can be many thousands (or even millions) of data points for a single set of geospatial data. In addition to the prosaic tasks of importing geospatial data from various external file formats and translating data from one projection to another, geospatial data can also be manipulated to solve various interesting problems. Obvious examples include calculating the distance between two points, calculating the length of a road, or finding all data points within a given radius of a selected point. We use libraries to solve all of these problems and more. Today we will look at the major libraries used to process and analyze geospatial data:

GDAL/OGR
GEOS
Shapely
Fiona
Python Shapefile Library (pyshp)
pyproj
Rasterio
GeoPandas

This is an excerpt from the book Mastering Geospatial Analysis with Python by Paul Crickard, Eric van Rees, and Silas Toms.

Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library

The Geospatial Data Abstraction Library (GDAL)/OGR Simple Features Library combines two separate libraries that are generally downloaded together as GDAL. This means that installing the GDAL package also gives access to OGR functionality. The reason GDAL is covered first is that other packages were written after GDAL, so chronologically, it comes first. As you will notice, some of the packages covered in this post extend GDAL's functionality or use it under the hood.

GDAL was created in the 1990s by Frank Warmerdam and saw its first release in June 2000. Later, the development of GDAL was transferred to the Open Source Geospatial Foundation (OSGeo). Technically, GDAL is a little different from your average Python package, as the GDAL package itself was written in C and C++, meaning that in order to be able to use it in Python, you need to compile GDAL and its associated Python bindings. However, using conda and Anaconda makes it relatively easy to get started quickly. Because it was written in C and C++, the online GDAL documentation is written in the C++ version of the libraries. For Python developers, this can be challenging, but many functions are documented and can be consulted with the built-in pydoc utility, or by using the help function within Python. Because of its history, working with GDAL in Python also feels a lot like working in C++ rather than pure Python. For example, naming conventions in OGR are different from Python's, since you use uppercase for functions instead of lowercase. These differences explain the choice of some of the other Python libraries covered in this post, such as Rasterio and Shapely, which have been written from a Python developer's perspective but offer the same GDAL functionality. GDAL is a massive and widely used data library for raster data.
It supports the reading and writing of many raster file formats, with the latest version supporting up to 200 different file formats. Because of this, it is indispensable for geospatial data management and analysis. Used together with other Python libraries, GDAL enables some powerful remote sensing functionality. It's also an industry standard and is present in commercial and open source GIS software. The OGR library is used to read and write vector-format geospatial data, supporting reading and writing in many different formats. OGR uses a consistent model to be able to manage many different vector data formats. You can use OGR to do vector reprojection, vector data format conversion, vector attribute data filtering, and more. The GDAL/OGR libraries are not only useful for Python programmers but are also used by many GIS vendors and open source projects. The latest GDAL version at the time of writing is 2.2.4, which was released in March 2018.

GEOS

The Geometry Engine Open Source (GEOS) is the C/C++ port of a subset of the Java Topology Suite (JTS) and selected functions. GEOS aims to contain the complete functionality of JTS in C++. It can be compiled on many platforms, including for use from Python. As you will see later on, the Shapely library uses functions from the GEOS library. In fact, there are many applications using GEOS, including PostGIS and QGIS. GeoDjango also uses GEOS, as well as GDAL, among other geospatial libraries. GEOS can also be compiled with GDAL, giving OGR all of its capabilities. The JTS is an open source geospatial computational geometry library written in Java. It provides various functionalities, including a geometry model, geometric functions, spatial structures and algorithms, and I/O capabilities. Using GEOS, you have access to the following capabilities: geospatial functions (such as within and contains), geospatial operations (union, intersection, and many more), spatial indexing, Open Geospatial Consortium (OGC) well-known text (WKT) and well-known binary (WKB) input/output, the C and C++ APIs, and thread safety.

Shapely

Shapely is a Python package for manipulation and analysis of planar features, using functions from the GEOS library (the engine of PostGIS) and a port of the JTS. Shapely is not concerned with data formats or coordinate systems, but can be readily integrated with packages that are. Shapely only deals with analyzing geometries and offers no capabilities for reading and writing geospatial files. It was developed by Sean Gillies, who was also the person behind Fiona and Rasterio. Shapely supports eight fundamental geometry types that are implemented as classes in the shapely.geometry module: points, multipoints, linestrings, multilinestrings, linearrings, multipolygons, polygons, and geometry collections. Apart from representing these geometries, Shapely can be used to manipulate and analyze geometries through a number of methods and attributes. Shapely has mainly the same classes and functions as OGR for dealing with geometries. The difference between Shapely and OGR is that Shapely has a more Pythonic and very intuitive interface, is better optimized, and has well-developed documentation. With Shapely, you're writing pure Python, whereas with GEOS, you're writing C++ in Python. For data munging, a term used for data management and analysis, you're better off writing in pure Python rather than C++, which explains why these libraries were created. For more information on Shapely, consult the documentation. That page also has detailed information on installing Shapely for different platforms, and on how to build Shapely from source for compatibility with other modules that depend on GEOS; this refers to the fact that installing Shapely will require you to upgrade NumPy and GEOS if these are already installed.
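A small, hedged sketch of the kind of geometry work Shapely is used for, assuming Shapely is installed; the coordinates are made up and the example is not from the original excerpt:

    from shapely.geometry import Point, Polygon

    # A simple square polygon and a point, built from plain coordinate tuples.
    square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
    point = Point(1, 1)

    # Geometric predicates and operations -- no file I/O or coordinate systems involved.
    print(square.contains(point))                       # True
    print(square.area)                                  # 4.0
    print(square.intersection(point.buffer(0.5)).area)  # area of the overlap

Note that, as the text says, Shapely stops at geometry: reading these shapes from a shapefile or reprojecting them is left to libraries such as Fiona and pyproj, covered next.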
The documentation page also has detailed information on installing Shapely for different platforms, and on how to build Shapely from source for compatibility with other modules that depend on GEOS. This refers to the fact that installing Shapely may require you to upgrade NumPy and GEOS if these are already installed.
Fiona
Fiona is a Pythonic API built on top of OGR. It can be used for reading and writing a number of vector data formats. The main reason for using it instead of OGR is that it's closer to Python than OGR, as well as more dependable and less error-prone. It makes use of two OGC standards, WKT and WKB, for representing spatial information in vector data. As such, it combines well with other Python libraries such as Shapely: you would use Fiona for input and output, and Shapely for creating and manipulating geospatial data. While Fiona is Python compatible and our recommendation, users should also be aware of some of its disadvantages. It is more dependable than OGR because it uses Python objects for copying vector data instead of C pointers, but this also means it uses more memory, which affects performance.
Python shapefile library (pyshp)
The Python shapefile library (pyshp) is a pure Python library and is used to read and write shapefiles. The pyshp library's sole purpose is to work with shapefiles, and it only uses the Python standard library. You cannot use it for geometric operations. If you're only working with shapefiles, this one-file-only library is simpler than using GDAL.
pyproj
pyproj is a Python package that performs cartographic transformations and geodetic computations. It is a Cython wrapper that provides Python interfaces to PROJ.4 functions, meaning you can access an existing library of C code from Python. PROJ.4 is a projection library that transforms data among many coordinate systems and is also available through GDAL and OGR. The reason that PROJ.4 is still popular and widely used is two-fold: firstly, because it supports so many different coordinate systems, and secondly, because of the routes it provides to do this: Rasterio and GeoPandas, two Python libraries covered next, both use pyproj, and thus PROJ.4 functionality, under the hood. The difference between using PROJ.4 on its own rather than through a package such as GDAL is that it enables you to re-project individual points, something that packages using PROJ.4 do not offer. The pyproj package offers two classes: the Proj class and the Geod class. The Proj class performs cartographic computations, while the Geod class performs geodetic computations.
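A small, hedged sketch of those two classes, using illustrative coordinates only:

```python
# A minimal pyproj sketch of the Proj and Geod classes described above.
from pyproj import Proj, Geod

# Proj: project a longitude/latitude pair into UTM zone 33N coordinates (metres), and back.
utm33 = Proj(proj="utm", zone=33, ellps="WGS84")
easting, northing = utm33(12.49, 41.89)                  # an illustrative lon/lat near Rome
lon, lat = utm33(easting, northing, inverse=True)
print("UTM 33N:", easting, northing)

# Geod: geodetic computation of the distance between two lon/lat points.
geod = Geod(ellps="WGS84")
_, _, distance_m = geod.inv(12.49, 41.89, 2.35, 48.86)   # roughly Rome to Paris
print("Distance: %.1f km" % (distance_m / 1000.0))
```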
Rasterio
Rasterio is a GDAL and NumPy-based Python library for raster data, written with the Python developer in mind rather than C programmers, using Python language types, protocols, and idioms. Rasterio aims to make GIS data more accessible to Python programmers and helps GIS analysts learn important Python standards. Rasterio relies on concepts of Python rather than GIS. Rasterio is an open source project from the satellite team of Mapbox, a provider of custom online maps for websites and applications. The name of this library should be pronounced as raster-i-o rather than ras-te-rio. Rasterio came into being as a result of a project called the Mapbox Cloudless Atlas, which aimed to create a pretty-looking basemap from satellite imagery. One of the software requirements was to use open source software and a high-level language with handy multi-dimensional array syntax. Although GDAL offers proven algorithms and drivers, developing with GDAL's Python bindings feels a lot like C++. Therefore, Rasterio was designed to be a Python package at the top, with extension modules (using Cython) in the middle, and a GDAL shared library at the bottom. Other requirements for the raster library were the ability to read and write NumPy ndarrays to and from data files, and the use of Python types, protocols, and idioms instead of C or C++, to free programmers from having to code in two languages. For georeferencing, Rasterio follows the lead of pyproj. There are a couple of capabilities added on top of reading and writing, one of them being a features module. Reprojection of geospatial data can be done with the rasterio.warp module. Rasterio's project homepage can be found on GitHub.
GeoPandas
GeoPandas is a Python library for working with vector data. It is based on the pandas library, which is part of the SciPy stack and is popular for data inspection and analysis, but which cannot read spatial data on its own. GeoPandas was created to fill this gap, taking pandas data objects as a starting point. The library also adds functionality from geographical Python packages. GeoPandas offers two data objects: a GeoSeries object that is based on a pandas Series object, and a GeoDataFrame, based on a pandas DataFrame object but adding a geometry column for each row. Both GeoSeries and GeoDataFrame objects can be used for spatial data processing, similar to spatial databases. Read and write functionality is provided for almost every vector data format. Also, because both Series and DataFrame objects are subclasses of pandas data objects, you can use the same properties to select or subset data, for example .loc or .iloc. GeoPandas is a library that employs the capabilities of newer tools, such as Jupyter Notebooks, pretty well, whereas GDAL enables you to interact with data records inside vector and raster datasets through Python code. GeoPandas takes a more visual approach by loading all records into a GeoDataFrame so that you can see them all together on your screen. The same goes for plotting data. These functionalities were harder to come by in the Python 2 era, when developers depended on IDEs without the extensive data visualization capabilities that Jupyter Notebooks now provide.
We've provided an overview of the most important open source packages for processing and analyzing geospatial data. The question then becomes when to use a certain package and why. GDAL, OGR, and GEOS are indispensable for geospatial processing and analysis, but they were not written in Python, and so they require Python bindings for Python developers. Fiona, Shapely, and pyproj were written to solve these problems, as was the newer Rasterio library. For a more Pythonic approach, these newer packages are preferable to the older C++ packages with Python bindings (although they're used under the hood). Now that you have an idea of what options are available for a certain use case and why one package is preferable over another, here's something you should always remember. As is often the way in programming, there might be multiple solutions for one particular problem. For example, when dealing with shapefiles, you could use pyshp, GDAL, Shapely, or GeoPandas, depending on your preference and the problem at hand.
Introduction to Data Analysis and Libraries
15 Useful Python Libraries to make your Data Science tasks Easier
"Pandas is an effective tool to explore and analyze data": An interview with Theodore Petrou
Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data

7 AI tools mobile developers need to know

Bhagyashree R
20 Sep 2018
11 min read
Advancements in artificial intelligence (AI) and machine learning has enabled the evolution of mobile applications that we see today. With AI, apps are now capable of recognizing speech, images, and gestures, and translate voices with extraordinary success rates. With a number of apps hitting the app stores, it is crucial that they stand apart from competitors by meeting the rising standards of consumers. To stay relevant it is important that mobile developers keep up with these advancements in artificial intelligence. As AI and machine learning become increasingly popular, there is a growing selection of tools and software available for developers to build their apps with. These cloud-based and device-based artificial intelligence tools provide developers a way to power their apps with unique features. In this article, we will look at some of these tools and how app developers are using them in their apps. Caffe2 - A flexible deep learning framework Source: Qualcomm Caffe2 is a lightweight, modular, scalable deep learning framework developed by Facebook. It is a successor of Caffe, a project started at the University of California, Berkeley. It is primarily built for production use cases and mobile development and offers developers greater flexibility for building high-performance products. Caffe2 aims to provide an easy way to experiment with deep learning and leverage community contributions of new models and algorithms. It is cross-platform and integrates with Visual Studio, Android Studio, and Xcode for mobile development. Its core C++ libraries provide speed and portability, while its Python and C++ APIs make it easy for you to prototype, train, and deploy your models. It utilizes GPUs when they are available. It is fine-tuned to take full advantage of the NVIDIA GPU deep learning platform. To deliver high performance, Caffe2 uses some of the deep learning SDK libraries by NVIDIA such as cuDNN, cuBLAS, and NCCL. Functionalities Enable automation Image processing Perform object detection Statistical and mathematical operations Supports distributed training enabling quick scaling up or down Applications Facebook is using Caffe2 to help their developers and researchers train large machine learning models and deliver AI on mobile devices. Using Caffe2, they significantly improved the efficiency and quality of machine translation systems. As a result, all machine translation models at Facebook have been transitioned from phrase-based systems to neural models for all languages. OpenCV - Give the power of vision to your apps Source: AndroidPub OpenCV short for Open Source Computer Vision Library is a collection of programming functions for real-time computer vision and machine learning. It has C++, Python, and Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. It also supports the deep learning frameworks TensorFlow and PyTorch. Written natively in C/C++, the library can take advantage of multi-core processing. OpenCV aims to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in the commercial products. The library consists of more than 2500 optimized algorithms including both classic and state-of-the-art computer vision algorithms. 
Functionalities These algorithms can be used for the following: To detect and recognize faces Identify objects Classify human actions in videos Track camera movements and moving objects Extract 3D models of objects Produce 3D point clouds from stereo cameras Stitch images together to produce a high-resolution image of an entire scene Find similar images from an image database Applications Plickers is an assessment tool, that lets you poll your class for free, without the need for student devices. It uses OpenCV as its graphics and video SDK. You just have to give each student a card called a paper clicker, and use your iPhone/iPad to scan them to do instant checks-for-understanding, exit tickets, and impromptu polls. Also check out FastCV BoofCV TensorFlow Lite and Mobile - An Open Source Machine Learning Framework for Everyone Source: YouTube TensorFlow is an open source software library for building machine learning models. Its flexible architecture allows easy model deployment across a variety of platforms ranging from desktops to mobile and edge devices. Currently, TensorFlow provides two solutions for deploying machine learning models on mobile devices: TensorFlow Mobile and TensorFlow Lite. TensorFlow Lite is an improved version of TensorFlow Mobile, offering better performance and smaller app size. Additionally, it has very few dependencies as compared to TensorFlow Mobile, so it can be built and hosted on simpler, more constrained device scenarios. TensorFlow Lite also supports hardware acceleration with the Android Neural Networks API. But the catch here is that TensorFlow Lite is currently in developer preview and only has coverage to a limited set of operators. So, to develop production-ready mobile TensorFlow apps, it is recommended to use TensorFlow Mobile. Also, TensorFlow Mobile supports customization to add new operators not supported by TensorFlow Mobile by default, which is a requirement for most of the models of different AI apps. Although TensorFlow Lite is in developer preview, its future releases “will greatly simplify the developer experience of targeting a model for small devices”. It is also likely to replace TensorFlow Mobile, or at least overcome its current limitations. Functionalities Speech recognition Image recognition Object localization Gesture recognition Optical character recognition Translation Text classification Voice synthesis Applications The Alibaba tech team is using TensorFlow Lite to implement and optimize speaker recognition on the client side. This addresses many of the common issues of the server-side model, such as poor network connectivity, extended latency, and poor user experience. Google uses TensorFlow for advanced machine learning models including Google Translate and RankBrain. Core ML - Integrate machine learning in your iOS apps Source: AppleToolBox Core ML is a machine learning framework which can be used to integrate machine learning model in your iOS apps. It supports Vision for image analysis, Natural Language for natural language processing, and GameplayKit for evaluating learned decision trees. Core ML is built on top of the following low-level APIs, providing a simple higher level abstraction to these: Accelerate optimizes large-scale mathematical computations and image calculations for high performance. Basic neural network subroutines (BNNS) provides a collection of functions using which you can implement and run neural networks trained with previously obtained data. 
Metal Performance Shaders is a collection of highly optimized compute and graphic shaders that are designed to integrate easily and efficiently into your Metal app. To train and deploy custom models you can also use the Create ML framework. It is a machine learning framework in Swift, which can be used to train models using native Apple technologies like Swift, Xcode, and Other Apple frameworks. Functionalities Face and face landmark detection Text detection Barcode recognition Image registration Language and script identification Design games with functional and reusable architecture Applications Lumina is a camera designed in Swift for easily integrating Core ML models - as well as image streaming, QR/Barcode detection, and many other features. ML Kit by Google - Seamlessly build machine learning into your apps Source: Google ML Kit is a cross-platform suite of machine learning tools for its Firebase mobile development platform. It comprises of Google's ML technologies, such as the Google Cloud Vision API, TensorFlow Lite, and the Android Neural Networks API together in a single SDK enabling you to apply ML techniques to your apps easily. You can leverage its ready-to-use APIs for common mobile use cases such as recognizing text, detecting faces, identifying landmarks, scanning barcodes, and labeling images. If these APIs don't cover your machine learning problem, you can use your own existing TensorFlow Lite models. You just have to upload your model on Firebase and ML Kit will take care of the hosting and serving. These APIs can run on-device or in the cloud. Its on-device APIs process your data quickly and work even when there’s no network connection. Its cloud-based APIs leverage the power of Google Cloud Platform's machine learning technology to give you an even higher level of accuracy. Functionalities Automate tedious data entry for credit cards, receipts, and business cards, or help organize photos. Extract text from documents, which you can use to increase accessibility or translate documents. Real-time face detection can be used in applications like video chat or games that respond to the player's expressions. Using image labeling you can add capabilities such as content moderation and automatic metadata generation. Applications A popular calorie counter app, Lose It! uses Google ML Kit Text Recognition API to quickly capture nutrition information to ensure it’s easy to record and extremely accurate. PicsArt uses ML Kit custom model APIs to provide TensorFlow–powered 1000+ effects to enable millions of users to create amazing images with their mobile phones. Dialogflow - Give users new ways to interact with your product Source: Medium Dialogflow is a Natural Language Understanding (NLU) platform that makes it easy for developers to design and integrate conversational user interfaces into mobile apps, web applications, devices, and bots. You can integrate it on Alexa, Cortana, Facebook Messenger, and other platforms your users are on. With Dialogflow you can build interfaces, such as chatbots and conversational IVR that enable natural and rich interactions between your users and your business. It provides this human-like interaction with the help of agents. Agents can understand the vast and varied nuances of human language and translate that to standard and structured meaning that your apps and services can understand. It comes in two types: Dialogflow Standard Edition and Dialogflow Enterprise Edition. 
Dialogflow Enterprise Edition users have access to Google Cloud Support and a service level agreement (SLA) for production deployments. Functionalities Provide customer support One-click integration on 14+ platforms Supports multilingual responses Improve NLU quality by training with negative examples Debug using more insights and diagnostics Applications Domino’s simplified the process of ordering pizza using Dialogflow’s conversational technology. Domino's leveraged large customer service knowledge and Dialogflow's NLU capabilities to build both simple customer interactions and increasingly complex ordering scenarios. Also check out Wit.ai Rasa NLU Microsoft Cognitive Services - Make your apps see, hear, speak, understand and interpret your user needs Source: Neel Bhatt Cognitive Services is a collection of APIs, SDKs, and services to enable developers easily add cognitive features to their applications such as emotion and video detection, facial, speech, and vision recognition, among others. You need not be an expert in data science to make your systems more intelligent and engaging. The pre-built services come with high-quality RESTful intelligent APIs for the following: Vision: Make your apps identify and analyze content within images and videos. Provides capabilities such as image classification, optical character recognition in images, face detection, person identification, and emotion identification. Speech: Integrate speech processing capabilities into your app or services such as text-to-speech, speech-to-text, speaker recognition, and speech translation. Language: Your application or service will understand the meaning of the unstructured text or the intent behind a speaker's utterances. It comes with capabilities such as text sentiment analysis, key phrase extraction, automated and customizable text translation. Knowledge: Create knowledge-rich resources that can be integrated into apps and services. It provides features such as QnA extraction from unstructured text, knowledge base creation from collections of Q&As, and semantic matching for knowledge bases. Search: Using Search API you can find exactly what you are looking for across billions of web pages. It provides features like ad-free, safe, location-aware web search, Bing visual search, custom search engine creation, and many more. Applications To safeguard against fraud, Uber uses the Face API, part of Microsoft Cognitive Services, to help ensure the driver using the app matches the account on file. Cardinal Blue developed an app called PicCollage, a popular mobile app that allows users to combine photos, videos, captions, stickers, and special effects to create unique collages. Also check out AWS machine learning services IBM Watson These were some of the tools that will help you integrate intelligence into your apps. These libraries make it easier to add capabilities like speech recognition, natural language processing, computer vision, and many others, giving users the wow moment of accomplishing something that wasn’t quite possible before. Along with choosing the right AI tool, you must also consider other factors that can affect your app performance. These factors include the accuracy of your machine learning model, which can be affected by bias and variance, using correct datasets for training, seamless user interaction, and resource-optimization, among others. 
While building any intelligent app it is also important to keep in mind that the AI in your app is solving a problem and it doesn’t exist because it is cool. Thinking from the user’s perspective will allow you to assess the importance of a particular problem. A great AI app will not just help users do something faster, but enable them to do something they couldn’t do before. With the growing popularity and the need to speed up the development of intelligent apps, many companies ranging from huge tech giants to startups are providing AI solutions. In the future we will definitely see more developer tools coming into the market, making AI in apps a norm. 6 most commonly used Java Machine learning libraries 5 ways artificial intelligence is upgrading software engineering Machine Learning as a Service (MLaaS): How Google Cloud Platform, Microsoft Azure, and AWS are democratizing Artificial Intelligence
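As a closing illustration of what working with one of these tools can look like in practice, here is a small, hedged sketch of loading a converted TensorFlow Lite model with TensorFlow's Python interpreter, which is handy for sanity-checking a model before bundling it into a mobile app. The model path and input shape are placeholders, and recent TensorFlow versions expose the interpreter as tf.lite.Interpreter:

```python
# A hedged sketch of running a converted .tflite model with the Python interpreter.
# "model.tflite" is a placeholder for your own converted model file.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one dummy input of the shape the model expects and run inference.
dummy_input = np.zeros(input_details[0]["shape"], dtype=np.float32)
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()

prediction = interpreter.get_tensor(output_details[0]["index"])
print("Output shape:", prediction.shape)
```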

How to earn $1m per year? Hint: Learn machine learning

Neil Aitken
01 Aug 2018
10 min read
Internet job portal ‘Indeed.com’ links potential employers with people who are looking to take the next step in their careers. The proportion of job posts on their site, relating to ‘Data Science’, a specific job in the AI category, is growing fast (see chart below). More broadly, Artificial Intelligence & machine learning skills, of which ‘Data Scientist’ is just one example, are in demand. No wonder that it has been termed as the sexiest job role of the 21st century. Interest comes from an explosion of jobs in the field from big companies and Start-Ups, all of which are competing to come up with the best AI business and to earn the money that comes with software that automates tasks. The skills shortage associated with Artificial Intelligence represents an opportunity for any developer. There has never been a better time to consider whether reskilling or upskilling in AI could be a lucrative path for you. Below : Indeed.com. Proportion of job postings containing Data Scientist or Data Science. [caption id="attachment_21240" align="aligncenter" width="1525"] Artificial Intelligence skills are increasingly in demand and create a real opportunity for those prepared to reskill or upskill.[/caption] Source: Indeed  The AI skills gap the market is experiencing comes from the difficulty associated with finding an individual demonstrating a competent mixture of the very disparate faculties that AI roles require. Artificial Intelligence and it’s near equivalents such as Machine Learning and Neural Networks operate at the intersection of what have mostly been two very different disciplines – statistics and software development. In simple terms, they are half coding, half maths. Hamish Ogilvy, CEO of AI based Internal Search company Sajari is all too familiar with the problem. He’s on the front line, hiring AI developers. “The hardest part”, says Ogilvy, “is that AI is pretty complex and the average developer/engineer does not have the background in maths/stats/science to actually understand what is happening. On the flip side the trouble with the stats/maths/science people is that they typically can't code, so finding people in that sweet spot that have both is pretty tough.” He’s right. The New York Times suggests that the pool of qualified talent is only 10,000 people, worldwide. Those who do have jobs are typically happily ensconced, paid well, treated appropriately and given no reason whatsoever to want to leave. [caption id="attachment_21244" align="aligncenter" width="1920"] Judged by $ investments in the area alone, AI skills are worth developing for those wishing to stay current on technology skills.[/caption] In fact, an instinct to develop AI skills will serve any technology employee well. No One can have escaped the many estimates, from reputable consultancies, suggesting that Automation will replace up to 30% of jobs in the next 10 years. No job is safe. Every industry is touched by AI in some form. Any responsible individual with a view to the management of their own skills could learn ML and AI skills to stay relevant in current times. Even if you don't want to move out of your current job, learning ML will probably help you adapt better in your industry. What is a typical AI job and what will it pay? OpenAI, a world class Artificial Intelligence research laboratory, revealed the salaries of some of its key Data Science employees recently. Those working in the AI field with a specialization can earn $300 to $500k in their first year out of university. 
Experts in Artificial Intelligence now command salaries of up to $1m. [caption id="attachment_21242" align="aligncenter" width="432"] The New York Times observes AI salaries[/caption] [caption id="attachment_21241" align="aligncenter" width="1121"] The New York Times observes AI salaries[/caption] Source: The New York times Indraneil Roy, an Expert in AI and Talent Acquisition who works for Edge Networks puts it this way when outlining the difficulties of hiring the right skills and to explain why wages in the field are so high. “The challenge is the quality of resources. As demand is high for this skill, we are already seeing candidates with fake experience and work pedigree not up to standards.” The phenomenon is also causing a ‘brain drain’ in Universities. About a third of jobs in the AI field will go to someone with a Ph.D., and all of those are drawn from universities working on the discipline, often lured by the significant pay packages which are available. So, with huge demand and the universities drained, where will future AI employees come from? 3 ways to skill up to become an AI expert (And earn all that money?) There is still not a lot of agreed terminology or even job roles and responsibility in the sector. However, some things are clear. Those wishing to evolve in to the field of AI must understand the conceptual thinking involved, as a starting point, whether that view is found on the job or as part of an informal / formal educational course. Specifically, most jobs in the specialty require a working knowledge of neural networks, data / analytics, predictive analytics, with some basic programming and database skills. There are some great resources available online to train you up. Most, as you’d expect, are available on your smartphone so there really is no excuse for not having a look. 1. Free online course: Machine Learning & Statistics and probability Hamish Ogilvy summed the online education which is available in the area well. There are “so many free courses now on AI from Stanford,” he said, “that people are able to educate themselves and make up for the failings of antiquated university courses. AI is just maths really,” he says “complex models and stats. So that's what people need grounding in to be successful.” Microsoft offer free AI courses for technical professionals: Microsoft’s training materials are second to none. They’re also provided free and provide a shortcut to a credible understanding in an area simply because it comes from a technical behemoth. Importantly, they also have a list of AI services which you can play with, again for free. For example, a Natural Language engine offers a facility for you to submit text from Instant Messaging conversations and establish the sentiment being felt by the writer. Practical experience of the tools, processes and concepts involved will set you apart. See below. [caption id="attachment_21245" align="aligncenter" width="1999"] Check out Microsoft’s free AI training program for developers.[/caption] Google are taking a proactive stance on Machine Learning. They see it’s potential to improve efficiency in every industry and also offer free ML training courses on their site. 2. Take courses on AI/ML Packt’s machine learning courses, books and videos: Packt is working towards a mission to help the world put software to work in new ways, through the delivery of effective learning and information services to IT professionals. 
It has published over 6,000 books and videos so far, providing IT professionals with the actionable knowledge they need to get the job done - whether that's specific learning on an emerging technology or optimizing key skills in more established tools. You can choose from a variety of Packt’s books, videos and courses for AI/ML. Here’s a list of top ones: Artificial Intelligence by Example [Book] Artificial Intelligence for Big data [Book] Learn Artificial Intelligence with TensorFlow [Video] Introduction to Artificial Intelligence with Java [Video] Advanced Artificial Intelligence Projects with Python [Video] Python Machine learning - Second Edition [Book] Machine Learning with R - Second Edition [Book] Coursera’s machine learning courses Coursera is a company which make training courses, for a variety of subjects, available online. Taken from actual University course content and delivered with tests, videos and training notes, all accessed online, each course is roughly a University Module. Students pick up an ‘up to under-graduate’ level of understanding of the content involved. Coursera’s courses are often cited as merit worthy and are recognizable in the industry. Costs vary but are typically between $2k and $5k per course. 3. Learn by doing Familiarize yourself with relevant frameworks and tools including Tensor Flow, Python and Keras. TensorFlow from Google is the most used open source AI software library. You can use existing code in your experiments and experiment with neural networks in much the same way as you can in Microsoft’s. Python is a programming language written for a big data world. Its proponents will tell you that Python saves developers hundreds of lines of code, allowing you to tie together information and systems faster than ever before. Python is used extensively in ML and AI applications and should be at the top of your study list. Keras, a deep learning library is similarly ubiquitous. It’s a high level Neural Network API designed to allow prototyping of your software as fast as possible. Finally, a lesser known but still valuable resources is the Accord.net. It is one final example of the many  software elements with which you can engage with to train yourself up. Accord Framework.net will expose you to image libraries, natural learning and real time facial recognition. Earn extra points with employers AI has several lighthouse tasks which are proving the potential of the technology in these still early stages. We’ve included a couple of examples, Natural Language processing and image recognition, above. Practical expertise in these areas specifically, image or voice recognition or pattern matching are valued highly by employers. Alternatively, have you patented something? A registered patent in your name is highly prized. Especially something to do with Machine Learning. Both will help you showcase Extra skills / achievements that will help your application.’ The specifics of how to apply for patents differ by country but you can find out more about the overall principles of how to submit an idea here. Passion and engagement in the subject are also, clearly appealing characteristics for potential employers to see in applicants. Participating in competitions like Kaggle, and having a portfolio of projects you can showcase on facilities like GitHub are also well prized. Of all of these suggestions, for those employed, any on the job experience you can get will stand you in the best stead. 
Indraneil says "Individual candidates need to spend more time doing relevant projects while in employment. Start ups involved in building products and platforms on AI seem to have better talent." The fact that there are not many AI specialists around is a bad sign There is a demand for employees with AI skills and an investment in relevant training may pay you well. Unfortunately, the underlying problem this situation reveals could be far worse than the problems experienced so far. Together, once found, all these AI scientists are going to automate millions of jobs, in every industry, in every country around the world. If Industry, Governments and Universities cannot train enough people to fill the roles being created by an evolving skills market, we may rightly be concerned to worry about how they will deal with retraining all those displaced by AI, for whom there may be no obvious replacement role available. 18 striking AI Trends to watch in 2018 – Part 1 DeepMind, Elon Musk, and others pledge not to build lethal AI Attention designers, Artificial Intelligence can now create realistic virtual textures
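For readers acting on the "learn by doing" advice above, the following toy Keras sketch (using random, made-up data) gives a sense of how little code a first neural network takes in Python:

```python
# A toy Keras sketch on random, hypothetical data; a starting point for self-study only.
import numpy as np
from tensorflow import keras

# 1,000 fake samples with 20 features each, and a binary label to predict.
features = np.random.rand(1000, 20)
labels = (features.sum(axis=1) > 10).astype(int)

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(features, labels, epochs=5, batch_size=32, verbose=0)
print("Training accuracy:", model.evaluate(features, labels, verbose=0)[1])
```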

Top 5 tools for reinforcement learning

Pravin Dhandre
21 May 2018
4 min read
After deep learning, reinforcement learning (RL) is the hottest branch of artificial intelligence, and it is finding speedy adoption in tech-driven companies. Simply put, reinforcement learning is all about algorithms tracking previous actions or behaviour and providing optimized decisions using the trial-and-error principle. Read How Reinforcement Learning works to know more. It might sound theoretical, but gigantic firms like Google and Uber have tested out this mechanism and have been highly successful in cutting-edge applied robotics fields such as self-driving vehicles. Other top giants including Amazon, Facebook and Microsoft have centralized their innovations around deep reinforcement learning across automotive, supply chain, networking, finance and robotics. With such humongous achievements, reinforcement learning libraries have caught the artificial intelligence developer community's eye and have gained prime interest for training agents and reinforcing the behavior of the trained agents. In fact, researchers believe in the tremendous potential of reinforcement learning to address unsolved real-world challenges like material discovery, space exploration and drug discovery, and to build much smarter artificial intelligence solutions. In this article, we will have a look at the most promising open source tools and libraries to start building your reinforcement learning projects on.
OpenAI Gym
OpenAI Gym, the most popular environment for developing and comparing reinforcement learning models, is completely compatible with high computational libraries like TensorFlow. This Python-based, rich AI simulation environment offers support for training agents on classic games like Atari, as well as for other branches of science like robotics and physics through the Gazebo and MuJoCo simulators. The Gym environment also offers APIs which facilitate feeding observations along with rewards back to agents. OpenAI has also recently released a new platform, Gym Retro, made up of 58 varied and specific scenarios from the Sonic the Hedgehog, Sonic the Hedgehog 2, and Sonic 3 games. Reinforcement learning enthusiasts and AI game developers can register for this competition. Read: How to build a cartpole game using OpenAI Gym
TensorFlow
This is another well-known open source library by Google, followed by more than 95,000 developers every day in areas of natural language processing, intelligent chatbots, robotics, and more. The TensorFlow community has developed an extended version called TensorLayer, providing popular RL modules that can be easily customized and assembled for tackling real-world machine learning challenges. The TensorFlow community allows for framework development in most popular languages such as Python, C, Java, JavaScript and Go. Google and its TensorFlow team are in the process of coming up with a Swift-compatible version to enable machine learning on Apple platforms. Read How to implement Reinforcement Learning with TensorFlow
Keras
Keras offers simplicity in implementing neural networks with just a few lines of code and fast execution. It provides senior developers and principal scientists with a high-level interface to the tensor computation framework TensorFlow, and centers on the model architecture. So, if you have any existing RL models written in TensorFlow, just pick the Keras framework and you can transfer the learning to the related machine learning problem.
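Before moving on to the remaining tools, here is a minimal, hedged sketch of the observation, action and reward loop that Gym exposes, using a random agent on CartPole. It uses the classic Gym API; newer Gym releases changed the reset() and step() signatures slightly:

```python
# A minimal Gym loop with a random agent (classic Gym API).
import gym

env = gym.make("CartPole-v1")
for episode in range(3):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()          # a real agent would choose here
        observation, reward, done, info = env.step(action)
        total_reward += reward                      # the feedback the agent learns from
    print("Episode", episode, "reward:", total_reward)
env.close()
```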
DeepMind Lab
DeepMind Lab is a Google 3D platform with customization for agent-based AI research. It is used to study how self-sufficient artificial agents learn complicated tasks in large, partially observed environments. With the victory of its AlphaGo program against professional Go players in early 2016, DeepMind captured the public's attention. With its three hubs spread across London, Canada and France, the DeepMind team is focusing on core AI fundamentals, which include building a single AI system backed by state-of-the-art methods and distributional reinforcement learning. To know more about how DeepMind Lab works, read How Google's DeepMind is creating images with artificial intelligence.
PyTorch
PyTorch, open sourced by Facebook, is another well-known deep learning library adopted by many reinforcement learning researchers. It was recently preferred almost unanimously by the top 10 finishers in a Kaggle competition. With dynamic neural networks and strong GPU acceleration, RL practitioners use it extensively to conduct experiments on implementing policy-based agents and to explore new ideas. One notable research project is Playing GridWorld, where PyTorch showed its capabilities with renowned RL algorithms like policy gradients and a simplified actor-critic method.
Summing It Up
There you have it, the top tools and libraries for reinforcement learning. The list doesn't end here, as there is a lot of work happening in developing platforms and libraries for scaling reinforcement learning. Frameworks like RL4J and RLlib are already in development and will very soon be fully available for developers to simulate their models in their preferred coding language.
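To give a flavour of the PyTorch-based experimentation mentioned above, here is a hedged toy sketch of a policy network and a REINFORCE-style update; the observations, actions and returns are random placeholders rather than data collected from a real environment:

```python
# A toy policy-gradient update in PyTorch on made-up data.
import torch
import torch.nn as nn

# A tiny policy network mapping a 4-dimensional observation to 2 action logits.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Pretend we collected a batch of 32 observations, the actions taken, and their returns.
observations = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
returns = torch.randn(32)

# REINFORCE-style loss: raise the log-probability of actions that led to high returns.
log_probs = torch.log_softmax(policy(observations), dim=1)
chosen_log_probs = log_probs[torch.arange(32), actions]
loss = -(chosen_log_probs * returns).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print("Policy gradient loss:", loss.item())
```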

5 ways artificial intelligence is upgrading software engineering

Melisha Dsouza
02 Sep 2018
8 min read
47% of digitally mature organizations, or those that have advanced digital practices, said they have a defined AI strategy (Source: Adobe). It is estimated that AI-enabled tools alone will generate $2.9 trillion in business value by 2021. 80% of enterprises are smartly investing in AI. The stats speak for themselves. AI clearly follows the motto "go big or go home". This explosive growth of AI in different sectors of technology is also beginning to show its colors in software development. Shawn Drost, co-founder and lead instructor of coding boot camp 'Hack Reactor', says that AI still has a long way to go and is only impacting the workflow of a small portion of software engineers on a minority of projects right now. AI promises to change how organizations will conduct business and to make applications smarter. It is only logical, then, that software development, i.e., the way we build apps, will be impacted by AI as well. Forrester Research recently surveyed 25 application development and delivery (AD&D) teams, and respondents said AI will improve planning, development and especially testing. We can expect better software created under traditional environments.
5 areas of Software Engineering AI will transform
The 5 major spheres of software development (software design, software testing, GUI testing, strategic decision making, and automated code generation) are all areas where AI can help. A majority of interest in applying AI to software development is already seen in automated testing and bug detection tools. Next in line are the software design precepts, decision-making strategies, and finally automating software deployment pipelines. Let's take an in-depth look into the areas of high and medium interest of software engineering impacted by AI, according to the Forrester Research report.
Source: Forbes.com
#1 Software design
In software engineering, planning a project and designing it from scratch require designers to apply their specialized learning and experience to come up with alternative solutions before settling on a definite one. A designer begins with a vision of the solution and then moves back and forth, exploring design changes, until they reach the desired solution. Settling on the correct design choices at each stage is a tedious and mistake-prone task for designers. Along this line, a few AI developments have demonstrated the advantages of augmenting traditional methods with intelligent agents. The idea is that the agent behaves like a personal assistant to the user, one that should be able to offer timely guidance on how to carry out a design project. For instance, take the example of AIDA, the Artificial Intelligence Design Assistant, deployed by Bookmark (a website building platform). Using AI, AIDA understands a user's needs and desires and uses this knowledge to create an appropriate website for the user. It makes selections from millions of combinations to create a website style, focus, image and more that are customized for the user. In about 2 minutes, AIDA designs the first version of the website, and from that point it becomes a drag and drop operation. You can get a detailed overview of this tool on designshack.
#2 Software testing
Applications interact with each other through countless APIs. They leverage legacy systems and grow in complexity every day.
AI tools can be used to generate test data, check data validity, improve and analyze test coverage, and manage testing. Artificial intelligence, trained right, can ensure the testing performed is error free. Testers freed from repetitive manual tests thus have more time to create new automated software tests with sophisticated features. Also, if software tests are repeated every time source code is modified, repeating those tests can be not only time-consuming but extremely costly. AI comes to the rescue once again by automating the testing for you! With AI automated testing, one can increase the overall scope of tests, leading to an overall improvement of software quality. Take, for instance, the Functionize tool. It enables users to test fast and release faster with AI-enabled cloud testing. Users just have to type a test plan in English and it will automatically be converted into a functional test case. The tool allows one to elastically scale functional, load, and performance tests across every browser and device in the cloud. It also includes self-healing tests that update autonomously in real time. SapFix is another hybrid AI tool deployed by Facebook, which can automatically generate fixes for specific bugs identified by 'Sapienz'. It then proposes these fixes to engineers for approval and deployment to production.
#3 GUI testing
Graphical User Interfaces (GUIs) have become important in interacting with today's software. They are increasingly being used in critical systems, and testing them is necessary to avert failures. With very few tools and techniques available to aid in the testing process, testing GUIs is difficult. Currently used GUI testing methods are ad hoc. They require the test designer to perform humongous tasks like manually developing test cases, identifying the conditions to check during test execution, determining when to check these conditions, and finally evaluating whether the GUI software has been adequately tested. Phew! Now that is a lot of work. Also, not forgetting that if the GUI is modified after being tested, the test designer must change the test suite and perform re-testing. As a result, GUI testing today is resource intensive, and it is difficult to determine if the testing is adequate. Applitools is a GUI testing tool empowered by AI. The Applitools Eyes SDK automatically tests whether visual code is functioning properly or not. Applitools enables users to test their visual code just as thoroughly as their functional UI code to ensure that the visual look of the application is as you expect it to be. Users can test how their application looks across multiple screen layouts to ensure that they all fit the design. It allows users to keep track of both the web page's behaviour and its look. Users can test everything they develop, from the functional behavior of their application to its visual look.
#4 Using Artificial Intelligence in Strategic Decision-Making
Normally, developers have to go through a long process to decide what features to include in a product. However, a machine learning solution trained on business factors and past development projects can analyze the performance of existing applications and help both teams of engineers and business stakeholders like project managers to find solutions that maximize impact and cut risk. Normally, the transformation of business requirements into technology specifications requires a significant timeline for planning.
Machine learning can help software development companies to speed up the process, deliver the product in less time, and increase revenue within a short span. The AI Canvas is a well-known tool for strategic decision making. The canvas helps identify the key questions and feasibility challenges associated with building and deploying machine learning models in the enterprise. The AI Canvas is a simple tool that helps enterprises organize what they need to know into seven categories, namely Prediction, Judgement, Action, Outcome, Input, Training and Feedback. Clarifying these seven factors for each critical decision throughout the organization will help in identifying opportunities for AIs to either reduce costs or enhance performance.
#5 Automatic Code generation/Intelligent Programming Assistants
Coding a huge project from scratch is often labour intensive and time consuming. An intelligent AI programming assistant can reduce the workload to a great extent. To combat the issues of time and money constraints, researchers have tried to build systems that can write code before, but the problem is that these methods aren't that good with ambiguity. Hence, a lot of detail is needed about what the target program aims to do, and writing down these details can be as much work as just writing the code. With AI, the story can be flipped. Bayou, an AI-based application, is an intelligent programming assistant. It began as an initiative aimed at extracting knowledge from online source code repositories like GitHub. Users can try it out at askbayou.com. Bayou follows a method called neural sketch learning. It trains an artificial neural network to recognize high-level patterns in hundreds of thousands of Java programs. It does this by creating a "sketch" for each program it reads and then associating this sketch with the "intent" that lies behind the program. This DARPA initiative aims at making programming easier and less error prone. Sound intriguing? Now that you know how this tool works, why not try it for yourself on i-programmer.info.
Summing it all up
Software engineering has seen massive transformation over the past few years. AI and software intelligence tools aim to make software development easier and more reliable. According to a Forrester Research report on AI's impact on software development, automated testing and bug detection tools use AI the most to improve software development. It will be interesting to see the future developments in software engineering empowered with AI. I'm expecting faster, more efficient, more effective, and less costly software development cycles, while engineers and other development personnel focus on bettering their skills to make advanced use of AI in their processes.
Implementing Software Engineering Best Practices and Techniques with Apache Maven
Intelligent Edge Analytics: 7 ways machine learning is driving edge computing adoption in 2018
15 millions jobs in Britain at stake with AI robots set to replace humans at workforce

5 types of deep transfer learning

Bhagyashree R
25 Nov 2018
5 min read
Transfer learning is a method of reusing a model or knowledge for another related task. Transfer learning is sometimes also considered an extension of existing ML algorithms. Extensive research and work is being done in the context of transfer learning and on understanding how knowledge can be transferred among tasks. However, the Neural Information Processing Systems (NIPS) 1995 workshop Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems is believed to have provided the initial motivation for research in this field. The literature on transfer learning has gone through a lot of iterations, and the terms associated with it have been used loosely and often interchangeably. Hence, it is sometimes confusing to differentiate between transfer learning, domain adaptation, and multitask learning. Rest assured, these are all related and try to solve similar problems. In this article, we will look into the five types of deep transfer learning to get more clarity on how these differ from each other.
This article is an excerpt from a book written by Dipanjan Sarkar, Raghav Bali, and Tamoghna Ghosh titled Hands-On Transfer Learning with Python. This book covers deep learning and transfer learning in detail. It also focuses on real-world examples and research problems using TensorFlow, Keras, and the Python ecosystem with hands-on examples.
#1 Domain adaptation
Domain adaptation is usually referred to in scenarios where the marginal probabilities between the source and target domains are different, such as P(Xs) ≠ P(Xt). There is an inherent shift or drift in the data distribution of the source and target domains that requires tweaks to transfer the learning. For instance, a corpus of movie reviews labeled as positive or negative would be different from a corpus of product-review sentiments. A classifier trained on movie-review sentiment would see a different distribution if utilized to classify product reviews. Thus, domain adaptation techniques are utilized in transfer learning in these scenarios.
#2 Domain confusion
Different layers in a deep learning network capture different sets of features. We can utilize this fact to learn domain-invariant features and improve their transferability across domains. Instead of allowing the model to learn any representation, we nudge the representations of both domains to be as similar as possible. This can be achieved by applying certain preprocessing steps directly to the representations themselves. Some of these have been discussed by Baochen Sun, Jiashi Feng, and Kate Saenko in their paper Return of Frustratingly Easy Domain Adaptation. This nudge toward the similarity of representation has also been presented by Ganin et al. in their paper, Domain-Adversarial Training of Neural Networks. The basic idea behind this technique is to add another objective to the source model to encourage similarity by confusing the domain itself, hence domain confusion.
#3 Multitask learning
Multitask learning is a slightly different flavor of the transfer learning world. In the case of multitask learning, several tasks are learned simultaneously without distinction between the source and targets. In this case, the learner receives information about multiple tasks at once, as compared to transfer learning, where the learner initially has no idea about the target task.
This is depicted in the following diagram:
Multitask learning: the learner receives information from all tasks simultaneously
#4 One-shot learning
Deep learning systems are data hungry by nature, in that they need many training examples to learn the weights. This is one of the limiting aspects of deep neural networks, though such is not the case with human learning. For instance, once a child is shown what an apple looks like, they can easily identify a different variety of apple (with one or a few training examples); this is not the case with ML and deep learning algorithms. One-shot learning is a variant of transfer learning where we try to infer the required output based on just one or a few training examples. This is essentially helpful in real-world scenarios where it is not possible to have labeled data for every possible class (if it is a classification task), and in scenarios where new classes can be added often. The landmark paper by Fei-Fei and their co-authors, One Shot Learning of Object Categories, is credited with coining the term one-shot learning and initiating research in this subfield. This paper presented a variation on a Bayesian framework for representation learning for object categorization. This approach has since been improved upon and applied using deep learning systems.
#5 Zero-shot learning
Zero-shot learning is another extreme variant of transfer learning, which relies on no labeled examples to learn a task. This might sound unbelievable, especially when learning using examples is what most supervised learning algorithms are about. Zero-data learning, or zero-shot learning, methods make clever adjustments during the training stage itself to exploit additional information to understand unseen data. In their book on deep learning, Goodfellow and their co-authors present zero-shot learning as a scenario where three variables are learned: the traditional input variable x, the traditional output variable y, and an additional random variable that describes the task, T. The model is thus trained to learn the conditional probability distribution P(y | x, T). Zero-shot learning comes in handy in scenarios such as machine translation, where we may not even have labels in the target language.
In this article we learned about the five types of deep transfer learning: domain adaptation, domain confusion, multitask learning, one-shot learning, and zero-shot learning. If you found this post useful, do check out the book, Hands-On Transfer Learning with Python, which covers deep learning and transfer learning in detail. It also focuses on real-world examples and research problems using TensorFlow, Keras, and the Python ecosystem with hands-on examples.
CMU students propose a competitive reinforcement learning approach based on A3C using visual transfer between Atari games
What is Meta Learning? Is the machine learning process similar to how humans learn?
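As a closing, hedged illustration of the most common deep transfer learning recipe in practice (reusing a pretrained network as a frozen feature extractor and training only a small new head), here is a minimal Keras sketch; the five target classes are hypothetical:

```python
# A minimal transfer learning sketch in Keras: frozen ImageNet features, new trainable head.
from tensorflow import keras

base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                      # keep the transferred knowledge fixed

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(5, activation="softmax"),   # 5 hypothetical target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(...) would now be run on the (small) labelled target dataset.
```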

Top 5 programming languages for crunching Big Data effectively

Amey Varangaonkar
04 Apr 2018
8 min read
One of the most important decisions that Big Data professionals have to make, especially the ones who are new to the scene or are just starting out, is choosing the best programming languages for big data manipulation and analysis. Understanding the Big Data problem and framing the architecture to solve it is not quite enough these days - the execution needs to be perfect as well, and choosing the right language goes a long way.
The best languages for big data
In this article, we look at 5 of the most popularly used - not to mention highly effective - programming languages for developing Big Data solutions.
Scala
A beautiful crossover of the object-oriented and functional programming paradigms, Scala is fast and robust, and a popular choice of language for many Big Data professionals. The fact that two of the most popular Big Data processing frameworks, Apache Spark and Apache Kafka, have been built on top of Scala tells you everything you need to know about the power of Scala. Scala runs on the JVM, which means that code written in Scala can be easily used within a Java-based Big Data ecosystem. One significant factor that differentiates Scala from Java, though, is that Scala is a lot less verbose in comparison. You can write hundreds of lines of confusing-looking Java code in less than 15 lines in Scala. One negative aspect of Scala, though, is its steep learning curve when compared to languages like Go and Python, and this may put off beginners looking to use it.
Why use Scala for big data?
- Fast and robust
- Suitable for working with Big Data tools like Apache Spark for distributed Big Data processing
- JVM compliant, can be used in a Java-based ecosystem
Python
Python has been declared one of the fastest growing programming languages in 2018 as per the recently held Stack Overflow Developer Survey. Its general-purpose nature means it can be used across a broad spectrum of use-cases, and Big Data programming is one major area of application. Many libraries for data analysis and manipulation which are increasingly being used in a Big Data framework to clean and manipulate large chunks of data, such as pandas, NumPy and SciPy, are all Python-based. Not just that, most popular machine learning and deep learning frameworks such as scikit-learn, TensorFlow and many more, are also written in Python and are finding increasing application within the Big Data ecosystem. One drawback of using Python, and a reason why it is not a first-class citizen when it comes to Big Data programming yet, is that it's slow. Although very easy to use, Big Data professionals have found systems built with languages such as Java or Scala faster and more robust than systems built with Python. However, Python makes up for this limitation with other qualities. As Python is primarily a scripting language, interactive coding and development of analytical solutions for Big Data becomes very easy. Python can integrate effortlessly with existing Big Data frameworks such as Apache Hadoop and Apache Spark, allowing you to perform predictive analytics at scale without any problem.
Why use Python for big data?
- General-purpose
- Rich libraries for data analysis and machine learning
- Easy to use
- Supports iterative development
- Rich integration with Big Data tools
- Interactive computing through Jupyter notebooks
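To make that Python and Spark combination concrete, here is a small, hedged PySpark sketch; "events.csv" and its columns are placeholders for your own data, and it assumes a working PySpark installation:

```python
# A minimal PySpark sketch: read a CSV and run a distributed aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-example").getOrCreate()

events = spark.read.csv("events.csv", header=True, inferSchema=True)

# A group-by that Spark distributes across the cluster.
summary = (events.groupBy("country")
                 .agg(F.count("*").alias("events"),
                      F.avg("duration").alias("avg_duration"))
                 .orderBy(F.desc("events")))
summary.show(10)
spark.stop()
```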
R

It won't come as a surprise to many that those who love statistics love R. Popularly called 'the language of statistics', R is used to build data models for effective and accurate data analysis. Powered by a large repository of R packages (CRAN, the Comprehensive R Archive Network), R gives you just about every type of tool to accomplish any task in Big Data processing - right from analysis to data visualization. R can be integrated seamlessly with Apache Hadoop and Apache Spark, among other popular frameworks, for Big Data processing and analytics. One issue with using R as a programming language for Big Data is that it is not very general-purpose. This means that code written in R is not production-deployable and generally has to be translated into some other programming language such as Python or Java. That said, if your goal is only to build statistical models for Big Data analytics, R is an option you should definitely consider.

Why use R for big data?

Built for data science
Support for Hadoop and Spark
Strong statistical modeling and visualization capabilities
Support for Jupyter notebooks

Java

Then there's always good old Java. Traditional Big Data frameworks such as Apache Hadoop, and the tools within its ecosystem, are Java-based and still in use today in many enterprises. Not to mention the fact that Java is the most stable and production-ready language among all the languages we have discussed so far! Using Java to develop your Big Data applications gives you the ability to use a large ecosystem of tools and libraries for interoperability, monitoring and much more, most of which have already been tried and tested. One major drawback of Java is its verbosity. The fact that you have to write hundreds of lines of code in Java for a task which can be written in barely 15-20 lines of code in Python or Scala can turn off many budding programmers. However, the introduction of lambda functions in Java 8 does make life quite a bit easier. Java also does not support iterative development the way newer languages like Python do, and this is an area of focus for future Java releases. Despite these flaws, Java remains a strong contender for the preferred Big Data programming language because of its history and the continued reliance on traditional Big Data tools and frameworks.

Why use Java for big data?

Traditional Big Data tools and frameworks are written in Java
Stable and production-ready
Large ecosystem of tried and tested tools and libraries

Go

Last but not least, there's Go - one of the fastest rising programming languages in recent times. Designed by a group of Google engineers who were frustrated with C++, Go earns its place on this list simply because it powers so many tools used in Big Data infrastructure, including Kubernetes, Docker and many more. Go is fast, easy to learn, and fairly easy to develop - and deploy - applications with. More importantly, as businesses look at building data analysis systems that can operate at scale, Go-based systems are being used to integrate machine learning and parallel processing of data. It is also possible to interface other languages with Go-based systems with relative ease.

Why use Go for big data?

Fast, easy to use
Many tools used in the Big Data infrastructure are Go-based
Efficient distributed computing

There are a few other languages you might want to consider - Julia, SAS and MATLAB being some major ones which are useful in their own right.
However, when compared to the languages we talked about above, we felt they fell a bit short in some aspects - be it speed, efficiency, ease of use, documentation, or community support, among other things.

Let's take a quick look at the comparison table of all the languages we discussed above. Note that we have used the ✓ symbol for the best possible language/s to help you make an informed decision. This is just our view, and that's not to say that the other languages are any worse! (Columns, in order: Scala, Python, R, Java, Go.)

Speed: ✓ ✓ ✓
Ease of use: ✓ ✓ ✓
Quick learning curve: ✓ ✓
Data analysis capability: ✓ ✓ ✓
General-purpose: ✓ ✓ ✓ ✓
Big Data support: ✓ ✓ ✓ ✓ ✓
Interfacing with other languages: ✓ ✓ ✓
Production-ready: ✓ ✓ ✓

So... which language should you choose?

To answer the question in short - it all depends on the use-case you want to develop. If your focus is hardcore data analysis which involves a lot of statistical computing, R would be your go-to language. On the other hand, if you want to develop streaming applications for your Big Data, Scala can be a preferable choice. If you wish to use Machine Learning to leverage your Big Data and build predictive models, Python will come to your rescue. Lastly, if you plan to build Big Data solutions using just the traditionally available tools, Java is the language for you. You also have the option of combining the power of two languages to get a more efficient and powerful solution. For example, you can train your machine learning model in Python and deploy it on Spark in a distributed mode (a minimal sketch of this pattern follows below). Ultimately, it all depends on how efficiently your solution can function, and more importantly, how fast and accurate it is. Which language do you prefer for crunching your Big Data? Do let us know!
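To make the "train in Python, score on Spark" idea concrete, here is a small, hedged sketch; the synthetic data, feature names, and model choice are illustrative assumptions rather than a prescribed recipe:

```python
# Train a model locally with scikit-learn, then score rows at scale with PySpark.
import numpy as np
from sklearn.linear_model import LogisticRegression
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# 1. Train on a small, local sample (synthetic data for illustration).
X = np.random.rand(1000, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
model = LogisticRegression().fit(X, y)

# 2. Ship the fitted model to the Spark executors and apply it row by row.
spark = SparkSession.builder.appName("score-at-scale").getOrCreate()
df = spark.createDataFrame(
    [tuple(map(float, row)) for row in np.random.rand(10, 3)],
    ["f1", "f2", "f3"],
)
bmodel = spark.sparkContext.broadcast(model)

@udf(returnType=DoubleType())
def score(f1, f2, f3):
    # Each executor reuses the broadcast model to score one row.
    return float(bmodel.value.predict_proba([[f1, f2, f3]])[0][1])

df.withColumn("score", score("f1", "f2", "f3")).show()
```

The training stays in plain Python, while the scoring inherits Spark's distributed execution - which is exactly the kind of two-language combination described above.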


Building a scalable PostgreSQL solution 

Natasha Mathur
14 Apr 2019
12 min read
The term scalability means the ability of a software system to grow as the business using it grows. PostgreSQL provides some features that help you to build a scalable solution, but, strictly speaking, PostgreSQL itself is not scalable. It can effectively utilize the following resources of a single machine:

It uses multiple CPU cores to execute a single query faster with the parallel query feature
When configured properly, it can use all available memory for caching
The size of the database is not limited; PostgreSQL can utilize multiple hard disks when multiple tablespaces are created; with partitioning, the hard disks can be accessed simultaneously, which makes data processing faster

However, when it comes to spreading a database solution over multiple machines, it can be quite problematic, because a standard PostgreSQL server can only run on a single machine. In this article, we will look at different scaling scenarios and their implementation in PostgreSQL. The requirement for a system to be scalable means that a system that supports a business now should also be able to support the same business with the same quality of service as it grows.

This article is an excerpt taken from the book 'Learning PostgreSQL 11 - Third Edition' written by Andrey Volkov and Salahadin Juba. The book explores the concepts of relational databases and their core principles. You'll get to grips with using data warehousing in analytical solutions and reports and scaling the database for high availability and performance.

Let's say a database can store 1 GB of data and effectively process 100 queries per second. What if, with the development of the business, the amount of data being processed grows 100 times? Will it be able to support 10,000 queries per second and process 100 GB of data? Maybe not now, and not in the same installation. However, a scalable solution should be ready to be expanded to handle the load as soon as it is needed.

In scenarios where better performance is required, it is quite common to set up more servers that handle additional load and to copy the same data to them from a master server. In scenarios where high availability is required, this is also a typical solution: continuously copy the data to a standby server so that it can take over in case the master server crashes.

Scalable PostgreSQL solution

Replication can be used in many scaling scenarios. Its primary purpose is to create and maintain a backup database in case of system failure. This is especially true for physical replication. However, replication can also be used to improve the performance of a solution based on PostgreSQL. Sometimes, third-party tools can be used to implement complex scaling scenarios.

Scaling for heavy querying

Imagine there's a system that's supposed to handle a lot of read requests. For example, there could be an application that implements an HTTP API endpoint that supports the auto-completion functionality on a website. Each time a user enters a character in a web form, the system searches the database for objects whose names start with the string the user has entered. The number of queries can be very big because of the large number of users, and also because several requests are processed for every user session. To handle large numbers of requests, the database should be able to utilize multiple CPU cores. If the number of simultaneous requests is really large, the number of cores required to process them can be greater than a single machine could have. The same applies to a system that is supposed to handle multiple heavy queries at the same time. You don't need a lot of queries, but when the queries themselves are big, using as many CPUs as possible would offer a performance benefit - especially when parallel query execution is used. In such scenarios, where one database cannot handle the load, it's possible to set up multiple databases, set up replication from one master database to all of them, making each of them work as a hot standby, and then let the application query different databases for different requests. The application itself can be smart and query a different database each time, but that would require a special implementation of the data-access component of the application, which could look as follows:
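Here is a minimal, hedged Python sketch of such a data-access component; the replica DSNs, the use of psycopg2, and the round-robin policy are illustrative assumptions rather than text from the book:

```python
# A simple data-access component that spreads read-only queries across
# several hot-standby replicas in round-robin order.
import itertools
import psycopg2

# Hypothetical connection strings for the standby servers.
REPLICA_DSNS = [
    "host=replica1 dbname=app user=app",
    "host=replica2 dbname=app user=app",
    "host=replica3 dbname=app user=app",
]
_replicas = itertools.cycle(REPLICA_DSNS)

def run_read_query(sql, params=None):
    """Send a read-only query to the next replica in round-robin order."""
    conn = psycopg2.connect(next(_replicas))
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()

# Example: the auto-completion lookup described above.
# rows = run_read_query("SELECT name FROM objects WHERE name LIKE %s LIMIT 10", ("abc%",))
```

Writes would still have to go to the master, so a real component would also keep a separate connection for read-write transactions.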
Another option is to use a tool called Pgpool-II, which can work as a load balancer in front of several PostgreSQL databases. The tool exposes a SQL interface, and applications can connect to it as if it were a real PostgreSQL server. Pgpool-II will then redirect each query to the database that is executing the fewest queries at that moment; in other words, it performs load balancing.

Yet another option is to scale the application together with the databases, so that one instance of the application connects to one instance of the database. In that case, the users of the application should connect to one of the many instances. This can be achieved with HTTP load balancing.

Data sharding

When the problem is not the number of concurrent queries but the size of the database and the speed of a single query, a different approach can be implemented. The data can be separated into several servers, which will be queried in parallel, and then the results of the queries will be consolidated outside of those databases. This is called data sharding. PostgreSQL provides a way to implement sharding based on table partitioning, where partitions are located on different servers and another one, the master server, uses them as foreign tables. When performing a query on a parent table defined on the master server, depending on the WHERE clause and the definitions of the partitions, PostgreSQL can recognize which partitions contain the data that is requested and will query only those partitions. Depending on the query, joins, grouping, and aggregation can sometimes be performed on the remote servers. PostgreSQL can query different partitions in parallel, which effectively utilizes the resources of several machines. Having all this, it's possible to build a solution where applications connect to a single database that physically executes their queries on different database servers depending on the data that is being queried.

It's also possible to build sharding algorithms into the applications that use PostgreSQL. In short, applications would be expected to know which data is located in which database, write it only there, and read it only from there (a small sketch of this idea follows below). This would add a lot of complexity to the applications. Another option is to use one of the PostgreSQL-based sharding solutions available on the market, or open source solutions. They have their own pros and cons, but the common problem is that they are based on previous releases of PostgreSQL and don't use the most recent features (sometimes providing their own features instead).
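Here is the rough, hedged sketch of the application-side sharding idea mentioned above; the shard DSNs, the table, and the hashing scheme are assumptions made for the example, not part of the original text:

```python
# Minimal application-side sharding: route each user's data to a fixed shard.
import hashlib
import psycopg2

# Hypothetical connection strings, one per shard.
SHARD_DSNS = [
    "host=shard0 dbname=app user=app",
    "host=shard1 dbname=app user=app",
    "host=shard2 dbname=app user=app",
]

def shard_for(key: str) -> str:
    """Deterministically map a sharding key (for example a user ID) to one shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARD_DSNS[digest[0] % len(SHARD_DSNS)]

def save_event(user_id: str, payload: str) -> None:
    # The application must always write (and later read) this user's rows
    # on the same shard, which is what makes the scheme work.
    conn = psycopg2.connect(shard_for(user_id))
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (user_id, payload) VALUES (%s, %s)",
                (user_id, payload),
            )
        conn.commit()
    finally:
        conn.close()
```

Cross-shard queries and re-sharding are exactly where the extra complexity mentioned above comes from.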
One of the most popular sharding solutions is Postgres-XL, which implements a shared-nothing architecture using multiple servers running PostgreSQL. The system has several components:

Multiple data nodes: store the data
A single global transaction monitor (GTM): manages the cluster and provides global transaction consistency
Multiple coordinator nodes: support user connections, build query-execution plans, and interact with the GTM and the data nodes

Postgres-XL implements the same API as PostgreSQL, so applications don't need to treat the server in any special way. It is ACID-compliant, meaning it supports transactions and integrity constraints. The COPY command is also supported. The main benefits of using Postgres-XL are as follows:

It can scale to support more reading operations by adding more data nodes
It can scale to support more writing operations by adding more coordinator nodes
The current release of Postgres-XL (at the time of writing) is based on PostgreSQL 10, which is relatively new

The main downside of Postgres-XL is that it does not provide any high-availability features out of the box. When more servers are added to a cluster, the probability of any one of them failing increases. That's why you should take care with backups or implement replication of the data nodes themselves. Postgres-XL is open source, but commercial support is available.

Another solution worth mentioning is Greenplum. It's positioned as an implementation of a massively parallel-processing database, specifically designed for data warehouses. It has the following components:

Master node: manages user connections, builds query-execution plans, manages transactions
Data nodes: store the data and perform queries

Greenplum also implements the PostgreSQL API, and applications can connect to a Greenplum database without any changes. It supports transactions, but support for integrity constraints is limited. The COPY command is supported. The main benefits of Greenplum are as follows:

It can scale to support more reading operations by adding more data nodes
It supports column-oriented table organization, which can be useful for data-warehousing solutions
Data compression is supported
High-availability features are supported out of the box; it's possible (and recommended) to add a secondary master that takes over if the primary master crashes, and it's also possible to add mirrors to the data nodes to prevent data loss

The drawbacks are as follows:

It doesn't scale to support more writing operations: everything goes through the single master node, and adding more data nodes does not make writing faster; however, it's possible to import data from files directly on the data nodes
It uses PostgreSQL 8.4 at its core; Greenplum has a lot of improvements and new features added to the base PostgreSQL code, but it's still based on a very old release, although the system is being actively developed
Greenplum doesn't support foreign keys, and support for unique constraints is limited

There are commercial and open source editions of Greenplum.

Scaling for a large number of connections

Yet another use case related to scalability is when the number of database connections is large. When a single database is used in an environment with a lot of microservices, each with its own connection pool, it's possible that hundreds or even thousands of connections are opened against the database, even if they don't perform too many queries. Each connection consumes server resources, so just the requirement to handle a great number of connections can already be a problem, without even performing any queries.
If applications don't use connection pooling and instead open connections only when they need to query the database and close them afterwards, another problem can occur. Establishing a database connection takes time - not too much, but when the number of operations is great, the total overhead will be significant. There is a tool named PgBouncer that implements connection-pooling functionality. It can accept connections from many applications as if it were a PostgreSQL server and then open a limited number of connections towards the database. It reuses the same database connections for multiple applications' connections. The process of establishing a connection from an application to PgBouncer is much faster than connecting to a real database, because PgBouncer doesn't need to initialize a database backend process for the session.

PgBouncer can create multiple connection pools that work in one of three modes:

Session mode: A connection to a PostgreSQL server is used for the lifetime of a client connection to PgBouncer. Such a setup could be used to speed up the connection process on the application side. This is the default mode.
Transaction mode: A connection to PostgreSQL is used for a single transaction that a client performs. This can be used to reduce the number of connections on the PostgreSQL side when only a few transactions are performed simultaneously.
Statement mode: A database connection is used for a single statement. Then it is returned to the pool and a different connection is used for the next statement. This mode is similar to transaction mode, though more aggressive. Note that multi-statement transactions are not possible when statement mode is used.

Different pools can be set up to work in different modes. It's also possible to let PgBouncer connect to multiple PostgreSQL servers, thus working as a reverse proxy. PgBouncer could be used as follows: PgBouncer establishes several connections to the database. When an application connects to PgBouncer and starts a transaction, PgBouncer assigns an existing database connection to that application, forwards all SQL commands to the database, and delivers the results back. When the transaction is finished, PgBouncer dissociates the connection, but does not close it. If another application starts a transaction, the same database connection could be used. Such a setup requires configuring PgBouncer to work in transaction mode.

PostgreSQL provides several ways to implement replication that maintains a copy of the data from a database on another server or servers. This can be used as a backup or a standby solution that takes over in case the main server crashes. Replication can also be used to improve the performance of a software system by making it possible to distribute the load over several database servers.

In this article, we discussed the problem of building scalable solutions based on PostgreSQL that utilize the resources of several servers. We looked at scaling for heavy querying and data sharding, as well as scaling for a large number of connections. If you enjoyed reading this article and want to explore other topics, be sure to check out the book 'Learning PostgreSQL 11 - Third Edition'.

Handling backup and recovery in PostgreSQL 10 [Tutorial]
Understanding SQL Server recovery models to effectively backup and restore your database
Saving backups on cloud services with ElasticSearch plugins

A non-programmer’s guide to learning Machine Learning

Natasha Mathur
05 Sep 2018
11 min read
Artificial intelligence might seem intimidating, but it isn't actually as complex as you might think. Many of the tools that have been developed over the last decade or so have helped to make artificial intelligence and machine learning more accessible to engineers with varying degrees of experience and knowledge. Today, we've got to a stage where it's accessible even to people who have barely written a line of code in their life! Pretty exciting, right? But if you're completely new to the field, it can be challenging to know how to get started - fortunately, we're about to help you overcome that first hurdle.

If you are an AI denier, then be sure to first read 'why learn Machine Learning as a non-techie' before you move forward. A strong purpose and belief is the first step to learning anything new. Alright, now here's how you can get started with artificial intelligence and machine learning techniques quickly.

0. Use a free MLaaS or a no-code interactive machine learning tool to experience first-hand what is possible with machine learning: Some popular examples of no-code machine-learning-as-a-service options are Microsoft Azure, BigML, Orange, and Amazon ML. Read Q2 under the FAQ section below to know more on this topic.

1. Learn linear algebra: Linear algebra is the elementary unit for ML. It helps you effectively comprehend the theory behind machine learning algorithms and how they work. It also strengthens related skills such as statistics and programming, which all help in ML.

Learning resources:
Linear Algebra for Beginners: Open Doors to Great Careers
Linear algebra Basics

2. Learn just enough Python or any programming language: You can get started with any language of your interest, but we suggest Python as it's great for people who are new to programming. It's easy to learn due to its simple syntax, and you'll be able to quickly implement ML algorithms. It also has a rich development ecosystem that offers a ton of libraries and frameworks for Machine Learning, such as scikit-learn, Lasagne, NumPy, SciPy, Theano, TensorFlow, and more.

Learning resources:
Python Machine Learning
Learn Python in 7 Days
Python for Beginners 2017 [Video]
Learn Python with codecademy
Python editor for beginner programmers

3. Learn basic probability theory and statistics: A lot of fundamental statistical and probability theory forms the basis for ML. You've probably already learned probability and statistics in school, which makes it easier to dive into the more advanced statistics used in ML. Machine learning, in its currently widely used form, is a way to predict odds and see patterns. Knowing statistics and probability is important, as it will give you a better understanding of why any machine learning algorithm works. For example, your grounding in this area will help you ask the right questions, choose the right set of algorithms, and know what to expect as answers from your ML model on questions such as:

What are the odds of this person also liking this movie, given their current movie-watching choices? (collaborative filtering and content-based filtering)
How similar is this user to that group of users who bought a bunch of stuff on my site? (clustering, collaborative filtering, and classification)
Could this person be at risk of cancer, given a certain set of traits and health indicator observations? (logistic regression)
Should you buy that stock? (decision tree)

Also, check out our interview with James D. Miller to know more about why learning stats is important in this field.

Learning resources:
Statistics for Data Science [Video]
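As a small, hedged illustration of the "predicting odds" idea from step 3 (the logistic-regression case in the list above), here is a toy Python sketch; the data is synthetic and the feature names are made up:

```python
# Toy example: estimating the odds of a risk outcome with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 500)          # made-up "age" feature
marker = rng.normal(0.0, 1.0, 500)      # made-up "health marker" feature
at_risk = ((age > 55) & (marker > 0.3)).astype(int)   # toy ground truth

X = np.column_stack([age, marker])
model = LogisticRegression().fit(X, at_risk)

# "What are the odds for this person?" expressed as a probability.
print(model.predict_proba([[62.0, 0.8]])[0][1])
```

The point is not the model itself, but that the question and its answer are statistical in nature - which is exactly why step 3 matters.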
4. Learn machine learning algorithms: Do not get intimidated! You don't have to be an expert to learn ML algorithms. Knowing the basic ML algorithms that are most used in real-world applications - linear regression, naive Bayes, and decision trees - is enough to get you started. Learn what they do and how they are used in Machine Learning.

5. Learn NumPy, scikit-learn, Keras, or any other popular machine learning framework: It can be confusing initially to decide which framework to learn, and each one has its own advantages and disadvantages. NumPy is a linear algebra library which is useful for performing mathematical and logical operations, and it lets you easily work with large multidimensional arrays. scikit-learn helps with quick implementation of popular algorithms on datasets, as just one line of code makes different algorithms available to you. Keras is minimalistic and straightforward with a high level of extensibility, so it is easier to approach.

Learning resources:
Hands-on Machine Learning with TensorFlow [Video]
Hands-on Scikit-learn for Machine Learning [Video]

If you have made it this far, it is time to put your learning into practice. Go ahead and create a simple linear regression model using some publicly available dataset in your area of interest. Kaggle, ourworldindata.org, the UC Irvine Machine Learning Repository, and elitedatascience all have a rich set of clean datasets in varied fields. Now, it is necessary to commit and put in daily effort to practise these skills. Quora, Reddit, Medium, and Stack Overflow will be your best friends when it comes to solving doubts regarding any of these skills. Data Helpers is another great resource that provides newcomers with help on queries regarding entering the ML field and related topics. Additionally, once you start getting the hang of these skills, identify your strengths and interests to realign your career goals. Research the kind of work you want to put your newly gained Machine Learning skills to use on. It needn't be professional or serious; it just needs to be something that you deeply care about or are passionate about. This will pull you through your learning milestones, should you feel low at some point. Also, don't forget to collaborate with other people and learn from them. You can work with web developers, software programmers, data analysts, data administrators, game developers, and so on. Finally, keep yourself updated with all the latest happenings in the ML world. Follow top experts and influencers on social media, top blogs on Machine Learning, and conferences. Once you are done checking these steps off your list, you'll be ready to start off with your ML project.

Now, we'll be looking at the most frequently asked questions by beginners in the field of Machine Learning.

Frequently asked questions by beginners in ML

As a beginner, it's natural to have a lot of questions regarding ML. We'll be addressing the top three frequently asked questions by beginners or non-programmers when it comes to Machine Learning:

Q.1 I am looking to make a career in Machine Learning but I have no prior programming experience. Do I need to know programming for Machine Learning?

In a nutshell, yes. If you want a career in Machine Learning then having some form of programming knowledge really helps. As mentioned earlier in this article, learning a programming language can really help you with implementing ML algorithms.
It also lets you understand the internal mechanics of Machine Learning. So, having programming as a prior skill is great. Again, as mentioned before, you can get started with Python, which is the easiest and most common language for ML. However, programming is just a part of Machine Learning. For instance, "machine learning engineers" typically write more code than they develop models, while "research scientists" work more on modelling and analyzing different models. Now, ML is based on the principles of statistical inference, and for talking statistically to the computer we need a language - that is where coding comes in. So, even though the nature of your job in ML might not require you to code as much, there's still some amount of coding required.

Read also:
Why is Python so good for AI and ML? 5 Python Experts Explain
Top languages for Artificial Intelligence development

Q.2 Are there any tools that can help me with Machine Learning without touching a single line of code?

Yes. With the rise of MLaaS (Machine Learning as a Service), there are certain tools that help you get started with machine learning right away. These are especially useful for business applications of ML, such as predictive modelling and clustering.

Read also: How MLaaS is transforming cloud

Some of the most popular ones are:

BigML: This cloud-based web service lets you upload your data, prepare it, and run algorithms on it. It's great for people without extensive data science backgrounds. It offers a clean and easy-to-use interface for configuring algorithms (decision trees) and reviewing the results. Being focused "only" on Machine Learning, it comes with a wide set of features, all well integrated within a usable web UI. It also offers an API, so if you like it you can build an application around it.

Microsoft Azure: The Microsoft Azure ML Studio is a "GUI-based integrated development environment for constructing and operationalizing Machine Learning workflow on Azure". Via this integrated development environment, called ML Studio, people without a data science background, or non-programmers, can also build data models with the help of drag-and-drop gestures and simple data flow diagrams. ML Studio's library of sample experiments also saves a lot of time.

Learning resources:
Microsoft Azure Machine Learning
Machine Learning In The Cloud With Azure ML [Video]

Orange: This is an open source machine learning and data visualization studio for novices and experts alike. It provides a toolbox comprising text mining (topic modelling) and image recognition. It also offers a design tool for visual programming which allows you to connect together data preparation, algorithms, and result evaluation, thereby creating machine learning "programs". Apart from that, it provides over 100 widgets for the environment, and there's also a Python API and library available which you can integrate into your application.

Amazon ML: Amazon ML is a part of Amazon Web Services (AWS) that combines powerful machine learning algorithms with interactive visual tools to guide you towards easily creating, evaluating, and deploying machine learning models. So, whether you are a data scientist or a newbie, it offers ML services and tools tailored to meet your needs and level of expertise. Building ML models using Amazon ML consists of three operations: data analysis, model training, and evaluation.
Learning resources:
Effective Amazon Machine Learning

Q.3 Do I need to know advanced mathematics (college graduate level) to learn Machine Learning?

It depends. As mentioned earlier, an understanding of probability, statistics, and linear algebra can really make your machine learning journey easier and also help simplify your code. These topics help you understand the "why" behind the working of machine learning algorithms, which is quite fundamental to understanding ML. However, not knowing advanced mathematics is not an excuse for skipping Machine Learning. There are a lot of libraries which make the task of applying an ML algorithm to a problem easier. One such example is the widely used Python library scikit-learn. With scikit-learn, you just need one line of code and you'll have the most common algorithms there for you, ready to be used. But if you want to go deeper into machine learning, then knowing advanced mathematics is a prerequisite, as it will help you understand the algorithms, the formulas, how the learning is done, and many other Machine Learning concepts. Also, with so many courses and tutorials online, you can always learn advanced mathematics on the side while exploring Machine Learning.

So, we looked at the three most asked questions by beginners in the field of Machine Learning. In the past, machine learning has given us self-driving cars, effective web search, speech recognition, and more. Machine learning is extremely pervasive; in fact, many researchers believe that ML is the best way to make progress towards human-level AI. Learning ML is not an easy task, but it's not next to impossible either. In the end, it all depends on the amount of dedication and effort that you're willing to put in to get a grasp of it. We just touched the tip of the iceberg in this article; there's a lot more to know in Machine Learning, which you will get the hang of as you get your hands dirty in it. That being said, all the best for the road ahead!

Facebook launches a 6-part ML video series
7 of the best ML conferences for the rest of 2018
Google introduces Machine Learning courses for AI beginners


The Deep Learning Framework Showdown: TensorFlow vs CNTK

Aaron Lazar
30 Oct 2017
6 min read
The question several deep learning engineers may ask themselves is: which is better, TensorFlow or CNTK? Well, we're going to answer that question for you, taking you through a closely fought match between the two most exciting frameworks. So, here we are, ladies and gentlemen, it's fight night and it's a full house. In the Red corner, weighing in at two hundred and seventy pounds of Python and topping out at over ten thousand frames per second; managed by the American tech giant, Google; we have the mighty, the beefy, TensorFlow! In the Blue corner, weighing in at two hundred and thirty pounds of C++ muscle, we have one of the top toolkits that can comfortably scale beyond a single machine. Managed by none other than Microsoft, it's fast, it's furious, it's CNTK, aka the Microsoft Cognitive Toolkit!

And we're into Round One… TensorFlow and CNTK are looking quite menacingly at each other and are raging to take down their opponents. TensorFlow seems pleased that its compile times are considerably faster than those of its predecessor, Theano. Although, it looks like happiness came a tad too soon. CNTK, light and bouncy on its feet, comes straight out of nowhere with a whopping seventy thousand frames/second upper cut, knocking TensorFlow to the floor. TensorFlow looks like it's in no mood to give up anytime soon. It makes itself so simple to use and understand that even students can pick it up and start training their own models. This isn't the case with CNTK, as it begs to shed its complexity. On the other hand, CNTK seems to be thrashing TensorFlow in terms of 3D convolution, where CNTK can clearly recognize images from streaming content. TensorFlow also tries its best to run LSTM RNNs, but in vain. The crowd keeps cheering on… Wait a minute... are they calling out for TensorFlow? Yes they are! There's hardly any cheering for CNTK. This is embarrassing! Looks like its community support can't match up to TensorFlow's. And ladies and gentlemen, that does make a difference - we can see TensorFlow improving on several fronts and gradually getting back in the game!

TensorFlow huffs and puffs as it tries to prove that it's not just about deep learning and that it has tools in its pocket that can support other algorithms such as reinforcement learning. It conveniently whips out TensorBoard, and drops CNTK to the floor with its beautiful visualizations. TensorFlow now has the upper hand and is trying hard to pin CNTK to the floor, using its R support to finish it off. But CNTK tactfully breaks loose and leaves TensorFlow on the floor - still not ready to be used in production. And there goes the bell for Round One!

Both fighters look exhausted, but you can see a faint twinkle in TensorFlow's eye, primarily because it survived Round One. Google seems to be working hard to prep it for Round Two and is making several improvements in terms of speed and flexibility and, above all, making it ready for production. Meanwhile, Microsoft boosts CNTK's spirits with a shot of Python APIs in its blood. As it moves towards version 2.0, CNTK has received a lot of improvements; Microsoft has ensured that it's not left behind, for example by providing a backend for Keras, which puts it on par with TensorFlow (see the short sketch at the end of this article). Moreover, there seem to be quite a few experimental features that it looks ready to enter the ring with, like the Java API, for example. It's the final round and boy, are these two into a serious stare-down! The referee waves them in and off they go. CNTK needs to get back at TensorFlow.
Comfortably supporting multiple GPUs and CPUs out of the box, across both Windows and Linux, it has an advantage over TensorFlow. Is it going to use that trump card? Yes it is! A thousand GPUs and a hundred machines in, and CNTK is raining blows on TensorFlow. TensorFlow clearly drops the ball when it comes to multiple machines, and it rather complicates things. It's high time that TensorFlow turned the tables. Lo and behold! It shows off its mobile deep learning capabilities with TensorFlow Lite, clearly flipping CNTK flat on its back. This is revolutionary and a tremendous breakthrough for TensorFlow! CNTK, however, is clearly the people's choice when it comes to language compatibility. With support for C++, Python, C#/.NET and now Java, it's winning in this area. Round Two is coming to an end, ladies and gentlemen, and it's a neck-and-neck battle out there. We're not sure the judges are going to be able to choose a clear winner, from the looks of it. And… there goes the bell!

While the scores are being tallied, we go over to the teams and some spectators for some gossip on the what's what of deep learning. Did you know that having multiple-machine support is a huge advantage? It increases speed and efficiency by almost 10 times! That's something! We also got to know that TensorFlow is training hard and is picking up positives from its rival, CNTK. There are also rumors about a new kid called MXNet (read about it here) that has APIs in R, Python and even Julia! This makes it one helluva framework in terms of flexibility and speed. In fact, AWS is already implementing it, while Apple is also rumored to be using it. Clearly something to watch out for.

And finally, the judges have made their decision. Ladies and gentlemen, after two rounds of sheer entertainment, we have the results...

                                   TensorFlow   CNTK
Processing speed                   0            1
Learning curve                     1            0
Production readiness               0            1
Community support                  1            0
CPU, GPU computation support       0            1
Mobile deep learning               1            0
Multiple language compatibility    0            1

It's a unanimous decision and, just as we thought, CNTK is the heavyweight champion! CNTK clearly beat TensorFlow in terms of performance, because of its flexibility, speed and ability to be used in production! As a deep learning engineer, should you want to use one of these frameworks in your tasks, you should check out their features thoroughly, test them out with a test dataset and then apply them to your actual data. After all, it's the choices we make that define a win or a loss - simplicity over resource utilisation, or speed over platform; we must choose our tools wisely. For more information on the kind of tests that both the tools have been put through, read the research paper presented by Shaohuai Shi, Qiang Wang, Pengfei Xu and Xiaowen Chu from the Department of Computer Science, Hong Kong Baptist University, and these benchmarks.
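For readers who want to try both contenders from the same code, the multi-backend Keras of that era could run the same model definition on top of either TensorFlow or CNTK. The sketch below is illustrative only; the model and its dimensions are made up, and selecting the backend is done outside the script (for example via the KERAS_BACKEND environment variable or keras.json):

```python
# The same Keras model definition runs on either backend:
#   KERAS_BACKEND=tensorflow python model.py
#   KERAS_BACKEND=cntk       python model.py
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(128, activation="relu", input_shape=(784,)),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # identical model summary regardless of which framework sits underneath
```

This is one practical way to benchmark the two frameworks on your own data before committing to either.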