
Tech Guides - Data

281 Articles

Does AI deserve to be so Overhyped?

Aaron Lazar
28 May 2018
6 min read
The short answer is yes, and no. The long answer is, well, read on to find out.

Many have been asking the question, myself included: is artificial intelligence just another passing fad, like Google Glass or nanotechnology? The hype for AI has built up over the past few years, although if you look back at the 1960s, it seems to have started way back then. From the early 90s all the way to the early 2000s, a lot of media and television shows talked about AI quite a bit. Going 25 centuries even further back, Aristotle speaks not just of thinking machines but of autonomous ones in his Politics:

"for if every instrument, at command, or from a preconception of its master's will, could accomplish its work (as the story goes of the statues of Daedalus; or what the poet tells us of the tripods of Vulcan, 'that they moved of their own accord into the assembly of the gods'), the shuttle would then weave, and the lyre play of itself; nor would the architect want servants, or the master slaves."

- Aristotle, Politics: A Treatise on Government, Book 1, Chapter 4

This imagery of AI has sunk into our subconscious minds over the centuries, propelling creative work, academic research, and industrial revolutions toward that goal. The thought of giving machines a mind of their own existed long ago, but recent advancements in technology have made it much more concrete and realistic.

The Rise of the Machines

The year is 2018. The fourth Industrial Revolution is happening, and intelligent automation has taken over. This is the point where I say no, AI is not overhyped. General Electric, for example, is a billion-dollar manufacturing company that has already invested in AI. GE Digital has AI running through several automated systems, and it even has its own IIoT platform called Predix.

Similarly, in the field of healthcare, the implementation of AI is growing in leaps and bounds. The Google DeepMind project is able to process millions of medical records within minutes. Although this kind of research is in its early phase, Google is working closely with the Moorfields Eye Hospital NHS Foundation Trust to apply AI to improving eye treatment. AI startups focused on healthcare and allied areas such as genetic engineering are among the most heavily venture-funded in recent times.

Computer vision, or image recognition, is one field where AI has really proven its power. Analysing datasets like Iris has never been easier, paving the way for more advanced use cases like automated quality checks in manufacturing units. Healthcare again offers interesting examples: AI has helped sift through tonnes of data, helping doctors diagnose illnesses quicker, manufacture more effective and responsive drugs, and monitor patients. The list is endless, clearly showing that AI has made its mark in several industries.

Back (up) to the Future

Now, if you talk about the commercial implementations of AI, they are still quite far-fetched at the moment. Take the same computer vision application, for example. Its implementation would be a huge breakthrough for autonomous vehicles. But if researchers have managed only about 80% accuracy for object recognition on roads, the battle is not close to being won! And even if accuracy improves, do you think driverless vehicles are ready to drive in snow, rain, or storms?
I remember a few years ago, Business Process Outsourcing was one industry, at least in India, that was quite fearful of AI and autonomous systems taking over its jobs. Yet machines are capable of performing only 60-70% of BPO processes in insurance, and with changing customer requirements and simultaneously falling patience levels, those numbers are not good enough! It looks like the end of Moore's law is here, for AI I mean. Well, you can't really expect AI to have the same exponential growth that computers did decades ago. There are a lot of unmet expectations in several fields, which has a considerable number of people thinking that AI isn't going to solve their problems now, and they're right. It is probably going to take a few more years to mature, making it a thing of the future, not of the present. Is AI overhyped now? Yeah, maybe?

What I think

Someone once said that hype is a double-edged sword. If there's not enough, innovation may become obscure; if there's too much, expectations become unreasonable. It's true that AI has several beneficial use cases, but what about the fairness of such systems? Will machines continue to think the way they're supposed to, or will they start finding their own missions that don't involve benefits to the human race? At the same time, there are questions of security and data privacy. GDPR will come into effect in a few days, but what about the prevailing issues of internet security?

I had an interesting discussion with a colleague yesterday. We were talking about what the impact of AI could be for us as end customers in a developing and young country like India. Do we really need to fear losing our jobs? Will we be able to reap the benefits of AI directly, or will the impact be indirect? The answer is, probably yes, but not so soon. If we drew up a hierarchy-of-needs pyramid for AI, each field would have to climb through several stages before fully leveraging AI: collecting data, storing it effectively, exploring it, aggregating it, optimising it with the help of algorithms, and finally achieving AI. That's bound to take a LOT of time!

Honestly speaking, a country like India still lacks AI implementation in several fields. The major customers of AI, apart from some industrial giants, will obviously be governments, although that is sure to take at least a decade or so, keeping in mind the several aspects to be accomplished first. In the meantime, budding AI developers and engineers are scurrying to skill themselves up in the race to be the cream of the crowd! And what about the rest of the world? Well, I can't speak for everyone, but if you ask me, AI is a really promising technology, and we need to give it some time; allow the industries and organisations investing in it the time to let it evolve and ultimately benefit us customers, one way or another.

Related reading:
- You can now make music with AI thanks to Magenta.js
- Splunk leverages AI in its monitoring tools


Top 7 libraries for geospatial analysis

Aarthi Kumaraswamy
22 May 2018
12 min read
The term geospatial refers to information that is located on the earth's surface. This can include, for example, the position of a cellphone tower, the shape of a road, or the outline of a country. Geospatial data often associates some piece of information with a particular location.

Geospatial development is the process of writing computer programs that can access, manipulate, and display this type of information. Internally, geospatial data is represented as a series of coordinates, often in the form of latitude and longitude values. Additional attributes, such as temperature, soil type, height, or the name of a landmark, are also often present. There can be many thousands (or even millions) of data points for a single set of geospatial data.

In addition to the prosaic tasks of importing geospatial data from various external file formats and translating data from one projection to another, geospatial data can also be manipulated to solve various interesting problems. Obvious examples include calculating the distance between two points, calculating the length of a road, or finding all data points within a given radius of a selected point. We use libraries to solve all of these problems and more. Today we will look at the major libraries used to process and analyze geospatial data:

- GDAL/OGR
- GEOS
- Shapely
- Fiona
- Python Shapefile Library (pyshp)
- pyproj
- Rasterio
- GeoPandas

This is an excerpt from the book Mastering Geospatial Analysis with Python by Paul Crickard, Eric van Rees, and Silas Toms.

Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library

The Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library are two separate libraries that are generally downloaded together as GDAL. This means that installing the GDAL package also gives access to OGR functionality. The reason GDAL is covered first is that the other packages were written after GDAL, so chronologically, it comes first. As you will notice, some of the packages covered in this post extend GDAL's functionality or use it under the hood.

GDAL was created in the 1990s by Frank Warmerdam and saw its first release in June 2000. Later, development of GDAL was transferred to the Open Source Geospatial Foundation (OSGeo). Technically, GDAL is a little different from your average Python package, as the GDAL package itself was written in C and C++, meaning that in order to use it in Python, you need to compile GDAL and its associated Python bindings. However, using conda and Anaconda makes it relatively easy to get started quickly. Because it was written in C and C++, the online GDAL documentation is written for the C++ version of the libraries. For Python developers, this can be challenging, but many functions are documented and can be consulted with the built-in pydoc utility, or by using the help function within Python.

Because of its history, working with GDAL in Python also feels a lot like working in C++ rather than pure Python. For example, the naming conventions in OGR differ from Python's, with uppercase rather than lowercase function names. These differences explain the choice for some of the other Python libraries covered in this post, such as Rasterio and Shapely, which were written from a Python developer's perspective while offering the same GDAL functionality.

GDAL is a massive and widely used data library for raster data. It supports the reading and writing of many raster file formats, with the latest version supporting up to 200 different file formats. Because of this, it is indispensable for geospatial data management and analysis. Used together with other Python libraries, GDAL enables some powerful remote sensing functionality. It's also an industry standard and is present in commercial and open source GIS software.
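To give a flavor of that C-style API, here is a minimal sketch of opening a raster with the GDAL Python bindings; the file name is a placeholder, and the snippet assumes GDAL has been installed (for example, through conda):

```python
# A minimal sketch: open a raster and inspect it with the GDAL bindings.
# "example.tif" is a hypothetical file; any GDAL-supported raster works.
from osgeo import gdal

dataset = gdal.Open("example.tif")
print(dataset.RasterXSize, dataset.RasterYSize)  # raster dimensions in pixels
print(dataset.GetProjection())                   # coordinate reference system

band = dataset.GetRasterBand(1)                  # bands are 1-indexed, C-style
data = band.ReadAsArray()                        # raster contents as a NumPy array
print(data.min(), data.max())
```

Note the uppercase method names: exactly the C++-flavored convention described above.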
The OGR library is used to read and write vector-format geospatial data, supporting many different formats. OGR uses a consistent model to manage many different vector data formats. You can use OGR for vector reprojection, vector data format conversion, vector attribute data filtering, and more. The GDAL/OGR libraries are not only useful for Python programmers but are also used by many GIS vendors and open source projects. The latest GDAL version at the time of writing is 2.2.4, which was released in March 2018.

GEOS

The Geometry Engine Open Source (GEOS) is the C/C++ port of a subset of the Java Topology Suite (JTS) and selected functions. GEOS aims to contain the complete functionality of JTS in C++. It can be compiled on many platforms, including Python. As you will see later on, the Shapely library uses functions from the GEOS library; in fact, there are many applications using GEOS, including PostGIS and QGIS. GeoDjango also uses GEOS, as well as GDAL, among other geospatial libraries. GEOS can also be compiled with GDAL, giving OGR all of its capabilities.

The JTS is an open source geospatial computational geometry library written in Java. It provides various functionalities, including a geometry model, geometric functions, spatial structures and algorithms, and I/O capabilities. Using GEOS, you have access to the following capabilities: geospatial functions (such as within and contains), geospatial operations (union, intersection, and many more), spatial indexing, Open Geospatial Consortium (OGC) well-known text (WKT) and well-known binary (WKB) input/output, the C and C++ APIs, and thread safety.

Shapely

Shapely is a Python package for the manipulation and analysis of planar features, using functions from the GEOS library (the engine of PostGIS) and a port of the JTS. Shapely is not concerned with data formats or coordinate systems but can be readily integrated with packages that are. Shapely only deals with analyzing geometries and offers no capabilities for reading and writing geospatial files. It was developed by Sean Gillies, who was also the person behind Fiona and Rasterio.

Shapely supports eight fundamental geometry types that are implemented as classes in the shapely.geometry module: points, multipoints, linestrings, multilinestrings, linearrings, multipolygons, polygons, and geometrycollections. Apart from representing these geometries, Shapely can be used to manipulate and analyze geometries through a number of methods and attributes.

Shapely offers largely the same classes and functions as OGR for dealing with geometries. The difference between Shapely and OGR is that Shapely has a more Pythonic and very intuitive interface, is better optimized, and has well-developed documentation. With Shapely, you're writing pure Python, whereas with GEOS, you're writing C++ in Python. For data munging, a term used for data management and analysis, you're better off writing in pure Python rather than C++, which explains why these libraries were created. For more information on Shapely, consult the documentation, which also has detailed information on installing Shapely for different platforms and on building Shapely from source for compatibility with other modules that depend on GEOS (installing Shapely may require you to upgrade NumPy and GEOS if these are already installed).
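A minimal sketch of that Pythonic interface (assuming Shapely is installed): constructing geometries and querying spatial relationships, with no file I/O involved.

```python
# A minimal sketch of Shapely's geometry classes and spatial predicates.
from shapely.geometry import Point, Polygon

square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
centre = Point(1, 1)

print(square.contains(centre))                       # True: a geospatial predicate
print(square.area, square.length)                    # 4.0 and 8.0
print(square.intersection(centre.buffer(0.5)).area)  # area of the overlap
```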
Fiona

Fiona is a Python API built on OGR. It can be used for reading and writing data formats. The main reason for using it instead of OGR directly is that it is closer to Python than OGR, as well as more dependable and less error-prone. It makes use of two representations, WKT and WKB, for spatial information in vector data. As such, it combines well with other Python libraries such as Shapely: you would use Fiona for input and output, and Shapely for creating and manipulating geospatial data.

While Fiona is Python compatible and our recommendation, users should also be aware of some of the trade-offs. It is more dependable than OGR because it uses Python objects for copying vector data instead of C pointers, but this also means that it uses more memory, which affects performance.

Python Shapefile Library (pyshp)

The Python Shapefile Library (pyshp) is a pure Python library used to read and write shapefiles. The pyshp library's sole purpose is to work with shapefiles, and it only uses the Python standard library. You cannot use it for geometric operations. If you're only working with shapefiles, this one-file-only library is simpler than using GDAL.

pyproj

pyproj is a Python package that performs cartographic transformations and geodetic computations. It is a Cython wrapper providing Python interfaces to PROJ.4 functions, meaning you can access an existing library of C code from Python. PROJ.4 is a projection library that transforms data among many coordinate systems and is also available through GDAL and OGR. The reason PROJ.4 is still popular and widely used is twofold:

- Firstly, because it supports so many different coordinate systems
- Secondly, because of the routes it provides to that functionality: Rasterio and GeoPandas, two Python libraries covered next, both use pyproj, and thus PROJ.4, under the hood

Using PROJ.4 directly, rather than through a package such as GDAL, lets you re-project individual points, something that packages wrapping PROJ.4 do not offer. The pyproj package offers two classes: the Proj class, which performs cartographic computations, and the Geod class, which performs geodetic computations.
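As a minimal sketch of those two classes (assuming pyproj is installed; the coordinates are arbitrary, and the init= keyword is the older pyproj calling convention, with newer versions also accepting Proj("epsg:3857")):

```python
# A minimal sketch of pyproj's Proj and Geod classes.
from pyproj import Proj, Geod

# Proj: project a longitude/latitude pair into Web Mercator (EPSG:3857).
merc = Proj(init="epsg:3857")      # older-style constructor; see note above
x, y = merc(-105.0, 40.0)          # lon, lat -> easting, northing in metres
print(x, y)

# Geod: geodetic distance between two points on the WGS84 ellipsoid.
geod = Geod(ellps="WGS84")
_, _, distance_m = geod.inv(-105.0, 40.0, -104.0, 41.0)
print(distance_m)                  # distance in metres
```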
Rasterio

Rasterio is a GDAL- and NumPy-based Python library for raster data, written with the Python developer in mind rather than C, using Python language types, protocols, and idioms. Rasterio aims to make GIS data more accessible to Python programmers and helps GIS analysts learn important Python standards. Rasterio relies on concepts of Python rather than GIS.

Rasterio is an open source project from the satellite team of Mapbox, a provider of custom online maps for websites and applications. The name of this library should be pronounced raster-i-o rather than ras-te-rio. Rasterio came into being as a result of a project called the Mapbox Cloudless Atlas, which aimed to create a pretty-looking basemap from satellite imagery. One of the software requirements was to use open source software and a high-level language with handy multi-dimensional array syntax. Although GDAL offers proven algorithms and drivers, developing with GDAL's Python bindings feels a lot like C++.

Therefore, Rasterio was designed to be a Python package at the top, with extension modules (using Cython) in the middle, and a GDAL shared library at the bottom. Other requirements for the raster library were the ability to read and write NumPy ndarrays to and from data files, and the use of Python types, protocols, and idioms instead of C or C++, to free programmers from having to code in two languages. For georeferencing, Rasterio follows the lead of pyproj. There are a couple of capabilities added on top of reading and writing, one of them being a features module. Reprojection of geospatial data can be done with the rasterio.warp module. Rasterio's project homepage can be found on GitHub.

GeoPandas

GeoPandas is a Python library for working with vector data. It is based on the pandas library that is part of the SciPy stack. SciPy is a popular library for data inspection and analysis, but unfortunately, it cannot read spatial data. GeoPandas was created to fill this gap, taking pandas data objects as a starting point. The library also adds functionality from geographical Python packages.

GeoPandas offers two data objects: a GeoSeries object, based on a pandas Series object, and a GeoDataFrame, based on a pandas DataFrame object but adding a geometry column for each row. Both GeoSeries and GeoDataFrame objects can be used for spatial data processing, similar to spatial databases. Read and write functionality is provided for almost every vector data format. Also, because both objects are subclasses of pandas data objects, you can use the same properties to select or subset data, for example .loc or .iloc.

GeoPandas is a library that employs the capabilities of newer tools, such as Jupyter Notebooks, pretty well. Whereas GDAL enables you to interact with individual data records inside vector and raster datasets through Python code, GeoPandas takes a more visual approach by loading all records into a GeoDataFrame so that you can see them all together on your screen. The same goes for plotting data. These functionalities were lacking in Python 2, when developers were dependent on IDEs without extensive data visualization capabilities, which are now available with Jupyter Notebooks.

We've provided an overview of the most important open source packages for processing and analyzing geospatial data. The question then becomes when to use a certain package and why. GDAL, OGR, and GEOS are indispensable for geospatial processing and analysis, but they were not written in Python, and so they require Python bindings for Python developers. Fiona, Shapely, and pyproj were written to solve these problems, as was the newer Rasterio library. For a more Pythonic approach, these newer packages are preferable to the older C++ packages with Python bindings (although they're used under the hood).

Now that you have an idea of what options are available for a certain use case and why one package is preferable over another, here's something you should always remember: as is often the way in programming, there may be multiple solutions for one particular problem. For example, when dealing with shapefiles, you could use pyshp, GDAL, Shapely, or GeoPandas, depending on your preference and the problem at hand.

Related reading:
- Introduction to Data Analysis and Libraries
- 15 Useful Python Libraries to make your Data Science tasks Easier
- "Pandas is an effective tool to explore and analyze data": An interview with Theodore Petrou
- Using R to implement Kriging - A Spatial Interpolation technique for Geostatistics data


Introducing Dask: The library that makes scalable analytics in Python easier

Amey Varangaonkar
22 May 2018
6 min read
Python's rise as the preferred language of choice in data science is unprecedented, but not really unexpected. Apart from being a general-purpose language that can be used for a variety of tasks, from scripting to networking, Python offers a rich suite of libraries for general data science tasks such as scientific computing, data visualization, and more. However, one big challenge faced by data scientists is that these packages are not designed for scale. This is crucial in today's Big Data era, where tons of data needs to be processed and analyzed on the go. A platform that supports the existing Python ecosystem and allows it to scale across multiple machines and clusters without affecting performance was conspicuously missing. Enter Dask.

What is Dask?

Dask is a flexible parallel computing library written in Python for analytics, designed mainly to offer scalability and enhanced power to existing packages and libraries. It allows users to integrate their existing Python-based projects written with popular libraries such as NumPy, SciPy, pandas, and more. (An architecture diagram accompanied the original article; image courtesy: Slideshare.) The two key components of Dask that interact with the Python libraries are:

- Dynamic task schedulers, which take care of the intensive computational workloads
- "Big Data" Dask collections, consisting of dataframes, parallel arrays, and interfaces that allow the computations to run on distributed environments (see the sketch after this list)
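A minimal sketch of the collections interface (assuming Dask is installed; the file pattern and column names are hypothetical) shows how closely it mirrors pandas:

```python
# A minimal sketch of a Dask dataframe: pandas-like syntax on top of a
# lazily built task graph that the scheduler executes on compute().
import dask.dataframe as dd

df = dd.read_csv("sales-2018-*.csv")        # hypothetical files, read lazily
monthly = df.groupby("month").amount.sum()  # no work has happened yet
print(monthly.compute())                    # the scheduler runs the task graph
```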
Why use Dask?

Given that there are already quite a few distributed platforms for large-scale data processing, such as Apache Spark, Apache Storm, and Flink, why and when should one go for Dask? What are the advantages offered by this Python library? Let us take a look at the four major reasons to prefer Dask for distributed, scalable analytics in Python:

- Easy to get started: If you are an existing Python user, you must have already worked with popular Python packages such as NumPy, SciPy, matplotlib, scikit-learn, and pandas. Dask offers a similar, intuitive interface, and since it is a part of the bigger Python ecosystem, getting started with Dask is very easy. It uses existing Python APIs to switch between the popular packages and their Dask equivalents, so you don't have to spend a lot of time porting code. For absolute beginners, using Dask for scalable analytics is an easy and logical option to pursue once they have grasped the fundamentals of Python and the associated libraries.

- Scales up and down quite easily: You can run your project on Dask on a single machine, or on a cluster with thousands of cores, without essentially affecting the speed and performance of your code. Dask uses the multi-core CPUs within a single system optimally to process hundreds of terabytes of data without the need for additional hardware. Similarly, for moderate to large datasets spanning 100+ gigabytes, which often don't fit into a single storage device, the computing power of clusters can be coupled with Dask for effective analytics.

- Supports complex applications: Many companies tend to tackle complex computations by introducing custom code that runs on popular Big Data tools such as Hadoop MapReduce and Apache Spark. However, with the help of Dask's dynamic task scheduler, it is now possible to run and process complex applications without introducing additional code. Dask is solely responsible for the smooth handling of tasks such as network communication, load balancing, and diagnostics, among others.

- Clear, responsive, real-time feedback: One of the most important features of Dask is its user-friendliness. Dask provides a real-time dashboard that highlights the key metrics of the processing task undertaken by the user, such as the current progress of your project, memory consumption, and more. It also offers an in-built IPython kernel that allows the user to investigate the ongoing computation from a terminal.

How Dask compares with Apache Spark

Apache Spark is one of the most popular and widely used Big Data tools for distributed data processing and analytics. Dask and Apache Spark have many features in common, prompting us and many other developers to ask the question: which tool is better? While Spark has been around for quite some time and has many standard, stable features built over years of development, Dask is quite new and is still being improved as a tool. We summarize the important differences between Dask and Apache Spark in the table below:

| Criteria | Apache Spark | Dask |
| --- | --- | --- |
| Primary language | Scala | Python |
| Scale | Supports a single node to thousands of nodes in a cluster | Supports a single node to thousands of nodes in a cluster |
| Ecosystem | All-in-one, self-sufficient ecosystem | Integration with popular libraries within the Python ecosystem |
| Flexibility | Low | High |
| Stream processing | Built-in module, Spark Streaming | Real-time interface that is fairly low-level and requires more work than Apache Spark |
| Graph processing | Possible with the GraphX module | Not possible |
| Machine learning | Uses the Spark MLlib module | Integrates with scikit-learn and XGBoost |
| Popularity | Very high; a commonly used tool in the Big Data ecosystem | Fairly new, but has already found its place in the pandas, scikit-learn, and Jupyter stack |

You can read a detailed comparison of Apache Spark and Dask on the official Dask documentation page.

What we can expect from Dask

As we saw from the comparison above, it is fairly easy to port an existing Python project that uses high-profile Python libraries such as NumPy, scikit-learn, and more. Python developers and data scientists will appreciate the high flexibility and complex computational capabilities offered by Dask. The limited stream processing and graph processing features are big areas for improvement, but we can expect developments in this domain in the near future.

Even though Dask is still relatively new, it looks very promising due to its close affinity with the Python ecosystem. With Python's clout rising, many people would prefer a Python-based data processing tool that works at scale without having to switch to an external Big Data framework. Dask may well be the superhero that comes to developers' rescue in such cases. You can learn more about the latest developments in Dask on the official GitHub page.

Related reading:
- Is Apache Spark today's Hadoop?
- Apache Spark 2.3 now has native Kubernetes support!
- Should you move to Python 3? 7 Python experts' opinions


Facebook’s Wit.ai: Why we need yet another chatbot development framework?

Sunith Shetty
21 May 2018
4 min read
Chatbots are remarkably changing the way customer service is provided in a variety of industries. For every organization, customer satisfaction plays a very important role: customers expect businesses to be reachable at any time and to respond to their queries 24x7. With growing artificial intelligence advances in smart devices and IoT, chatbots are becoming a necessity for communicating with customers in real time.

There are many existing vendors, such as Google, Microsoft, Amazon, and IBM, with the required models and services to build conversational interfaces for applications and devices. But the chatbot industry is evolving, and even minor improvements in the UI, in the algorithms that work behind the scenes, or in the data used for training can mean a major win. With complete backing by the Facebook team, we can expect Wit.ai to create new, simplified ways to ease speech recognition and voice interfaces for developers. Wit.ai has excellent support for NLP, making it one of the popular bot frameworks in the market. The key to chatbot success is continuous learning that lets bots leverage relevant data to connect with clearly defined customers; this is what makes Wit.ai extra special.

What is Wit.ai?

Wit.ai is an open and extensible NLP engine for developers, acquired by Facebook, which allows you to build conversational applications and devices that you can talk or text to. It provides an easy interface and quick-learning APIs to understand human communication from every interaction, and it helps parse complex messages (either voice or text) into structured data. It also helps you predict forthcoming events based on what it learns from the gathered data.

Why Wit.ai?

- It is one of the most powerful APIs for understanding natural language.
- It is a free SaaS platform that provides services for developers to build a chatbot for their app or device.
- It has story support, allowing you to visualize the user experience.
- A new built-in NLP integration with the Page inbox allows page admins to create a Wit app with ease. Further, by using anonymized samples from past messages, the bot provides automated responses to the most common requests.
- You can create efficient and powerful text- or voice-based conversational bots that humans can chat with. In addition to business bots, these APIs can be used to build hands-free voice interfaces for mobile phones, wearable devices, home automation products, and more.
- It can be used in platforms that learn new commands semantically similar to those input by the developer.
- It provides a developer GUI, which includes a visual representation of conversation flows, business logic invocations, context variables, jumps, and branching logic.
- Programming language and integration support: Node.js client, Python client, Ruby client, and an HTTP API (see the sketch after this list).
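As an illustration of the HTTP API route, here is a minimal sketch using Python's requests library; the token is a placeholder, and the exact response fields depend on how your Wit app has been trained:

```python
# A minimal sketch of querying the Wit.ai HTTP API: send a sentence,
# get back the parsed meaning (intents/entities) as JSON.
import requests

WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder, from your Wit app settings

response = requests.get(
    "https://api.wit.ai/message",
    params={"q": "Set an alarm tomorrow at 7am"},
    headers={"Authorization": f"Bearer {WIT_TOKEN}"},
)
print(response.json())  # structured data extracted from the message
```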
Challenges in Wit.ai

- Wit.ai doesn't support third-party integration tools.
- Wit.ai has no required slot/parameter feature, so you will have to invoke business logic every time there is an interaction with the user in order to gather any missing information not spoken by the user.
- Training the engine can take some time, depending on the task performed. As the number of stories increases, the Wit engine becomes slower.

However, existing Wit.ai adoption looks very promising, with more than 160,000 members in the community contributing on GitHub. For complete coverage of tutorials, documentation, and client APIs, you can visit the GitHub page to see a list of repositories.

Related reading:
- My friend, the robot: Artificial Intelligence needs Emotional Intelligence
- Snips open sources Snips NLU, its Natural Language Understanding engine
- What can Google Duplex do for businesses?


Top 5 tools for reinforcement learning

Pravin Dhandre
21 May 2018
4 min read
After deep learning, reinforcement learning (RL) is the hottest branch of artificial intelligence, and it is finding speedy adoption in tech-driven companies. Simply put, reinforcement learning is all about algorithms that track previous actions or behaviour and provide optimized decisions using the trial-and-error principle. Read How Reinforcement Learning works to know more.

It might sound theoretical, but gigantic firms like Google and Uber have tested out this exceptional mechanism and have been highly successful in cutting-edge applied robotics fields such as self-driving vehicles. Other top giants, including Amazon, Facebook, and Microsoft, have centralized their innovations around deep reinforcement learning across automotive, supply chain, networking, finance, and robotics. With such humongous achievements, reinforcement learning libraries have caught the AI developer community's eye and have gained prime interest for training agents and reinforcing the behavior of trained agents. In fact, researchers believe in the tremendous potential of reinforcement learning to address unsolved real-world challenges like material discovery, space exploration, and drug discovery, and to build much smarter artificial intelligence solutions. In this article, we will have a look at the most promising open source tools and libraries to start building your reinforcement learning projects on.

OpenAI Gym

OpenAI Gym, the most popular environment for developing and comparing reinforcement learning models, is completely compatible with high-computation libraries like TensorFlow. The Python-based, rich AI simulation environment offers support for training agents on classic games like Atari, as well as for other branches of science, like robotics and physics, through the Gazebo and MuJoCo simulators. The Gym environment also offers APIs that facilitate feeding observations, along with rewards, back to agents. OpenAI has also recently released a new platform, Gym Retro, made up of 58 varied and specific scenarios from the Sonic the Hedgehog, Sonic the Hedgehog 2, and Sonic 3 games. Reinforcement learning enthusiasts and AI game developers can register for this competition. Read: How to build a cartpole game using OpenAI Gym
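A minimal sketch of that observation/reward loop (assuming the gym package is installed; the agent here just samples random actions rather than learning):

```python
# A minimal sketch of the Gym interaction loop on the CartPole task.
import gym

env = gym.make("CartPole-v0")
observation = env.reset()
for step in range(200):
    action = env.action_space.sample()                  # random, not a learned policy
    observation, reward, done, info = env.step(action)  # environment feeds back a reward
    if done:                                            # pole fell or time limit hit
        observation = env.reset()
env.close()
```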
TensorFlow

TensorFlow is another well-known open source library, by Google, followed by more than 95,000 developers every day in areas of natural language processing, intelligent chatbots, robotics, and more. The TensorFlow community has developed an extended version called TensorLayer, providing popular RL modules that can be easily customized and assembled for tackling real-world machine learning challenges. The TensorFlow community allows for framework development in the most popular languages, such as Python, C, Java, JavaScript, and Go. Google and its TensorFlow team are in the process of coming up with a Swift-compatible version to enable machine learning on Apple platforms. Read How to implement Reinforcement Learning with TensorFlow.

Keras

Keras presents simplicity in implementing neural networks with just a few lines of code and faster execution. It provides senior developers and principal scientists with a high-level interface to TensorFlow, a heavy tensor computation framework, and centralizes on the model architecture. So, if you have any existing RL models written in TensorFlow, just pick the Keras framework and you can transfer the learning to the related machine learning problem.

DeepMind Lab

DeepMind Lab is a Google 3D platform with customization for agent-based AI research. It is utilized to understand how self-sufficient artificial agents learn complicated tasks in large, partially observed environments. With the victory of its AlphaGo program against professional Go players in early 2016, DeepMind captured the public's attention. With its three hubs spread across London, Canada, and France, the DeepMind team is focusing on core AI fundamentals, which includes building a single AI system backed by state-of-the-art methods and distributional reinforcement learning. To know more about how DeepMind Lab works, read How Google's DeepMind is creating images with artificial intelligence.

PyTorch

PyTorch, open sourced by Facebook, is another well-known deep learning library adopted by many reinforcement learning researchers. It was recently preferred almost unanimously by the top 10 finishers in a Kaggle competition. With dynamic neural networks and strong GPU acceleration, RL practitioners use it extensively to conduct experiments on implementing policy-based agents and to create new adventures. One notable research project is Playing GridWorld, where PyTorch demonstrated its capabilities with renowned RL algorithms like policy gradient and a simplified actor-critic method.

Summing It Up

There you have it: the top tools and libraries for reinforcement learning. The list doesn't end here, as there is a lot of work happening in developing platforms and libraries for scaling reinforcement learning. Frameworks like RL4J and RLlib are already in development and will very soon be fully available for developers to simulate their models in their preferred coding language.


30 common data science terms explained

Aarthi Kumaraswamy
16 May 2018
27 min read
Let's begin at the beginning. What do terms like statistical population, statistical comparison, and statistical inference mean? What good are munging, coding, boosting, regularization, and the rest? On a scale of 1 to 30 (1 being the lowest and 30 the highest), rate yourself as a data scientist. No matter what you have scored yourself, we hope to have improved that score at least a little by the end of this post.

Let's start with a basic question: What is data science?

[The following is an excerpt from the book Statistics for Data Science, written by James D. Miller and published by Packt Publishing.]

How data science is defined is a matter of opinion. I personally like the explanation that data science is a progression or, even better, an evolution of thought or steps. Although a progression or evolution implies a sequential journey, in practice this is an extremely fluid process; each of the phases may inspire the data scientist to reverse and repeat one or more of the phases until they are satisfied. In other words, all or some phases of the process may be repeated until the data scientist determines that the desired outcome is reached. Depending on your sources and individual beliefs, you may even say: statistics is data science, and data science is statistics.

Based upon personal experience, research, and various industry experts' advice, someone delving into the art of data science should take every opportunity to understand and gain experience as well as proficiency with the following list of common data science terms:

- Statistical population
- Probability
- False positives
- Statistical inference
- Regression
- Fitting
- Categorical data
- Classification
- Clustering
- Statistical comparison
- Coding
- Distributions
- Data mining
- Decision trees
- Machine learning
- Munging and wrangling
- Visualization
- D3
- Regularization
- Assessment
- Cross-validation
- Neural networks
- Boosting
- Lift
- Mode
- Outlier
- Predictive modeling
- Big data
- Confidence interval
- Writing

Statistical population

You can perhaps think of a statistical population as a recordset (or a set of records). This set or group of records will be of similar items or events that are of interest to the data scientist for some experiment. For a data developer, a population of data may be a recordset of all sales transactions for a month, where the interest might be reporting to the senior management of an organization which products are the fastest sellers at which time of the year. For a data scientist, a population may be a recordset of all emergency room admissions during a month, where the area of interest might be determining the statistical demographics of emergency room use.

Note: Typically, the terms statistical population and statistical model are, or can be, used interchangeably. Once again, data scientists continue to evolve in their alignment on the use of common terms.

Another key point concerning statistical populations is that the recordset may be a group of (actually) existing objects or a hypothetical group of objects. Using the preceding example, the actual objects would be the actual sales transactions recorded for the month, while the hypothetical objects would be the sales transactions expected, forecast, or presumed (based upon observations, experienced assumptions, or other logic) to occur during a month.
Finally, through the use of statistical inference, the data scientist can select a portion or subset of the recordset (or population) with the intention that it will represent the total population for a particular area of interest. This subset is known as a statistical sample. If a sample of a population is chosen accurately, the characteristics of the entire population (that the sample is drawn from) can be estimated from the corresponding characteristics of the sample.

Probability

"Probability is concerned with the laws governing random events." - www.britannica.com

When thinking of probability, you think of possible upcoming events and the likelihood of them actually occurring. This compares to a statistical thought process that involves analyzing the frequency of past events in an attempt to explain or make sense of the observations. In addition, the data scientist will associate various individual events, studying the relationships between them. How these different events relate to each other governs the methods and rules that need to be followed when studying their probabilities.

Note: A probability distribution is a table that is used to show the probabilities of various outcomes in a sample population or recordset.

False positives

The idea of false positives is a very important statistical (data science) concept. A false positive is a mistake or an errored result: a scenario where the results of a process or experiment indicate a fulfilled or true condition when, in fact, the condition is not true (not fulfilled). This situation is also referred to by some data scientists as a false alarm, and it is most easily understood by considering the idea of a recordset or statistical population (which we discussed earlier in this section) whose quality is determined not only by the accuracy of the processing but by the characteristics of the sampled population. In other words, either the data scientist has made errors during the statistical process, or the recordset is a population that does not have appropriate characteristics (or an appropriate sample) for what is being investigated.

Statistical inference

What developer, at some point in his or her career, hasn't had to create sample or test data? For example, I've often created a simple script to generate a random number (based upon the number of possible options or choices) and then used that number as the selected option (in my test recordset). This might work well for data development, but with statistics and data science, it is not sufficient.

To create sample data (or a sample population), the data scientist will use a process called statistical inference, which is the process of deducing properties of an underlying distribution through analysis of the data you have or are trying to generate. The process is sometimes called inferential statistical analysis and includes testing various hypotheses and deriving estimates. When the data scientist determines that a recordset (or population) should be larger than it actually is, it is assumed that the recordset is a sample from a larger population, and the data scientist will then utilize statistical inference to make up the difference.

Note: The data or recordset in use is referred to by the data scientist as the observed data. Inferential statistics can be contrasted with descriptive statistics, which is only concerned with the properties of the observed data and does not assume that the recordset came from a larger population.
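To make the sample-versus-population idea concrete, here is a minimal sketch (hypothetical numbers, using only NumPy) that estimates a population mean from a random sample:

```python
# A minimal sketch of statistical inference: estimate a population
# characteristic (the mean) from a randomly drawn sample.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50, scale=10, size=100_000)   # hypothetical recordset

sample = rng.choice(population, size=500, replace=False)  # the statistical sample
print(population.mean())  # true population mean (close to 50)
print(sample.mean())      # sample estimate, close if sampling is accurate
```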
Regression

Regression is a process or method (selected by the data scientist as the best-fit technique for the experiment at hand) used for determining the relationships among variables. If you're a programmer, you have a certain understanding of what a variable is, but in statistics, we use the term differently. Variables are determined to be either dependent or independent.

An independent variable (also known as a predictor) is one that is manipulated by the data scientist in an effort to determine its relationship with a dependent variable. A dependent variable is a variable that the data scientist is measuring.

Note: It is not uncommon to have more than one independent variable in a data science progression or experiment.

More precisely, regression is the process that helps the data scientist comprehend how the typical value of the dependent variable (or criterion variable) changes when any one or more of the independent variables is varied while the other independent variables are held fixed.
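A minimal sketch of this idea with scikit-learn (hypothetical data: one independent variable predicting a dependent one):

```python
# A minimal sketch of fitting a simple linear regression:
# how does the dependent variable change as the predictor varies?
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable (predictor)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # dependent variable (measured)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # estimated slope and intercept
print(model.predict([[6]]))               # predicted value for a new input
```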
Fitting

Fitting is the process of measuring how well a statistical model or process describes a data scientist's observations pertaining to a recordset or experiment. Such measures attempt to point out the discrepancy between observed values and probable values. The probable values of a model or process are known as a distribution or a probability distribution. A probability distribution fitting (or distribution fitting), therefore, is when the data scientist fits a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. The objective of a data scientist performing a distribution fitting is to predict the probability, or to forecast the frequency, of the occurrence of the phenomenon at a certain interval.

Note: One of the most common uses of fitting is to test whether two samples are drawn from identical distributions.

There are numerous probability distributions a data scientist can select from. Some will fit the observed frequency of the data better than others. The distribution giving a close fit is supposed to lead to good predictions; therefore, the data scientist needs to select a distribution that suits the data well.

Categorical data

Earlier, we explained how variables in your data can be either independent or dependent. Another type of variable definition is a categorical variable. This type of variable can take on one of a limited, and typically fixed, number of possible values, assigning each individual to a particular category.

Often, the collected data's meaning is unclear. Categorical data is a method that a data scientist can use to put meaning into the data. For example, if a numeric variable is collected (let's say the values found are 4, 10, and 12), the meaning of the variable becomes clear if the values are categorized. Suppose that, based upon an analysis of how the data was collected, we can group (or categorize) the data by indicating that it describes university students, with the following numbers of players:

- 4 tennis players
- 10 soccer players
- 12 football players

Now, because we grouped the data into categories, the meaning becomes clear. Some other examples of categorized data might be individual pet preferences (grouped by the type of pet) or vehicle ownership (grouped by the style of car owned), and so on. So, categorical data, as the name suggests, is data grouped into some sort of category or multiple categories. Some data scientists refer to categories as sub-populations of data.

Note: Categorical data can also be data that is collected as a yes or no answer. For example, hospital admittance data may indicate that patients either smoke or do not smoke.

Classification

Statistical classification of data is the process of identifying which category (discussed in the previous section) a data point, observation, or variable should be grouped into. The data science process that carries out a classification is known as a classifier. Read this post: Classification using Convolutional Neural Networks.

Note: Determining whether a book is fiction or non-fiction is a simple classification example. An analysis of data about restaurants might lead to classifying them among several genres.

Clustering

Clustering is the process of dividing data occurrences into groups, or homogeneous subsets, of the dataset: not a predetermined set of groups as in classification (described in the preceding section), but groups identified by the execution of the data science process based upon similarities that it found among the occurrences. Objects in the same group (a group is also referred to as a cluster) are found to be more analogous (in some sense or another) to each other than to objects found in other groups (or other clusters). The process of clustering is very common in exploratory data mining and is also a common technique for statistical data analysis.
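A minimal sketch of clustering with scikit-learn's k-means implementation (hypothetical two-dimensional points; no categories are given in advance):

```python
# A minimal sketch of clustering: k-means discovers the groups itself,
# unlike classification, where the categories are fixed beforehand.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],   # one blob of points
                   [8, 8], [8.3, 7.9], [7.8, 8.2]])  # another blob

kmeans = KMeans(n_clusters=2, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # discovered group centres
```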
Statistical comparison

Simply put, when you hear the term statistical comparison, one is usually referring to the act of a data scientist performing a process of analysis to view the similarities or variances of two or more groups or populations (or recordsets). As a data developer, one might be familiar with various utilities such as FC Compare, UltraCompare, or WinDiff, which aim to provide the developer with a line-by-line comparison of the contents of two or more (even binary) files. In statistics (data science), this process of comparing is a statistical technique for comparing populations or recordsets. In this method, a data scientist will conduct what is called an Analysis of Variance (ANOVA), compare categorical variables (within the recordsets), and so on.

Note: ANOVA is an assortment of statistical methods used to analyze the differences among group means and their associated procedures (such as variations among and between groups, populations, or recordsets). This method eventually evolved into the Six Sigma dataset comparisons.

Coding

Coding, or statistical coding, is again a process that a data scientist will use to prepare data for analysis. In this process, both quantitative data values (such as income or years of education) and qualitative data (such as race or gender) are categorized or coded in a consistent way. Coding is performed by a data scientist for various reasons, such as:

- It is more effective for running statistical models.
- Computers understand the variables.
- Accountability: the data scientist can run models blind, or without knowing what the variables stand for, to reduce programming/author bias.

You can imagine the process of coding as the means of transforming data into a form required for a system or application.

Distributions

The distribution of a statistical recordset (or of a population) is a visualization showing all the possible values (sometimes referred to as intervals) of the data and how often they occur. When a distribution of categorical data (which we defined earlier in this post) is created by a data scientist, it attempts to show the number or percentage of individuals in each group or category. Linking an earlier defined term with this one: a probability distribution, stated in simple terms, can be thought of as a visualization showing the probability of occurrence of the different possible outcomes in an experiment.

Data mining

With data mining, one is usually more absorbed in the data relationships (or the potential relationships between points of data, sometimes referred to as variables) and cognitive analysis. To further define this term, we can say that data mining is sometimes more simply referred to as knowledge discovery, or even just discovery: processing or analyzing data from new or different viewpoints and summarizing it into valuable insights that can be used to increase revenue, cut costs, or both.

Using software dedicated to data mining is just one of several analytical approaches. Although there are tools dedicated to this purpose (such as IBM Cognos BI and Planning Analytics, Tableau, SAS, and so on), data mining is all about the analysis process of finding correlations or patterns among dozens of fields in the data, and that can be effectively accomplished using tools such as MS Excel or any number of open source technologies.

Note: A common approach to data mining is the creation of custom scripts using tools such as R or Python. In this way, the data scientist has the ability to customize the logic and processing to their exact project needs.

Decision trees

A statistical decision tree uses a diagram that looks like a tree. This structure attempts to represent optional decision paths and a predicted outcome for each path selected. A data scientist will use a decision tree to support, track, and model decision making and its possible consequences, including chance event outcomes, resource costs, and utility. It is a common way to display the logic of a data science process.
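A minimal sketch of a decision tree in scikit-learn (hypothetical toy data), where the fitted tree encodes the decision paths just described:

```python
# A minimal sketch of a decision tree classifier: learn decision paths
# from labelled examples, then print the tree's branching logic.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [45, 1], [35, 1], [22, 0], [52, 1]]  # e.g. [age, owns_home]
y = [0, 1, 1, 0, 1]                                # hypothetical outcome labels

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "owns_home"]))
print(tree.predict([[30, 1]]))                     # follow the paths for a new case
```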
Machine learning

Machine learning is one of the most intriguing and exciting areas of data science. It conjures all manner of images around artificial intelligence, including neural networks, Support Vector Machines (SVMs), and so on. Fundamentally, we can describe machine learning as a method of training a computer to make or improve predictions or behaviors based on data or, specifically, on relationships within that data. Machine learning is a process by which predictions are made based upon recognized patterns identified within data; additionally, it is the ability to continuously learn from the data's patterns, thereby continually making better predictions.

It is not uncommon for someone to mistake machine learning for data mining, but data mining focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can be used to learn and establish baseline behavioral profiles for various entities and then to find meaningful anomalies. Here is the exciting part: the process of machine learning (using data relationships to make predictions) is known as predictive analytics. Predictive analytics allows data scientists to produce reliable, repeatable decisions and results, and to uncover hidden insights by learning from historical relationships and trends in the data.

Munging and wrangling

The terms munging and wrangling are buzzwords, or jargon, meant to describe one's efforts to affect the format of data, a recordset, or a file in some way, in an effort to prepare the data for further processing and/or evaluation. With data development, you are most likely familiar with the idea of Extract, Transform, and Load (ETL); in somewhat the same way, a data developer may mung or wrangle data during the transformation steps within an ETL process.

Common munging and wrangling may include removing punctuation or HTML tags, data parsing, filtering, all sorts of transforming, mapping, and tying together systems and interfaces that were not specifically designed to interoperate. Munging can also describe the processing or filtering of raw data into another form, allowing for more convenient consumption of the data elsewhere. Munging and wrangling might be performed multiple times within a data science process and/or at different steps in the evolving process. Sometimes, data scientists use munging to include various data visualization, data aggregation, and statistical model training tasks, as well as much other potential work. To this point, munging and wrangling may follow a flow beginning with extracting the data in a raw form, performing the munging using various logic, and, lastly, placing the resulting content into a structure for use.

Although there are many valid options for munging and wrangling data, for preprocessing and manipulation, a tool that is popular with many data scientists today is a product named Trifacta, which claims to be the number one (data) wrangling solution in many industries.

Note: Trifacta can be downloaded for your personal evaluation from https://www.trifacta.com/. Check it out!
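As a minimal sketch of everyday munging with pandas (hypothetical messy records), covering markup removal, parsing, and filtering:

```python
# A minimal sketch of data munging: strip markup, parse types, filter.
import pandas as pd

raw = pd.DataFrame({
    "name":   ["<b>Alice</b>", "Bob ", None],
    "income": ["52,000", "48,500", "61,000"],
})

clean = raw.dropna(subset=["name"]).copy()  # drop incomplete rows
clean["name"] = clean["name"].str.replace(r"<[^>]+>", "", regex=True).str.strip()
clean["income"] = clean["income"].str.replace(",", "").astype(int)  # parse numbers
print(clean[clean["income"] > 50_000])      # filter, ready for analysis
```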
Visualization

The main point (although there are other goals and objectives) of leveraging a data visualization technique is to make something complex appear simple. You can think of visualization as any technique for creating a graphic (or similar) to communicate a message. Other motives for using data visualization include the following:

To explain the data or put the data in context (for example, to highlight demographic statistics)
To solve a specific problem (for example, identifying problem areas within a particular business model)
To explore the data to reach a better understanding or add clarity (for example, what period of time does this data span?)
To highlight or illustrate otherwise invisible data (such as isolating outliers residing in the data)
To predict (such as potential sales volumes, perhaps based upon seasonal sales statistics)
And others

Statistical visualization is used in almost every step of the data science process--not only the obvious steps such as exploring and visualizing, analyzing and learning--but it can also be leveraged during collecting, processing, and the end game of using the identified insights.

D3

D3, or D3.js, is essentially an open source JavaScript library designed with the intention of visualizing data using today's web standards. D3 helps put life into your data, utilizing Scalable Vector Graphics (SVG), Canvas, and standard HTML. D3 combines powerful visualization and interaction techniques with a data-driven approach to DOM manipulation, providing data scientists with the full capabilities of modern browsers and the freedom to design the right visual interface that best depicts the objective or assumption. In contrast to many other libraries, D3.js allows inordinate control over the visualization of data. D3 is embedded within an HTML webpage and uses pre-built JavaScript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects, and so on.

Regularization

Regularization is one possible approach that a data scientist may use to improve the results generated from a statistical model or data science process, such as when addressing a case of overfitting in statistics and data science.

[box type="note" align="" class="" width=""]We defined fitting earlier (fitting describes how well a statistical model or process describes a data scientist's observations). Overfitting is a scenario where a statistical model or process seems to fit too well or appears to be too close to the actual data.[/box]

The path to overfitting often begins with an overly simple model--one where you may have only two variables and are drawing conclusions based on those two. For example, using our previously mentioned example of daffodil sales, one might generate a model with temperature as an independent variable and sales as a dependent one. You may see the model fail, since it is not as simple as concluding that warmer temperatures will always generate more sales. In this situation, there is a tendency to add more data to the process or model in hopes of achieving a better result. The idea sounds reasonable. For example, you have information such as average rainfall, pollen count, fertilizer sales, and so on; could these data points be added as explanatory variables?

[box type="note" align="" class="" width=""]An explanatory variable is a type of independent variable with a subtle difference. When a variable is independent, it is not affected at all by any other variables. When a variable isn't independent for certain, it's an explanatory variable.[/box]

Continuing to add more and more data to your model will have an effect, but it will probably cause overfitting, resulting in poor predictions, since the model will closely resemble the training data, much of which is just background noise. To overcome this situation, a data scientist can use regularization, introducing a tuning parameter (an additional factor such as a data point's mean value or a minimum or maximum limitation, which gives you the ability to change the complexity or smoothness of your model) into the data science process to solve an ill-posed problem or to prevent overfitting.
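Staying with the daffodil sales example, here is one possible sketch of regularization in Python, using ridge regression from scikit-learn; the library choice, the data, and the alpha value are our own assumptions, with alpha playing the role of the tuning parameter described above:

from sklearn.linear_model import LinearRegression, Ridge

# Invented data: [temperature, rainfall, pollen_count] -> daffodil sales
X = [[55, 0.2, 10], [60, 0.1, 30], [72, 0.0, 55], [48, 0.4, 5], [80, 0.0, 70]]
y = [120, 150, 210, 90, 240]

plain = LinearRegression().fit(X, y)      # may chase noise as variables grow
regularized = Ridge(alpha=1.0).fit(X, y)  # alpha is the tuning parameter

# Larger alpha shrinks coefficients toward zero, smoothing the model
print(plain.coef_, regularized.coef_)

Increasing alpha trades a little fit on the training data for a smoother, less noise-sensitive model.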
Assessment

When a data scientist evaluates a model or data science process for performance, this is referred to as assessment. Performance can be defined in several ways, including the model's growth of learning, the model's ability to improve with learning (to obtain a better score) given additional experience (for example, more rounds of training with additional samples of data), or the accuracy of its results.

One popular method of assessing a model's or process's performance is called bootstrap sampling. This method examines performance on certain subsets of data, repeatedly generating results that can be used to calculate an estimate of accuracy (performance). The bootstrap sampling method takes a random sample of data and splits it into three files--a training file, a testing file, and a validation file. The model or process logic is developed based on the data in the training file and then evaluated (or tested) using the testing file. This tune-and-then-test process is repeated until the data scientist is comfortable with the results of the tests. At that point, the model or process is tested once more, this time using the validation file, and the results should provide a true indication of how it will perform.

[box type="note" align="" class="" width=""]You can imagine using the bootstrap sampling method to develop program logic by analyzing test data to determine logic flows and then running (or testing) your logic against the test data file. Once you are satisfied that your logic handles all of the conditions and exceptions found in your testing data, you can run a final test on a new, never-before-seen data file for a final validation test.[/box]

Cross-validation

Cross-validation is a method for assessing the performance of a data science process. Mainly used with predictive modeling to estimate how accurately a model might perform in practice, cross-validation checks how a model will potentially generalize--in other words, how the model can apply what it infers from samples to an entire population (or recordset). With cross-validation, you identify a (known) dataset on which training is run, along with a dataset of unknown (or first-seen) data against which the model will be tested (this is known as your testing dataset). The objective is to ensure that problems such as overfitting (allowing non-inclusive information to influence results) are controlled, and also to provide insight into how the model will generalize to a real problem or a real data file.

The cross-validation process consists of separating the data into samples of similar subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Typically, a data scientist will use a model's stability to determine the actual number of rounds of cross-validation that should be performed.
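A quick sketch of the cross-validation rounds just described, again assuming scikit-learn and invented data (neither is prescribed by the chapter); cross_val_score handles the partitioning into folds and returns one score per round:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Invented data: two features per record, binary outcome
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5], [7, 8], [8, 7]]
y = [0, 0, 0, 1, 1, 1, 1, 0]

# 4 folds (rounds): train on three subsets, validate on the held-out one,
# then average the results to reduce variability
scores = cross_val_score(LogisticRegression(), X, y, cv=4)
print(scores, scores.mean())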
Neural networks

Neural networks are also called artificial neural networks (ANNs), and their objective is to solve problems in the same way that the human brain would. A Google search provides the following explanation of an ANN, as stated in Neural Network Primer: Part I, by Maureen Caudill, AI Expert, Feb. 1989:

[box type="note" align="" class="" width=""]A computing system made up of several simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.[/box]

To oversimplify the idea of neural networks, recall the concept of software encapsulation and consider a computer program with an input layer, a processing layer, and an output layer. With this thought in mind, understand that neural networks are also organized in a network of these layers, usually with more than a single processing layer. Patterns are presented to the network by way of the input layer, which then communicates with one (or more) of the processing layers (where the actual processing is done). The processing layers then link to an output layer, where the result is presented. Most neural networks also contain some form of learning rule that modifies the weights of the connections (in other words, the network learns which processing nodes perform better and gives them heavier weights) according to the input patterns it is presented with. In this way (in a sense), neural networks learn by example, as a child learns to recognize a cat by being exposed to examples of cats.

Boosting

In a manner of speaking, boosting is a process generally accepted in data science for improving the accuracy of a weakly learning data science process.

[box type="note" align="" class="" width=""]Data science processes defined as weak learners are those that produce results only slightly better than randomly guessing the outcome. Weak learners are basically thresholds or 1-level decision trees.[/box]

Specifically, boosting is aimed at reducing bias and variance in supervised learning. Before going further, let's take note of what we mean by bias and variance. Data scientists describe bias as a level of favoritism present in the data collection process, resulting in uneven, disingenuous results; it can occur in a variety of different ways. A sampling method is called biased if it systematically favors some outcomes over others. Variance may be defined (by a data scientist) simply as the distance from a variable's mean (or how far from the average a result is).

The boosting method can be described as a data scientist repeatedly running through a data science process (one that has been identified as a weak learning process), with each iteration running on different, random samples of data drawn from the original population recordset. All the results (or classifiers, or residue) produced by each run are then combined into a single merged result (that is, a gradient). This concept of using a random subset of the original recordset for each iteration originates from bootstrap sampling in bagging and has a similar variance-reducing effect on the combined model. In addition, some data scientists consider boosting a means to convert weak learners into strong ones; in fact, to some, the process of boosting simply means turning a weak learner into a strong learner.

Lift

In data science, the term lift compares the frequency of an observed pattern within a recordset or population with how frequently you might expect to see that same pattern occur in the data by chance or at random. If the lift is very low, a data scientist will typically expect that there is a very good probability that the identified pattern is occurring just by chance. The larger the lift, the more likely it is that the pattern is real.
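As a back-of-the-envelope illustration of lift (our own, with invented transaction data), the lift for two items A and B can be computed as P(A and B) divided by P(A) times P(B):

# Invented transaction data: does buying bread "lift" buying butter?
transactions = [
    {"bread", "butter"}, {"bread"}, {"butter", "milk"},
    {"bread", "butter", "milk"}, {"milk"}, {"bread", "butter"},
]

n = len(transactions)
p_bread = sum("bread" in t for t in transactions) / n
p_butter = sum("butter" in t for t in transactions) / n
p_both = sum({"bread", "butter"} <= t for t in transactions) / n

lift = p_both / (p_bread * p_butter)
print(lift)  # > 1 suggests the pattern occurs more often than chance alone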
Mode

In statistics and data science, when a data scientist uses the term mode, he or she refers to the value that occurs most often within a sample of data. The mode is not calculated by formula; it is determined manually or through processing of the data.

Outlier

Outliers can be defined as follows:

A data point that is way out of keeping with the others
That piece of data that doesn't fit
Either a very high value or a very low value
An unusual observation within the data
An observation point that is distant from all others

Predictive modeling

The development of statistical models and/or data science processes to predict future events is called predictive modeling.

Big Data

Again, we have several variations of the definition of big data. A large assemblage of data, datasets that are so large or complex that traditional data processing applications are inadequate, and data about every aspect of our lives have all been used to define or refer to big data. In 2001, then Gartner analyst Doug Laney introduced the 3Vs concept. The 3Vs, as per Laney, are volume, variety, and velocity, and they make up the dimensionality of big data: volume (the measurable amount of data), variety (the number of types of data), and velocity (the speed of processing or dealing with that data).

Confidence interval

The confidence interval is a range of values that a data scientist specifies around an estimate to indicate the margin of error, combined with a probability that a value will fall in that range. In other words, confidence intervals are good estimates of an unknown population parameter.

Writing

Although visualizations grab much more of the limelight when it comes to presenting the output or results of a data science process or predictive model, writing skills are not only an important part of how a data scientist communicates but are also considered an essential skill for all data scientists to be successful.

Did we miss any of your favorite terms? Now that you are at the end of this post, we ask you again: On a scale of 1 to 30 (1 being the lowest and 30, the highest), how do you rate yourself as a data scientist?

Why You Need to Know Statistics To Be a Good Data Scientist [interview]
How data scientists test hypotheses and probability
6 Key Areas to focus on while transitioning to a Data Scientist role
Soft skills every data scientist should teach their child
What can Google Duplex do for businesses?

Natasha Mathur
16 May 2018
9 min read
When talking about the capabilities of AI-driven digital assistants, the most talked-about issue is their inability to converse the way a real human does. The robotic tone of virtual assistants has long limited them from imitating real humans. And it's not just the flat monotone. It's about understanding the nuances of the language: pitch, intonation, sarcasm, and a lot more. Now, what if there emerged a technology capable of sounding and behaving almost human? Well, look no further - Google Duplex is here to dominate the world of digital assistants.

Google introduced the new Duplex at Google I/O 2018, their annual developer conference, last week. But what exactly is it? Google Duplex is a newly added feature to the famed Google Assistant. Adding to the capabilities of Google Assistant, it is able to make phone calls for users and imitate natural human conversation almost perfectly to get day-to-day tasks (such as booking table reservations, hair salon appointments, and so on) done in an easy manner. It includes pause-fillers and phrases such as "um", "uh-huh", and "erm" to make the conversation sound as natural as possible. Don't believe me? Check out the audio yourself!

[audio mp3="https://hub.packtpub.com/wp-content/uploads/2018/05/Google-Duplex-hair-salon.mp3"][/audio]  Google Duplex booking appointments at a hair salon

[audio mp3="https://hub.packtpub.com/wp-content/uploads/2018/05/Google-Duplex-table-reservation.mp3"][/audio]  Google Duplex making table reservations at a restaurant

The demo call recording of the assistant and the business employee, presented by Sundar Pichai, Google's CEO, during the opening keynote, befuddled the entire world about who's the assistant and who's the human, and the clip went noticeably viral. A lot of questions are buzzing around whether Google Duplex just passed the Turing Test. The Turing Test assesses a machine's ability to present intelligence closer or equivalent to that of a human being. Did the new human-sounding robot assistant pass the Turing Test yet? No, but it's certainly the voice AI that has come closest to passing it.

Now how does Google Duplex work?

It's quite simple. Google Duplex finds out the information you need that isn't out there on the internet by making a direct phone call. For instance, if a restaurant has shifted location and the new address is nowhere to be found online, Google Duplex will call the restaurant and check on the new address for you. The system comes with a self-monitoring capability that helps it recognize complex tasks it cannot accomplish on its own. Such cases are signaled to a human operator, who then takes care of the task.

To get a bit technical, Google Duplex makes use of Recurrent Neural Networks (RNNs), which are created using TensorFlow Extended (TFX), a machine learning platform. Duplex's RNNs are trained on phone conversation data using a data anonymization technique. Data anonymization helps protect the identity of a company or an individual by removing the data sets related to them. The output of Google's automatic speech recognition technology, the conversation history, and different parameters of the conversation are used by the network. The model also makes use of hyperparameter optimization from TFX, which further enhances it. But how does it sound natural?
Google uses concatenative text-to-speech (TTS) along with a synthesis TTS engine (using Tacotron and WaveNet) to control the intonation depending on the circumstances. Concatenative TTS is a technique that converts normal text into speech by concatenating, or linking together, recorded speech pieces. A synthesis TTS engine helps developers modify the speech rate, volume, and pitch of the synthesized output. Including speech disfluencies ("hmm"s, "erm"s, and "uh"s) makes Duplex sound more human. These speech disfluencies are added when very different sound units are combined in the concatenative TTS, or by adding synthetic waits. This allows the system to signal in a natural way that it is still processing (equivalent to what humans do when trying to sort out their thoughts). Also, the delay or latency should match people's expectations: Duplex is capable of figuring out when to give slow or fast responses, using low-confidence models or faster approximations. Google also found that including more latency helps make the conversation sound more natural.
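To make the disfluency idea concrete, here is a toy Python sketch of our own - emphatically not Google's implementation - that sprinkles pause-fillers into a scripted response before it would be handed off to a TTS engine:

import random

FILLERS = ["um,", "uh,", "erm,"]

def add_disfluencies(text: str, rate: float = 0.2, seed: int = 7) -> str:
    """Randomly insert pause-fillers between words at the given rate."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < rate:
            out.append(rng.choice(FILLERS))
        out.append(word)
    return " ".join(out)

print(add_disfluencies("Do you have a table for four at seven tonight?"))

A production system would of course place fillers based on acoustic and timing models rather than at random, but the sketch shows where disfluencies slot into the pipeline.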
Some potential applications of Google Duplex for businesses

Now that we've covered the what and how of this new technology, let's look at five potential applications of Google Duplex in the immediate future.

Customer Service

Basic forms of AI using natural language processing (NLP), such as chatbots and existing voice assistants like Siri and Alexa, are already in use within the customer care industry. Google Duplex paves the way for an even more interactive form of engaging customers and gaining information, given its spectacular human-sounding capability. According to Gartner, "By 2018, 30% of our interactions with technology will be through 'conversations' with smart machines". With Google Duplex being the latest smart machine introduced to the world, the basic operations of the customer service industry will become easier, more manageable and more efficient. From providing quick solutions to initial customer support problems to delivering internal services to employees, Google Duplex perfectly fills the bill. And it will only get better with further advances in NLP. So far, chatbots and digital assistants have been miserable at handling irate customers. I can imagine Google Duplex in John Legend's smooth voice calming down an angry customer, or even making successful sales pitches to potential leads with all its charm and suavity! Of course, Duplex must undergo the right customer management training, with a massive amount of quality data on what good and bad handling look like, before it is ready for such a challenge. Another area of customer service where Google Duplex can play a major role is IT support. Instead of connecting with a human operator, the user will first get connected to Google Duplex, making the entire experience friendly and personalized from the user's perspective while saving major costs for organizations.

HR Department

Google Duplex can also extend a helping hand in the HR department. The preliminary rounds of talent acquisition, where hiring executives make phone calls to their respective candidates, could be handled by Google Duplex, provided it gets the right training. Making note of basic qualifications and candidate details, and scheduling interviews, are all functions that Google Duplex should be able to perform effectively. The Google Assistant can collect the information, and further rounds can then be conducted by human HR personnel. This could greatly cut down on the time HR executives spend on the first few rounds of shortlisting, meaning they are free to focus their time on other strategically important areas of hiring.

Personal assistants and productivity

As presented at Google I/O 2018, Google Duplex is capable of booking appointments at hair salons, booking table reservations, and finding out holiday hours over the phone. It is therefore not a stretch to assume that it can also order takeaway food over a phone call, check with the delivery person regarding an order, cancel appointments, make business inquiries, and so on. Apart from that, it's a great aid for people with hearing loss, as well as people who do not speak the local language, by allowing them to carry out tasks over the phone.

Healthcare Industry

There is already plenty of talk surrounding the use of Alexa, Siri, and other voice assistants in healthcare, and Google Duplex is the newest addition to the family. With its natural way of conversing, Duplex can:

Let patients know their wait time for emergency rooms.
Check with the hospital regarding their health appointments.
Order the necessary equipment for hospital use.

Another allied area is elder care. Google Duplex could help reduce ailments related to loneliness by engaging with users at a more human level. It could also assist with preventive care and the management of lifestyle diseases such as diabetes by ensuring patients continue their medication intake and keep their appointments, providing emergency first aid help, calling 911, and so on.

Real Estate Industry

Duplex-enabled Google Assistants will help make realtors' tasks easier. Duplex can help call potential sellers and buyers, thereby making it easy for realtors to select the right customers. The conversation between Google Duplex (helping a realtor) and a customer wanting to buy a house could look something like this:

Google Duplex: Hi! I heard you are house hunting. Are you looking to buy or sell a property?
Customer: Hey, I'm looking to buy a home in the Washington area.
Google Duplex: That's great! What part of Washington are you looking at?
Customer: I'm looking for a house in Seattle. 3 bedrooms and 3 baths would be fine.
Google Duplex: Sure, umm, may I know your budget?
Customer: Somewhere between $749,000 to $850,000, is that fine?
Google Duplex: Ahh okay sure, I've made a note and I'll call you once I find the right matches.
Customer: Yeah, sure.
Google Duplex: Okay, thanks.
Customer: Thanks, bye!

Google Duplex then makes a note of the details on the realtor's phone, greatly reducing the effort realtors spend on cold calling potential sellers. At the same time, the broker will also receive an email with the consumer's details and contact information for a follow-up.

Every rose has its thorns. What's Duplex's thorny issue?

With all the good hype surrounding Google Duplex, there have also been controversies regarding its ethics. Some people have questions and mixed reactions about Google Duplex fooling people about its identity, as its voice differs significantly from that of a robot. A lot of talk surrounding this issue is trending on several Twitter threads. Google has hushed away these questions by saying that 'transparency in technology' is important and that it is 'designing this feature with disclosure built-in', which will help in identifying the system. Google has also said that it will take on board any feedback that people have regarding the new product.
Google successfully managed to awe people across the globe with the new and innovative Google Duplex. But there is still a long way to go, even though Google has already taken a step ahead in the effort to better human relationships with machines. If you enjoyed reading this article and want to know more, check out the official Google Duplex blog post.

Google's Android Things, developer preview 8: First look
Google News' AI revolution strikes balance between personalization and the bigger picture
Android P new features: artificial intelligence, digital wellbeing, and simplicity
6 reasons to choose MySQL 8 for designing database solutions

Amey Varangaonkar
08 May 2018
4 min read
Whether you are a standalone developer or an enterprise consultant, you would obviously choose a database that provides good benefits and results when compared to other related products. MySQL 8 provides numerous advantages as a first choice in this competitive market. It has various powerful features available that make it a comprehensive database. Today we will go through the benefits of using MySQL as the preferred database solution:

[box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator's Guide, co-authored by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. This book presents step-by-step techniques on managing, monitoring and securing the MySQL database without any hassle.[/box]

Security

The first thing that comes to mind is securing data, because nowadays data has become precious and can impact business continuity if legal obligations are not met; in fact, it can be so bad that it can close down your business in no time. MySQL is a secure and reliable database management system used by many well-known enterprises such as Facebook, Twitter, and Wikipedia. It provides a good security layer that protects sensitive information from intruders. MySQL provides access control management so that granting and revoking required access for a user is easy. Roles can also be defined with a list of permissions that can be granted to or revoked from a user, as sketched in the example below. All user passwords are stored in an encrypted format using plugin-specific algorithms.
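Here is a short, hypothetical illustration of that role-based model, driven from Python via the mysql-connector-python driver; the host, credentials, role, user, and schema names are all invented for the sketch:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="secret")
cur = conn.cursor()

# Define a role, attach permissions to it, then grant it to a user (MySQL 8+)
for stmt in (
    "CREATE ROLE IF NOT EXISTS 'app_readonly'",
    "GRANT SELECT ON appdb.* TO 'app_readonly'",
    "CREATE USER IF NOT EXISTS 'report_user'@'%' IDENTIFIED BY 'change-me'",
    "GRANT 'app_readonly' TO 'report_user'@'%'",
):
    cur.execute(stmt)

conn.commit()
cur.close()
conn.close()

Revoking the role from the user later takes a single REVOKE statement, which is exactly the convenience the role mechanism is for.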
Scalability

Day by day, the mountain of data is growing because of the extensive use of technology in numerous ways, and because of this, load averages are going through the roof. In some cases, it is impossible to predict that data will stay below some limit or that the number of users will not go out of bounds. A scalable database is the preferable solution, so that we can meet unexpected demands to scale at any point. MySQL is a rewarding database system for its scalability; it can scale horizontally and vertically, and in terms of data, spreading the database and the load of application queries across multiple MySQL servers is quite feasible. It is pretty easy to add horsepower to a MySQL cluster to handle the load.

An open source relational database management system

MySQL is an open source database management system, which makes debugging, upgrading, and enhancing its functionality fast and easy. You can view the source, make changes accordingly, and use it in your own way. You can also distribute an extended version of MySQL, but you will need a license for this.

High performance

MySQL gives high-speed transaction processing with optimal speed. It can cache results, which boosts read performance. Replication and clustering make the system scalable for more concurrency and manage heavy workloads. Database indexes also accelerate the performance of SELECT query statements on substantial amounts of data. To enhance performance further, MySQL 8 has included indexes in the performance schema to speed up data retrieval.

High availability

Today, in the world of competitive marketing, an organization's key requirement is to keep its systems up and running. Any failure or downtime directly impacts business and revenue; hence, high availability is a factor that cannot be overlooked. MySQL is quite reliable and offers constant availability using cluster and replication configurations. Cluster servers instantly handle failures and manage the failover part to keep your system available almost all the time. If one server goes down, requests are redirected to another node, which performs the requested operation.

Cross-platform capabilities

MySQL provides cross-platform flexibility and can run on various platforms such as Windows, Linux, Solaris, OS/2, and so on. It has great API support for all major languages, which makes it very easy to integrate with languages such as PHP, C++, Perl, Python, Java, and so on. It is also part of the Linux, Apache, MySQL, PHP (LAMP) stack that is used worldwide for web applications.

That's it then! We discussed a few important reasons why MySQL is among the most popular relational databases in the world and is widely adopted across many enterprises. If you want to learn more about MySQL's administrative features, make sure to check out the book MySQL 8 Administrator's Guide today!

12 most common MySQL errors you should be aware of
Top 10 MySQL 8 performance benchmarking aspects to know
2018 is the year of graph databases. Here's why.

Amey Varangaonkar
04 May 2018
5 min read
With the explosion of data, businesses are looking to innovate as they connect their operations to a whole host of different technologies. The need for consistency across all data elements is now stronger than ever. That's where graph databases come in handy. Because they allow for a high level of flexibility when it comes to representing your data, and also handle complex interactions between different elements well, graph databases are considered by many to be the next big trend in databases. In this article, we dive deep into the current graph database scene and list the 3 top reasons why graph databases will continue to soar in popularity in 2018.

What are graph databases, anyway?

Simply put, graph databases are databases that follow the graph model. What is a graph model, then? In mathematical terms, a graph is simply a collection of nodes, with different nodes connected by edges. In a graph database, each node holds some piece of information, while the edges denote the connections (relationships) between the nodes. How are graph databases different from relational databases, you might ask? Well, the key difference between the two is the fact that graph data models allow for more flexible and fine-grained relationships between data objects, as compared to relational models. There are some more differences between the graph data model and the relational data model, which you should read through for more information. Often, you will see that graph databases are without a schema. This allows for a very flexible data model, much like the document or key/value store database models. A unique feature of graph databases, however, is that they also support relationships between data objects like a relational database. This is useful because it allows for a more flexible and faster database, which can be invaluable to a project that demands a quicker response time.

Image courtesy DB-Engines

The rise in popularity of graph database models over the last 5 years has been stunning, but not exactly surprising. If we were to drill down into the 3 key factors that have propelled the popularity of graph databases to a whole new level, what would they be? Let's find out.

Major players entering the graph database market

About a decade ago, the graph database family included just Neo4j and a couple of other less popular graph databases. More recently, however, all the major players in the industry, such as Oracle (Oracle Spatial and Graph), Microsoft (Graph Engine), SAP (SAP HANA as a graph store) and IBM (Compose for JanusGraph), have come up with graph offerings of their own. The most recent entrant to the graph database market is Amazon, with Amazon Neptune announced just last year. According to Andy Jassy, CEO of Amazon Web Services, graph databases are becoming a part of the growing trend of multi-model databases. Per Jassy, these databases are finding increased adoption on the cloud as they support a myriad of useful data processing methods. The traditional over-reliance on relational databases is slowly breaking down, he says.

Rise of the Cypher Query Language

With graph databases slowly getting mainstream recognition and adoption, the major companies have identified the need for a standard query language for all graph databases. Similar to SQL, Cypher has emerged as a standard and is a widely adopted alternative for writing efficient and easy-to-understand graph queries. As of today, the Cypher Query Language is used in popular graph databases such as Neo4j, SAP HANA, RedisGraph and so on. The openCypher project, which develops and maintains Cypher, has also released Cypher for popular big data frameworks like Apache Spark. Cypher's popularity has risen tremendously over the last few years. The primary reason for this is the fact that, like SQL, Cypher's declarative nature allows users to state the actions they want performed on their graph data without explicitly specifying how to perform them.
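To give a flavor of that declarative style, here is a small sketch of our own, run from Python via the official Neo4j driver; the connection details and the social-graph schema (Person nodes, FRIENDS_WITH relationships) are invented for illustration:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Declarative: state the pattern you want, not how to traverse the graph
query = """
MATCH (p:Person {name: $name})-[:FRIENDS_WITH]->(friend:Person)
RETURN friend.name AS friend
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["friend"])

driver.close()

Note that the Cypher query describes the shape of the answer; the database engine decides how to walk the graph.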
Finding critical real-world applications

Graph databases were in the news as early as 2016, when the Panama Papers leaks were revealed with the help of Neo4j and Linkurious, a data visualization software. In more recent times, graph databases have also found increased application in online recommendation engines, as well as in tasks such as fraud detection and managing social media. Facebook's search app also uses graph technology to map social relationships. Graph databases are also finding applications in virtual assistants to drive conversations - eBay's virtual shopping assistant is an example. Even NASA uses a knowledge graph architecture to find critical data.

What next for graph databases?

With the growing adoption of graph databases, we expect graph-based platforms to soon become foundational elements of many corporate tech stacks. The next focus area for these databases will be practical implementations such as graph analytics and building graph-based applications. The rising number of graph databases will also mean more competition, and that is a good thing - competition will bring more innovation and enable the incorporation of more cutting-edge features. With a healthy and steadily growing community of developers, data scientists and even business analysts, this evolution may be on the cards sooner than we might expect.

Amazon Neptune: A graph database service for your applications
When, why and how to use Graph analytics for your big data
Why is data science important?

Richard Gall
24 Apr 2018
3 min read
Is data science important? It's a term that's talked about a lot but often misunderstood. Because it's a buzzword, it's easy to dismiss; but data science is important. Behind the term lies a very specific set of activities - and skills - that businesses can leverage to their advantage. Data science allows businesses to use the data at their disposal, whether that's customer data, financial data or otherwise, in an intelligent manner. Its results should be a key driver of growth. However, although it's not wrong to see data science as a real game changer for business, that doesn't mean it's easy to do well. In fact, it's pretty easy to do data science badly. A number of reports suggest that a large proportion of analytics projects fail to deliver results. That means a huge number of organizations are doing data science wrong. Key to these failures is a misunderstanding of how to properly utilize data science. You see it so many times - buzzwords like data science are often like hammers. They make all your problems look like nails. And not properly understanding the business problems you're trying to solve is where things go wrong.

What is data science?

But what is data science exactly? Quite simply, it's about using data to solve problems. The scope of these problems is huge. Here are a few ways data science can be used:

Improving customer retention by finding out what the triggers of churn might be
Improving internal product development processes by looking at the points where faults are most likely to happen
Targeting customers with the right sales messages at the right time
Informing product development by looking at how people use your products
Analyzing customer sentiment on social media
Financial modeling

As you can see, data science is a field that can impact every department. From marketing to product management to finance, data science isn't just a buzzword - it's a shift in mindset about how we work.

Data science is about solving business problems

To anyone still asking whether data science is important, the answer is actually quite straightforward. It's important because it solves business problems. Once you - and management - recognise that fact, you're on the right track. Too often, businesses want machine learning and big data projects without thinking about what they're really trying to do. If you want your data scientists to be successful, present them with the problems - and let them create the solutions. They won't want to be told to simply build a machine learning project. It's crucial to know what the end goal is. W. Edwards Deming is often credited with saying "in God we trust... everyone else must bring data". But data science didn't really exist then - if it did, the message could be much simpler: trust your data scientists.
Why is Hadoop dying?

Aaron Lazar
23 Apr 2018
5 min read
Hadoop has been the definitive big data platform for some time. The name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why is it? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful?

Hadoop was essential to the growth of big data

When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to data, as against bringing data to compute. Organisations had the opportunity to scale their data without having to worry too much about the cost. It obviously had initial hiccups with security, the complexity of querying and querying speeds, but most of that was taken care of in the long run. Querying speeds remained quite a pain, although that wasn't the real reason behind Hadoop's slow decline.

As cloud grew, Hadoop started falling

One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. The cloud vendor market was pretty crowded, and each vendor provided its own big data processing services. These services basically did what Hadoop was doing. But they also did it in an even more efficient and hassle-free way. Customers didn't have to think about administration, security or maintenance in the way they had to with Hadoop.

One person's big data is another person's small data

Well, this is clearly a fact. Several organisations that used big data technologies without really gauging the amount of data they actually needed to process have suffered. Imagine sitting with 10TB Hadoop clusters when you don't have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their massive bets on Hadoop. Customers weren't pleased with the nature of Hadoop's limitations.

Apache Hadoop v Apache Spark

Hadoop is way behind in terms of processing speed. In 2014, Spark took the world by storm. I'm going to let you guess which line in the graph above might be Hadoop, and which might be Spark. Spark was a general-purpose, easy-to-use platform that was built after studying the pitfalls of Hadoop. Spark was not bound to just HDFS (the Hadoop Distributed File System), which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes - a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas where Spark wins are iterative algorithms in machine learning, interactive data mining and data processing, stream processing, sensor data processing, and so on.

Machine learning in Hadoop is not straightforward

Unlike MLlib in Spark, machine learning is not possible in Hadoop unless tied to a third-party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of three R packages, have grown for ML, but they are still nowhere comparable to the power of the modern-day MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe.
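To illustrate the ease-of-use gap, here is a minimal PySpark word count - a handful of lines versus the boilerplate of a classic MapReduce job. The local-mode setup and file path are our own assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

# Classic word count: a few lines in Spark versus a full MapReduce job
counts = (
    spark.sparkContext.textFile("data/sample.txt")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(counts.take(10))

spark.stop()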
Hadoop is not only Hadoop

The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, HBase, Spark, Kudu, Impala, and 20 other products. While folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other. As for Cloudera and Hortonworks, though the market has seen a downward trend, they're in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark.

Is Hadoop dying? Perhaps not...

In the long run, it's not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much improved version of the framework. Some of the most noteworthy features are its improved shell script, a more powerful YARN, improved fault tolerance with erasure coding, and more. Although that hasn't caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use another alternative like Spark along with another framework from the Hadoop family. So, Hadoop's not going away anytime soon.

Read More: Pandas is an effective tool to explore and analyze data - Interview insights
What is AIOps and why is it going to be important?

Aaron Lazar
19 Apr 2018
4 min read
Woah, woah, woah! Wait a minute! First there was that game Spec Ops that I usually sucked at, then there came ITOps and DevOps that took the world by storm, and now there's another something-Ops?? Well, believe it or not, there is, and they're calling it AIOps.

What does AIOps stand for?

AIOps basically means Artificial Intelligence for IT Operations. It means IT operations are enhanced by using analytics and machine learning to analyze the data that's collected from various IT operations tools and devices. This helps in spotting and reacting to issues in real time. Coined by Gartner, the term has grown in popularity over the past year. Gartner believes that AIOps will be a major transformation for ITOps professionals, mainly due to the fact that traditional IT operations cannot cope with the modern digital transformation.

Why is AIOps important?

With the massive and rapid shift towards cloud adoption, automation and continuous improvement, AIOps is here to take care of the new entrants into the digital ecosystem - machine agents, artificial intelligence, IoT devices, and so on. These new entrants are impossible for humans to service and maintain alone; with billions of devices connected together, the only way forward is to employ algorithms that tackle known problems. Some of the solutions it provides are maintaining high availability and monitoring performance, event correlation and analysis, automation, and IT service management.

How does AIOps work?

As depicted in Gartner's diagram, there are two primary components to AIOps:

Big Data
Machine Learning

Data is gathered from the enterprise. You then implement a comprehensive analytics and machine learning strategy alongside the combined IT data (monitoring data + job logs + tickets + incident logs). The processed data yields continuous insights, continuous improvements and fixes. AIOps bridges three different IT disciplines to accomplish its goals:

Service management
Performance management
Automation

To put it simply, it is a strategic focus. It argues for a new approach in a world where big data and machine learning have changed everything.

How to move from ITOps to AIOps

Machine Learning

Most of AIOps will involve supervised learning, and professionals will need a good understanding of the underlying algorithms. Now don't get me wrong - they don't need to be full-blown data scientists to build the system, but they do need sufficient knowledge to be able to train the system to pick up anomalies. Auditing these systems to ensure they're performing the tasks as per the initial vision is necessary, and this will go hand in hand with scripting them.

Understanding modern application technologies

With the rise of Agile software development and other modern methodologies, AIOps professionals are expected to know all about microservices, APIs, CI/CD, containers, and the like. With the giant leaps that cloud development is taking, they are also expected to gain visibility into cloud deployments, with an emphasis on cost and performance.

Security

Security is critical. For example, it's important for personnel to understand how to handle a denial-of-service attack or a ransomware attack, like the ones we've seen in the recent past. Training machines to detect or predict such events is pertinent to AIOps.
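As a toy example of training a system to pick up anomalies - our own sketch, assuming scikit-learn's IsolationForest and invented response-time metrics rather than any particular AIOps product:

from sklearn.ensemble import IsolationForest

# Invented metrics: [response_ms, error_rate] sampled from IT monitoring
normal = [[120, 0.01], [130, 0.02], [115, 0.01], [125, 0.015], [118, 0.012]]
incoming = [[122, 0.011], [950, 0.30]]  # the second sample looks like an incident

detector = IsolationForest(contamination=0.1, random_state=0).fit(normal)
print(detector.predict(incoming))  # 1 = normal, -1 = anomaly

A real deployment would train on far more data and feed the -1 verdicts into event correlation and ticketing, but the shape of the workflow is the same.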
The key tools in AIOps

There are a wide variety of AIOps platforms available in the market that bring AI and intelligence to IT operations. One of the most noteworthy ones is Splunk, which has recently incorporated AI for intelligence-driven operations. Another is the Moogsoft AIOps platform, which is quite similar to Splunk. BMC has also entered the fray, launching TrueSight 11, their AIOps platform that promises to address use cases that improve performance and capacity management, the service desk, and application development disciplines. Gartner has a handy list of top platforms. If you're planning the transition from ITOps, do check out the list. Companies like Frankfurt Cargo Services and RevTrak have already added AI to their Ops.

So, are you going to make the transition? According to Gartner, 40% of large enterprises will have made the transition to AIOps by 2022. If you're one of them, I recommend you do it for the right reasons, but don't do it overnight. The transition needs to be gradual and well planned. The first thing you need to do is get your enterprise data together. If you don't have sufficient data that's worthy of analysis, AIOps isn't going to help you much.

Read more: Bridging the gap between data science and DevOps with DataOps.
How machine learning as a service is transforming cloud

Vijin Boricha
18 Apr 2018
4 min read
Machine learning as a service (MLaaS) is an innovation that is growing out of two of the most important tech trends - cloud and machine learning. It's significant because it enhances both. It makes cloud an even more compelling proposition for businesses. That's because cloud typically has three major operations: computing, networking and storage. When you bring machine learning into the picture, the data that cloud stores and processes can be used in radically different ways, solving a range of business problems.

What is machine learning as a service?

Cloud platforms have always competed to be the first or the best to provide new services. This includes platform as a service (PaaS) solutions, infrastructure as a service (IaaS) solutions and software as a service (SaaS) solutions. In essence, cloud providers like AWS and Azure provide sets of software that do different things so their customers don't have to. Machine learning as a service is simply another instance of the services offered by cloud providers. It could include a wide range of features, from data visualization to predictive analytics and natural language processing. It makes running machine learning models easy, effectively automating some of the work that might typically be done manually by a data engineering team. Here are the biggest cloud providers who offer machine learning as a service:

Google Cloud Platform
Amazon Web Services
Microsoft Azure
IBM Cloud

Every platform provides a different suite of services and features. Which one you choose will ultimately depend on what's most important to you. Let's take a look now at the key differences between these cloud providers' machine learning as a service offerings.

Comparing the leading MLaaS products

Google Cloud AI

Google Cloud Platform has always provided its own services to help businesses grow. It provides modern machine learning services with pre-trained models and a service to generate your own tailored models. The majority of Google applications, like Photos (image search), the Google app (voice search), and Inbox (Smart Reply), have been built using the same services that Google provides to its users.

Pros:
Cheaper in comparison to other cloud providers
Provides IaaS and PaaS solutions

Cons:
The Google Prediction API is going to be discontinued (May 1st, 2018)
Lacks a visual interface
You'll need to know TensorFlow

Amazon Machine Learning

Amazon Machine Learning provides services for building ML models and generating predictions, which help users develop robust, scalable, and cost-effective smart applications. With the help of Amazon Machine Learning, you are able to use powerful machine learning technology without any prior experience in machine learning algorithms and techniques.

Pros:
Provides versatile automated solutions
It's accessible - users don't need to be machine learning experts

Cons:
The more you use, the more expensive it is

Azure Machine Learning Studio

Microsoft Azure provides you with Machine Learning Studio - a simple browser-based, drag-and-drop environment which functions without any kind of coding. You are provided with fully-managed cloud services that enable you to easily build, deploy and share predictive analytics solutions. Here you are also provided with a platform (Gallery) to share machine learning solutions and contribute to the community.

Pros:
Consists of the most versatile toolset for MLaaS
You can contribute to and reuse machine learning solutions from the community

Cons:
Comparatively expensive
A lot of manual work is required

Watson Machine Learning

Similar to the above platforms, IBM Watson Machine Learning is a service that helps users create, train, and deploy self-learning models to integrate predictive capabilities within their applications. The platform provides automated and collaborative workflows to grow intelligent business applications.

Pros:
Automated workflows
Data science skills are not necessary

Cons:
Comparatively limited APIs and services
Lacks streaming analytics

Selecting the machine learning as a service solution that's right for you

There are so many machine learning as a service solutions out there that it's easy to get confused. The crucial step to take before you make a decision to purchase anything is to plan your business requirements. Think carefully not only about what you want to achieve, but about what you already do too. You want your MLaaS solution to integrate easily into the way you currently work. You also don't want it to replicate any work you're currently doing that you're pretty happy with. It gets repeated so much, but it remains as true as it has ever been - make sure your software decisions are fully aligned with your business needs. It's easy to be seduced by the promise of innovative new tools, but without the right alignment they're not going to help you at all.
IBM Think 2018: 6 key takeaways for developers

Amey Varangaonkar
17 Apr 2018
5 min read
This year, IBM Think 2018 was hosted in Las Vegas from March 20 to 22. It was one of the most anticipated IBM events of 2018, with over 40,000 developers as well as technology and business leaders in attendance. Considered IBM's flagship conference, Think 2018 combined previous conferences such as IBM InterConnect and World of Watson.

IBM Think 2018: Key Takeaways

IBM Watson Studio announced - a platform where data professionals in different roles can come together and build end-to-end Artificial Intelligence workflows
Integration of IBM Watson with Apple's Core ML, for incorporating custom machine learning models into iOS apps
IBM Blockchain Platform announced, for Blockchain developers to build enterprise-grade decentralized applications
Deep Learning as a Service announced as a part of the Watson Studio, allowing you to train deep learning models more efficiently
Fabric for Deep Learning open-sourced, so that you can use the open source deep learning framework to train your models and then integrate them with the Watson Studio
Neural Network Modeler announced for Watson Studio, a GUI tool to design neural networks efficiently, without a lot of manual coding
IBM Watson Assistant announced, an AI-powered digital assistant, for automotive vehicles and hospitality

Here are some of the announcements and key takeaways which have excited us, as well as developers all around the world!

IBM Watson Studio announced

One of the biggest announcements of the event was IBM Watson Studio - a premier tool that brings together data scientists, developers and data engineers to collaborate on, build, and deploy end-to-end data workflows. Right from accessing your data source to deploying accurate, high-performance models, this platform does it all. It is just what enterprises need today to leverage Artificial Intelligence in order to accelerate research and get intuitive insights from their data. IBM Watson Studio's Lead Product Manager, Armand Ruiz, gives a sneak peek into what we can expect from Watson Studio.

Collaboration with Apple Core ML

IBM took its relationship with Apple to another level by announcing a collaboration to develop smarter iOS applications. IBM Watson's Visual Recognition Service can be used to train custom Core ML machine learning models, which can be used directly by iOS apps. The latest announcement at IBM Think 2018 comes as no surprise to us, considering IBM had released new developer tools for enterprise development using the Swift language.

IBM Watson Assistant announced

IBM Think 2018 also announced the evolution of Watson Conversation into Watson Assistant, introducing new features and capabilities to deliver a more engaging and personalized customer experience. With this, IBM plans to take the concept of AI assistants for businesses to a new level. Currently in the beta program, there are two domain-specific solutions available for use on top of Watson Assistant - namely Watson Assistant for Automotive and Watson Assistant for Hospitality.

IBM Blockchain Platform

Per Juniper Research, more than half of the world's big corporations are considering adopting or are already in the process of adopting Blockchain technology. This presents a serious opportunity for a developer-centric platform that can be used to build custom decentralized networks. IBM, unsurprisingly, has identified this opportunity and come up with a Blockchain development platform of its own - the IBM Blockchain Platform. Recently launched as a beta, this platform offers a pay-as-you-use option for Blockchain developers to develop their own enterprise-grade Blockchain solutions without any hassle.

Deep Learning as a Service

Training a deep learning model is quite tricky, as it requires you to design the right kind of neural network along with having the right hyperparameters. This is a significant pain point for data scientists and machine learning engineers. To tackle this problem, IBM announced the release of Deep Learning as a Service as part of the Watson Studio. It includes the Neural Network Modeler (explained in detail below) to simplify the process of designing and training neural networks. Alternatively, using this service, you can leverage popular deep learning libraries and frameworks such as PyTorch, TensorFlow, Caffe, and Keras to train your neural networks manually. In the process, IBM also open sourced the core functionalities of Deep Learning as a Service as a separate project - namely Fabric for Deep Learning. This allows models to be trained using different open source frameworks on Kubernetes containers, and also to make use of GPUs' processing power. These models can then eventually be integrated with the Watson Studio.

Accelerating deep learning with the Neural Network Modeler

In a bid to reduce the complexity and manual work that go into designing and training neural networks, IBM introduced a beta release of the Neural Network Modeler within the Watson Studio. This new feature allows you to design and model standardized neural network models without going into a lot of technical detail, thanks to its intuitive GUI. With this announcement, IBM aims to accelerate the overall process of deep learning, so that data scientists and machine learning developers can focus more on thinking than on the operational side of things.

At Think 2018, we also saw the IBM Research team present their annual '5 in 5' predictions. This session highlighted 5 key innovations that are currently in research and are expected to change our lives in the near future. With these announcements, it's quite clear that IBM is well in sync with the two hottest trends in the tech space today - namely Artificial Intelligence and Blockchain. IBM seems to be taking every possible step to ensure it is right up there as the preferred choice of tool for data scientists and machine learning developers. We only expect the aforementioned services to get better and see more mainstream adoption with time, as most of them are currently in the beta stage. Not just that, there's scope for more improvements and the addition of newer functionalities as these platforms develop. What did you think of these announcements by IBM? Do let us know!
Accelerating deep learning with the Neural Network Modeler

In a bid to reduce the complexity and manual work that go into designing and training neural networks, IBM introduced a beta release of the Neural Network Modeler within the Watson Studio. This new feature lets you design and model standardized neural networks without going into a lot of technical detail, thanks to its intuitive GUI. With this announcement, IBM aims to accelerate the overall process of deep learning, so that data scientists and machine learning developers can focus on thinking rather than the operational side of things.

At Think 2018, we also saw the IBM Research team present their annual '5 in 5' predictions. This session highlighted five key innovations currently in research that are expected to change our lives in the near future.

With these announcements, it's clear that IBM is well in sync with the two hottest trends in the tech space today - Artificial Intelligence and Blockchain. The company seems to be taking every possible step to be the preferred choice of tool for data scientists and machine learning developers. Since most of these services are still in beta, we expect them to improve and gain more mainstream adoption over time, with scope for new functionality as the platforms mature. What did you think of these announcements by IBM? Do let us know!
What your organisation needs to know about GDPR

Aaron Lazar
16 Apr 2018
5 min read
GDPR is an acronym that has been doing the rounds for a couple of years now. It has become even more visible in the last few weeks, thanks to the Facebook and Cambridge Analytica data hijacking scandal. And with the deadline looming - 25 May 2018 - every organisation on the planet needs to make sure they're on top of things. But what is GDPR exactly? And how is it going to affect you?

What is GDPR?

Before April 2016, a data protection directive enforced in 1995 was in place. It governed all organisations that collected, stored and processed data. That directive became outdated amid rapidly evolving technology, which meant a revised regulation was needed. In April 2016, the European Union drew up the General Data Protection Regulation, created specifically to protect the personal data and privacy of European citizens. It's important to note that the regulation doesn't just apply to EU organisations - it applies to anyone who deals with data on EU citizens.

A relatively new genre of crime involving stealing data has cropped up over the past decade. Data is so powerful that its misuse could be devastating, possibly even resulting in another world war. GDPR aims to set a new benchmark for the protection of consumer data rights by making organisations more accountable. Governed by GDPR, organisations will now be responsible for guarding every quantum of information that is connected to an individual, including IP addresses and web cookies!

Read more: Why GDPR is good for everyone.

Why should organisations bother with GDPR?

In December 2017, RSA, the security company named after one of the first public-key cryptosystems, surveyed 7,500 customers in France, Italy, Germany, the UK and the US, and the results were interesting. When asked what their main concern was, customers responded that lost passwords, banking information, passports and other important documents worried them most. More interestingly, over 60% of the respondents said that in the event of a breach, they would blame the organisation that lost their data rather than the hacker. If you work for or own a company that deals with the data of EU citizens, you'll probably have GDPR on your radar. If you don't comply, you'll face a hefty fine - more on that below.

What kind of data are we talking about?

GDPR aims to protect data related to identity, such as name, physical address and sexual orientation. It also covers ID numbers; IP addresses, cookies and RFID tags; genetic and health-related data; biometric data like fingerprints and retina scans; racial or ethnic data; and political opinions.

Who must comply with GDPR?

You'll be governed by GDPR if:

- You're a company located in the EU
- You're not located in the EU but you still process data of EU citizens
- You have more than 250 employees
- You have fewer than 250 employees but process data that could impact the rights and freedoms of EU citizens

When does GDPR come into force?

In case you missed it in the first paragraph, GDPR comes into effect on 25 May 2018. If you're not ready yet, now is the time to scramble to get things right and make sure you comply with GDPR regulations.

What if you don't make the date?

Unlike an invitation to a birthday party, missing the GDPR deadline is likely to cost you a fine of up to €20 million or 4% of your company's worldwide annual turnover, whichever is higher. A lower tier of fines - up to €10 million or 2% of worldwide annual turnover, whichever is higher - applies to failures such as not reporting a data breach, not incorporating privacy by design, and not ensuring that data protection is applied at the initial stage of a project. It also covers the failure to appoint a Data Protection Officer or Chief Data Officer with professional experience and knowledge of data protection law proportionate to the processing the organisation carries out.
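To see how the two tiers compare in practice, here is a toy calculation - the turnover figure is hypothetical and this is not legal advice - showing that for any sizeable company the percentage component dominates the fixed amount:

```python
# Toy illustration of the GDPR maximum fine tiers: each tier is the
# greater of a fixed amount and a percentage of worldwide annual turnover.
# The turnover figure below is hypothetical.
def max_fine(turnover_eur, fixed_eur, pct):
    """Return the larger of the fixed amount and pct% of turnover."""
    return max(fixed_eur, turnover_eur * pct / 100)

turnover = 2_000_000_000  # a hypothetical €2bn worldwide annual turnover

upper_tier = max_fine(turnover, 20_000_000, 4)  # €80m, since 4% > €20m
lower_tier = max_fine(turnover, 10_000_000, 2)  # €40m, since 2% > €10m
print(f"Upper tier: €{upper_tier:,.0f}; lower tier: €{lower_tier:,.0f}")
```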
If it makes you feel any better, you're not the only one: a report from Ovum states that more than 50% of companies feel they're likely to be fined for non-compliance.

How do you prepare for GDPR?

Here are a few honest steps you can take to work towards compliance:

- Prepare to shell out between $1 million and $10 million to meet GDPR requirements
- Hire a DPO or a CDO capable of handling all your data policies and migration
- Fully understand GDPR and its requirements
- Perform a risk assessment: understand what kind of data you store and what implications it might have
- Strategize to mitigate that risk
- Review or create your data protection plan
- Plan for a 72-hour incident response process, since breaches must be reported within 72 hours
- Implement internal plans and policies, and ensure employees follow them

For the third time, then: time is running out! It's imperative that you ensure your organisation complies with GDPR before 25 May 2018. We'll follow up with more thoughts to help you make the shift, as well as more insight into this game-changing regulation. If you own or are part of an organisation that has migrated to comply with GDPR, please share some tips in the comments section below to help others still in the midst of the transition.