Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

Tech Guides - Data Analysis

34 Articles
article-image-9-data-science-myths-debunked
Amey Varangaonkar
03 Jul 2018
9 min read
Save for later

9 Data Science Myths Debunked

Amey Varangaonkar
03 Jul 2018
9 min read
The benefits of data science are evident for all to see. Not only does it equip you with the tools and techniques to make better business decisions, the predictive power of analytics also allows you to determine future outcomes - something that can prove to be crucial to businesses. Despite all these advantages, data science is a touchy topic for many businesses. It’s worth looking at some glaring stats that show why businesses are reluctant to adopt data science: Poor data across businesses and organizations - in both private and government costs the U.S economy close to $3 Trillion per year. Only 29% enterprises are able to properly leverage the power of Big Data and derive useful business value from it. These stats show a general lack of awareness or knowledge when it comes to data science. It could be due to some preconceived notions, or simply lack of knowledge and its application that seems to be a huge hurdle to these companies. In this article, we attempt to take down some of these notions and give a much clearer picture of what data science really is. Here are 5 of the most common myths or misconceptions in data science, and why are absolutely wrong: Data Science is just a fad, it won’t last long This is probably the most common misconception. Many tend to forget that although ‘data science’ is a recently coined term, this field of study is a cumulation of decades of research and innovation in statistical methodologies and tools. It has been in use since the 1960s or even before - just that the scale at which it was being used then was small. Back in the day, there were no ‘data scientists’, but just statisticians and economists who used the now unknown terms such as ‘data fishing’ or ‘data dredging’. Even the terms ‘data analysis’ and ‘data mining’ only went mainstream in the 1990s, but they were in use way before that period. Data Science’s rise to fame has coincided with the exponential rise in the amount of data being generated every minute. The need to understand this information and make positive use of it led to an increase in the demand for data science. Now with Big Data and Internet of Things going wild, the rate of data generation and the subsequent need for its analysis will only increase. So if you think data science is a fad that will go away soon, think again. Data Science and Business Intelligence are the same Those who are unfamiliar with what data science and Business Intelligence actually entail often get confused, and think they’re one and the same. No, they’re not. Business Intelligence is an umbrella term for the tools and techniques that give answers to the operational and contextual aspects of your business or organization. Data science, on the other hand has more to do with collecting information in order to build patterns and insights. Learning about your customers or your audience is Business Intelligence. Understanding why something happened, or whether it will happen again, is data science. If you want to gauge how changing a certain process will affect your business, data science - not Business Intelligence - is what will help you. Data Science is only meant for large organizations with large resources Many businesses and entrepreneurs are wrongly of the opinion that data science is - or works best - only for large organizations. It is a wrongly perceived notion that you need sophisticated infrastructure to process and get the most value out of your data. In reality, all you need is a bunch of smart people who know how to get the best value of the available data. When it comes to taking a data-driven approach, there’s no need to invest a fortune in setting up an analytics infrastructure for an organization of any scale. There are many open source tools out there which can be easily leveraged to process large-scale data with efficiency and accuracy. All you need is a good understanding of the tools. It is difficult to integrate data science systems with the organizational workflow With the advancement of tech, one critical challenge that has now become very easy to overcome is to collaborate with different software systems at once. With the rise of general-purpose programming languages, it is now possible to build a variety of software systems using a single programming language. Take Python for example. You can use it to analyze your data, perform machine learning or develop neural networks to work on more complex data models. All this while, you can link your web API designed in Python to communicate with these data science systems. There are provisions being made now to also integrate codes written in different programming languages while ensuring smooth interoperability and no loss of latency. So if you’re wondering how to incorporate your analytics workflow in your organizational workflow, don’t worry too much. Data Scientists will be replaced by Artificial Intelligence soon Although there has been an increased adoption of automation in data science, the notion that the work of a data scientist will be taken over by an AI algorithm soon is rather interesting. Currently, there is an acute shortage of data scientists, as this McKinsey Global Report suggests. Could this change in the future? Will automation completely replace human efforts when it comes to data science? Surely machines are a lot better than humans at finding patterns; AI best the best go player, remember. This is what the common perception seems to be, but it is not true. However sophisticated the algorithms become in automating data science tasks, we will always need a capable data scientist to oversee them and fine-tune their performance. Not just that, businesses will always need professionals with strong analytical and problem solving skills with relevant domain knowledge. They will always need someone to communicate the insights coming out of the analysis to non-technical stakeholders. Machines don’t ask questions of data. Machines don’t convince people. Machines don’t understand the ‘why’. Machines don’t have intuition. At least, not yet. Data scientists are here to stay, and their demand is not expected to go down anytime soon. You need a Ph.D. in statistics to be a data scientist No, you don’t. Data science involves crunching numbers to get interesting insights, and it often involves the use of statistics to better understand the results. When it comes to performing some advanced tasks such as machine learning and deep learning, sure, an advanced knowledge of statistics helps. But that does not imply that people who do not have a degree in maths or statistics cannot become expert data scientists. Today, organizations are facing a severe shortage of data professionals capable of leveraging the data to get useful business insights. This has led to the rise of citizen data scientists - meaning professionals who are not experts in data science, but can use the data science tools and techniques to create efficient data models. These data scientists are no experts in statistics and maths, they just know the tool inside out, ask the right questions, and have the necessary knowledge of turning data into insights. Having an expertise of the data science tools is enough Many people wrongly think that learning a statistical tool such as SAS, or mastering Python and its associated data science libraries is enough to get the data scientist tag. While learning a tool or skill is always helpful (and also essential), by no means is it the only requisite to do effective data science. One needs to go beyond the tools and also master skills such as non-intuitive thinking, problem-solving, and knowing the correct practical applications of a tool to tackle any given business problem. Not just that, it requires you to have excellent communication skills to present your insights and findings related to the most complex of analysis to other stakeholders, in a way they can easily understand and interpret. So if you think that a SAS certification is enough to get you a high-paying data science job and keep it, think again. You need to have access to a lot of data to get useful insights Many small to medium-sized businesses don’t adopt a data science framework because they think it takes lots and lots of data to be able to use the analytics tools and techniques. Data when present in bulk, always helps, true, but don’t need hundreds of thousands of records to identify some pattern, or to extract relevant insights. Per IBM, data science is defined by the 4 Vs of data, meaning Volume, Velocity, Veracity and Variety. If you are able to model your existing data into one of these formats, it automatically becomes useful and valuable. Volume is important to an extent, but it’s the other three parameters that add the required quality. More data = more accuracy Many businesses collect large hordes of information and use the modern tools and frameworks available at their disposal for analyzing this data. Unfortunately, this does not always guarantee accurate results. Neither does it guarantee useful actionable insights or more value. Once the data is collected, the preliminary analysis on what needs to be done with the data is required. Then, we use the tools and frameworks at our disposal to extract the relevant insights and built an appropriate data model. These models need to be fine-tuned as per the processes for which they will be used. Then, eventually, we get the desired degree of accuracy from the model. Data in itself is quite useless. It’s how we work on it - more precisely, how effectively we work on it - that makes all the difference. So there you have it! Data science is one of the most popular skills to have in your resume today, but it is important to first clear all the confusions and misconceptions that you may have about it. Lack of information or misinformation can do more harm than good, when it comes to leveraging the power of data science within a business - especially considering it could prove to be a differentiating factor for its success and failure. Do you agree with our list? Do you think there are any other commonly observed myths around data science that we may have missed? Let us know. Read more 30 common data science terms explained Why is data science important? 15 Useful Python Libraries to make your Data Science tasks Easier
Read more
  • 0
  • 0
  • 7028

article-image-ubers-kepler-gl-an-open-source-toolbox-for-geospatial-analysis
Pravin Dhandre
28 Jun 2018
4 min read
Save for later

Uber's kepler.gl, an open source toolbox for GeoSpatial Analysis

Pravin Dhandre
28 Jun 2018
4 min read
Geography Visualization, also called as Geovisualization plays a pivotal role in areas like cartography, geographic information systems, remote sensing and global positioning systems. Uber, a peer-to-peer transportation network company headquartered at California believes in data-driven decision making and hence keeps developing smart frameworks like deck.gl for exploring and visualizing advanced geospatial data at scale. Uber strives to make the data web-based and shareable in real-time across their teams and customers. Early this month, Uber surprised the geospatial market with its newly open-source toolbox, kepler.gl, a geoanalytics tool to gain quick insights from geospatial data with amazing and intuitive visualizations. What’s exactly Kepler.gl is? kepler.gl is a visualization-rich web platform, developed on top of deck.gl, a WebGL-powered data visualization library providing real-time visual analytics of millions of geolocation points. The platform provides visual exploration of geographical data sets along with spatial aggregation of all data points collected. The platform is said to be data-agnostic with a single interface to convert your data into insightful visualizations. https://www.youtube.com/watch?v=i2fRN4e2s0A The platform is very user-friendly where one can just drag the CSV or the GeoJSON files and drop them into the browser to visualize the dataset more intuitively. The platform is supported with different map layers, filtering option, aggregation feature through which you can get the final visualization in an animated format or like a video. The usability of features is so high that you can apply all the metrics available to your data points without much of a hassle. The web platform exhibits high performance where you can get insights from your spatial data in less than 10 minutes and that too in a single window. Another advantage of this framework is it does not involve any sort of coding and hence non-technical users can also reap the benefits by churn valuable insights from the data points. The platform is also equipped with some advanced, complex features such as 2D cartographic plane,a separate dimension for altitude, visibility of height of hexagon and grids. The users seem happy with the new height feature which helps them detect abnormalities and illicit traits in an aggregated map. With the filtering menu, the analysts and engineers can compare their data and have a granular look at their data points. This option also helps in reading the histogram well and one can easily detect outliers and make their dataset more reliable. It  has a feature to add playback to time series data points which makes getting useful information of real time location systems easy. The team at Uber looks at this toolbox with a long-term vision where they are planning to keep adding new features and enhancements to make it highly functional and a single-click visualization dashboard. The team has already announced that they would be powering it up with two major enhancements to the current functionality in next couple of months. They would add support on, More robust exploration: There will be interlinkage between charts and maps, and support for custom charts, maps and widgets like the renowned BI tool Tableau through which it will facilitate analytics teams to unveil deeper insights. Addition of newer geo-analytical capabilities: To support massive datasets, there will be added features on data operations such as polygon aggregation, union of data points, operations like joining and buffering. Companies across different verticals such as Airbnb, Atkins Global, Cityswifter, Mapbox have found great value in kepler.gl offerings and are looking towards engineering their products to leverage this framework. The visualization specialists at these companies have already praised Uber for building such a simple yet fast platform with remarkable capabilities. To get started with kepler.gl, read the documentation available at Github and start creating visualizations and enhance your geospatial data analysis. Top 7 libraries for geospatial analysis Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data Data Visualization with ggplot2
Read more
  • 0
  • 0
  • 6943

article-image-python-tensorflow-excel-and-more-data-professionals-reveal-their-top-tools
Amey Varangaonkar
06 Jun 2018
4 min read
Save for later

Python, Tensorflow, Excel and more - Data professionals reveal their top tools

Amey Varangaonkar
06 Jun 2018
4 min read
Data professionals are constantly on the lookout for the best tools to simplify their data science tasks - be it data acquisition, machine learning, or visualizing the results of the analysis. With so much on their plate already, having robust, efficient tools in the arsenal helps them a lot in reducing the procedural complexities. Not just that, the time taken to do these tasks is considerably reduced as well. But what tools do data professionals rely on to make their lives easier? Thanks to the Skill-up 2018 survey that we recently conducted, we have some interesting observations to share with you! Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free. Key Takeaways Python is the most widely used programming language by data professionals Python finds a wide adoption across all spectrums of data science - including data analysis, machine learning, deep learning and data visualization Excel continues to be favored by the data professionals because of its effectiveness and simplicity R is slowly falling behind Python in the race to Data Science supremacy Now, let’s look at these observations, in more depth. Python continues its ascension as the top dog Python’s rise in popularity as well as adoption over the last 3 years has been quite staggering, to say the least. Python’s ease of use, powerful analytical and machine learning capabilities as well as its applications outside of data science make it quite a popular language in the tech community. It thus comes as no surprise that it stood out from the others and was the undisputed choice of language for the data pros. R, on the other hand, seems to be finding it difficult to play catch-up to Python, with less than half the number of votes - despite being the tool of choice for many statisticians and researchers. Is the paradigm shift well and truly on? Is Python edging R out for good? Source: Packt Skill-Up Survey 2018 It is interesting to see SQL as the number 2, but considering the number of people working with databases these days it doesn’t come as a surprise. Also, JavaScript is preferred more than Java, indicating the rising need for web-based dashboards for effective Business Intelligence. Data professionals still love Excel, but Python libraries are taking over Microsoft Excel has traditionally been a highly popular tool for data analysis, especially when dealing with data with hundreds and thousands of records. Excel’s perfect setting for data manipulation and charting continues to be the reason why people still use it for basic-level data analysis, as indicated by our survey. Almost 53% of the respondents prefer having Excel in their analysis toolkit for their day to day tasks. Top libraries, tools and frameworks used by data professionals (Source: Packt Skill-Up Survey 2018) The survey also indicated Python’s rising dominance in the data science domain, with 8 out of the 10 most-used tools for data analysis being Python-based. Python’s offerings for data wrangling, scientific computing, machine learning and deep learning make its libraries the obvious choice for data professionals. Here’s a quick look at  15 useful Python libraries to make the above-mentioned data science tasks easier. Tensorflow and PyTorch are in demand AI’s popularity is soaring with every passing day as it finds applications across all types of industries and business domains. In our survey, we found machine learning and deep learning to be two of the most valuable skills to have for any data scientist, as can be seen from the word cloud below: Word cloud for the most valued skills by data professionals (Source: Packt Skill-Up Survey) Python’s two popular deep learning frameworks - Tensorflow and PyTorch have thus gained a lot of attention and adoption in the recent times. Along with Keras - another Python library - these two libraries are the most used frameworks used by data scientists and ML developers for building efficient machine learning and deep learning models. Which language/libraries do you use for your everyday Data Science tasks? Do you agree with your peers’ choice of tools? Feel free to let us know! Read more Data cleaning is the worst part of data analysis, say data scientists 30 common data science terms explained Top 10 deep learning frameworks
Read more
  • 0
  • 0
  • 6695
Banner background image

article-image-why-enterprises-love-the-elastic-stack
Pravin Dhandre
31 May 2018
2 min read
Save for later

Why Enterprises love the Elastic Stack

Pravin Dhandre
31 May 2018
2 min read
Business insights has always been a hotspot by companies and with data that keep flowing, growing and becoming fat by the day, analytics need to be quicker, real-time and reliable. Analytics that can’t match up today’s data provide insights that become almost lifeless to market dynamics. The question then is, is there an analytics solution that can tackle the data hydra? Elastic Stack is your answer. It is power packed with tools like Elasticsearch, Kibana, Logstash, X-Pack and Beats that takes data from any source, in any format, and provide instant search, analysis, and visualization in real time. With over 225 million downloads, it is a clear crowd favorite. Enterprises get an addon benefit in using it as a single analytical suite or getting it integrated with other products, delivering real-time actionable insights and decisions every time. Why Enterprises love the Elastic Stack? Some of the common things that enterprises love about the Elastic Stack is its being open source platform. The next thing that IT companies enjoys is its super fast distributed search mechanism that makes your queries run faster and much efficient. Apart from this, its bundling with Kibana and Logstash makes it awesome for IT infrastructure and DevOps teams who can aggregate and analyze billions of logs with ease. Its simple and robust analysis platform provides distinct advantage over Splunk, Solr, Sphinx, Ambar and many other alternative product suites. Also, its SaaS option allows customers to perform log analytics, full text search and application monitoring over the cloud with utmost ease and reasonable pricing. Companies like Amazon, Bloomberg, Ebay, SAP, Citibank, Sony, Mozilla, Wordpress, SalesForce are already been using Elastic Stack, powering their search and analytics to combat their daily business challenges. Whether it is an educational institution, travel agency, e-commerce, or a financial institution, the Elastic stack is empowering millions of companies with real-time metrics, strong analytics, better search experience and high customer satisfaction. How to install Elasticsearch in Ubuntu and Windows How to perform Numeric Metric Aggregations with Elasticsearch CRUD (Create Read, Update and Delete) Operations with Elasticsearch
Read more
  • 0
  • 0
  • 3921

article-image-libraries-for-geospatial-analysis
Aarthi Kumaraswamy
22 May 2018
12 min read
Save for later

Top 7 libraries for geospatial analysis

Aarthi Kumaraswamy
22 May 2018
12 min read
The term geospatial refers to finding information that is located on the earth's surface. This can include, for example, the position of a cellphone tower, the shape of a road, or the outline of a country. Geospatial data often associates some piece of information with a particular location. Geospatial development is the process of writing computer programs that can access, manipulate, and display this type of information. Internally, geospatial data is represented as a series of coordinates, often in the form of latitude and longitude values. Additional attributes, such as temperature, soil type, height, or the name of a landmark, are also often present. There can be many thousands (or even millions) of data points for a single set of geospatial data. In addition to the prosaic tasks of importing geospatial data from various external file formats and translating data from one projection to another, geospatial data can also be manipulated to solve various interesting problems. Obvious examples include the task of calculating the distance between two points, calculating the length of a road, or finding all data points within a given radius of a selected point. We use libraries to solve all of these problems and more. Today we will look at the major libraries used to process and analyze geospatial data. GDAL/OGR GEOS Shapely Fiona Python Shapefile Library (pyshp) pyproj Rasterio GeoPandas This is an excerpt from the book, Mastering Geospatial Analysis with Python by Paul Crickard, Eric van Rees, and Silas Toms. Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library The Geospatial Data Abstraction Library (GDAL)/OGR Simple Features Library combines two separate libraries that are generally downloaded together as a GDAL. This means that installing the GDAL package also gives access to OGR functionality. The reason GDAL is covered first is that other packages were written after GDAL, so chronologically, it comes first. As you will notice, some of the packages covered in this post extend GDAL's functionality or use it under the hood. GDAL was created in the 1990s by Frank Warmerdam and saw its first release in June 2000. Later, the development of GDAL was transferred to the Open Source Geospatial Foundation (OSGeo). Technically, GDAL is a little different than your average Python package as the GDAL package itself was written in C and C++, meaning that in order to be able to use it in Python, you need to compile GDAL and its associated Python bindings. However, using conda and Anaconda makes it relatively easy to get started quickly. Because it was written in C and C++, the online GDAL documentation is written in the C++ version of the libraries. For Python developers, this can be challenging, but many functions are documented and can be consulted with the built-in pydoc utility, or by using the help function within Python. Because of its history, working with GDAL in Python also feels a lot like working in C++ rather than pure Python. For example, a naming convention in OGR is different than Python's since you use uppercase for functions instead of lowercase. These differences explain the choice for some of the other Python libraries such as Rasterio and Shapely, which are also covered in this chapter, that has been written from a Python developer's perspective but offer the same GDAL functionality. GDAL is a massive and widely used data library for raster data. It supports the reading and writing of many raster file formats, with the latest version counting up to 200 different file formats that are supported. Because of this, it is indispensable for geospatial data management and analysis. Used together with other Python libraries, GDAL enables some powerful remote sensing functionalities. It's also an industry standard and is present in commercial and open source GIS software. The OGR library is used to read and write vector-format geospatial data, supporting reading and writing data in many different formats. OGR uses a consistent model to be able to manage many different vector data formats. You can use OGR to do vector reprojection, vector data format conversion, vector attribute data filtering, and more. GDAL/OGR libraries are not only useful for Python programmers but are also used by many GIS vendors and open source projects. The latest GDAL version at the time of writing is 2.2.4, which was released in March 2018. GEOS The Geometry Engine Open Source (GEOS) is the C/C++ port of a subset of the Java Topology Suite (JTS) and selected functions. GEOS aims to contain the complete functionality of JTS in C++. It can be compiled on many platforms, including Python. As you will see later on, the Shapely library uses functions from the GEOS library. In fact, there are many applications using GEOS, including PostGIS and QGIS. GeoDjango, also uses GEOS, as well as GDAL, among other geospatial libraries. GEOS can also be compiled with GDAL, giving OGR all of its capabilities. The JTS is an open source geospatial computational geometry library written in Java. It provides various functionalities, including a geometry model, geometric functions, spatial structures and algorithms, and i/o capabilities. Using GEOS, you have access to the following capabilities—geospatial functions (such as within and contains), geospatial operations (union, intersection, and many more), spatial indexing, Open Geospatial Consortium (OGC) well-known text (WKT) and well-known binary (WKB) input/output, the C and C++ APIs, and thread safety. Shapely Shapely is a Python package for manipulation and analysis of planar features, using functions from the GEOS library (the engine of PostGIS) and a port of the JTS. Shapely is not concerned with data formats or coordinate systems but can be readily integrated with such packages. Shapely only deals with analyzing geometries and offers no capabilities for reading and writing geospatial files. It was developed by Sean Gillies, who was also the person behind Fiona and Rasterio. Shapely supports eight fundamental geometry types that are implemented as a class in the shapely.geometry module—points, multipoints, linestrings, multilinestrings, linearrings, multipolygons, polygons, and geometrycollections. Apart from representing these geometries, Shapely can be used to manipulate and analyze geometries through a number of methods and attributes. Shapely has mainly the same classes and functions as OGR while dealing with geometries. The difference between Shapely and OGR is that Shapely has a more Pythonic and very intuitive interface, is better optimized, and has a well-developed documentation. With Shapely, you're writing pure Python, whereas with GEOS, you're writing C++ in Python. For data munging, a term used for data management and analysis, you're better off writing in pure Python rather than C++, which explains why these libraries were created. For more information on Shapely, consult the documentation. This page also has detailed information on installing Shapely for different platforms and how to build Shapely from the source for compatibility with other modules that depend on GEOS. This refers to the fact that installing Shapely will require you to upgrade NumPy and GEOS if these are already installed. Fiona Fiona is the API of OGR. It can be used for reading and writing data formats. The main reason for using it instead of OGR is that it's closer to Python than OGR as well as more dependable and less error-prone. It makes use of two markup languages, WKT and WKB, for representing spatial information with regards to vector data. As such, it can be combined well with other Python libraries such as Shapely, you would use Fiona for input and output, and Shapely for creating and manipulating geospatial data. While Fiona is Python compatible and our recommendation, users should also be aware of some of the disadvantages. It is more dependable than OGR because it uses Python objects for copying vector data instead of C pointers, which also means that they use more memory, which affects the performance. Python shapefile library (pyshp) The Python shapefile library (pyshp) is a pure Python library and is used to read and write shapefiles. The pyshp library's sole purpose is to work with shapefiles—it only uses the Python standard library. You cannot use it for geometric operations. If you're only working with shapefiles, this one-file-only library is simpler than using GDAL. pyproj The pyproj is a Python package that performs cartographic transformations and geodetic computations. It is a Cython wrapper to provide Python interfaces to PROJ.4 functions, meaning you can access an existing library of C code in Python. PROJ.4 is a projection library that transforms data among many coordinate systems and is also available through GDAL and OGR. The reason that PROJ.4 is still popular and widely used is two-fold: Firstly, because it supports so many different coordinate systems Secondly, because of the routes it provides to do this—Rasterio and GeoPandas, two Python libraries covered next, both use pyproj and thus PROJ.4 functionality under the hood The difference between using PROJ.4 separately instead of using it with a package such as GDAL is that it enables you to re-project individual points, and packages using PROJ.4 do not offer this functionality. The pyproj package offers two classes—the Proj class and the Geod class. The Proj class performs cartographic computations, while the Geod class performs geodetic computations. Rasterio Rasterio is a GDAL and NumPy-based Python library for raster data, written with the Python developer in mind instead of C, using Python language types, protocols, and idioms. Rasterio aims to make GIS data more accessible to Python programmers and helps GIS analysts learn important Python standards. Rasterio relies on concepts of Python rather than GIS. Rasterio is an open source project from the satellite team of Mapbox, a provider of custom online maps for websites and applications. The name of this library should be pronounced as raster-i-o rather than ras-te-rio. Rasterio came into being as a result of a project called the Mapbox Cloudless Atlas, which aimed to create a pretty-looking basemap from satellite imagery. One of the software requirements was to use open source software and a high-level language with handy multi-dimensional array syntax. Although GDAL offers proven algorithms and drivers, developing with GDAL's Python bindings feels a lot like C++. Therefore, Rasterio was designed to be a Python package at the top, with extension modules (using Cython) in the middle, and a GDAL shared library on the bottom. Other requirements for the raster library were being able to read and write NumPy ndarrays to and from data files, use Python types, protocols, and idioms instead of C or C++ to free programmers from having to code in two languages. For georeferencing, Rasterio follows the lead of pyproj. There are a couple of capabilities added on top of reading and writing, one of them being a features module. Reprojection of geospatial data can be done with the rasterio.warp module. Rasterio's project homepage can be found on Github. GeoPandas GeoPandas is a Python library for working with vector data. It is based on the pandas library that is part of the SciPy stack. SciPy is a popular library for data inspection and analysis, but unfortunately, it cannot read spatial data. GeoPandas was created to fill this gap, taking pandas data objects as a starting point. The library also adds functionality from geographical Python packages. GeoPandas offers two data objects—a GeoSeries object that is based on a pandas Series object and a GeoDataFrame, based on a pandas DataFrame object, but adding a geometry column for each row. Both GeoSeries and GeoDataFrame objects can be used for spatial data processing, similar to spatial databases. Read and write functionality is provided for almost every vector data format. Also, because both Series and DataFrame objects are subclasses from pandas data objects, you can use the same properties to select or subset data, for example .loc or .iloc. GeoPandas is a library that employs the capabilities of newer tools, such as Jupyter Notebooks, pretty well, whereas GDAL enables you to interact with data records inside of vector and raster datasets through Python code. GeoPandas takes a more visual approach by loading all records into a GeoDataFrame so that you can see them all together on your screen. The same goes for plotting data. These functionalities were lacking in Python 2 as developers were dependent on IDEs without extensive data visualization capabilities which are now available with Jupyter Notebooks. We've provided an overview of the most important open source packages for processing and analyzing geospatial data. The question then becomes when to use a certain package and why. GDAL, OGR, and GEOS are indispensable for geospatial processing and analyzing, but were not written in Python, and so they require Python binaries for Python developers. Fiona, Shapely, and pyproj were written to solve these problems, as well as the newer Rasterio library. For a more Pythonic approach, these newer packages are preferable to the older C++ packages with Python binaries (although they're used under the hood). Now that you have an idea of what options are available for a certain use case and why one package is preferable over another, here’s something you should always remember. As is often the way in programming, there might be multiple solutions for one particular problem. For example, when dealing with shapefiles, you could use pyshp, GDAL, Shapely, or GeoPandas, depending on your preference and the problem at hand. Introduction to Data Analysis and Libraries 15 Useful Python Libraries to make your Data Science tasks Easier “Pandas is an effective tool to explore and analyze data”: An interview with Theodore Petrou Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data  
Read more
  • 0
  • 0
  • 20206

article-image-dask-library-scalable-analytics-python
Amey Varangaonkar
22 May 2018
6 min read
Save for later

Introducing Dask: The library that makes scalable analytics in Python easier

Amey Varangaonkar
22 May 2018
6 min read
Python’s rise as the preferred language of choice in Data Science is unprecedented, but not really unexpected. Apart from being a general-purpose language which can be used for a variety of tasks - from scripting to networking, Python offers a rich suite of libraries for general data science tasks such as scientific computing, data visualization, and more. However, one big challenge faced by the data scientists is that these packages are not designed for scale. This is crucial in today’s Big Data era where tons of data needs to be processed and analyzed on the go. A platform which supports the existing Python ecosystem and allows it to scale across multiple machines and clusters without affecting the performance was conspicuously missing. Enter Dask. What is Dask? Dask is a flexible parallel computing library written in Python for analytics, designed mainly to offer scalability and enhanced power to the existing packages and libraries. It allows the users to integrate their existing Python-based projects written in popular libraries such as NumPy, SciPy, pandas, and more. Architecture is demonstrated in the diagram below: Architecture (Image courtesy: Slideshare) The 2 key components of Dask that interact with the Python libraries are: Dynamic task schedulers - which takes care of the intensive computational workloads ‘Big Data’ Dask collections - consisting of dataframes, parallel arrays and interfaces that allow for the computations to run on distributed environments Why use Dask? Given there are already quite a few distributed platforms for large-scale data processing such as Apache Spark, Apache Storm, Flink and so on, why and when should one go for Dask? What are the advantages offered by this Python library? Let us take a look at the 4 major reasons to prefer Dask for distributed, scalable analytics in Python: Easy to get started: If you are an existing Python user, you must have already worked with popular Python packages such as NumPy, SciPy, matplotlib, scikit-learn, pandas, and more. Dask offers a similar, intuitive interface and since it is a part of the bigger Python ecosystem, getting started with Dask is very easy. It uses the existing Python APIs to switch between the popular packages and their Dask-equivalents, so you don’t have to spend a lot of time in porting the code. For absolute beginners, using Dask for scalable analytics would be an easier and logical option to pursue, once they have grasped the fundamentals of Python and the associated libraries. Scales up and down quite easily: You can run your project on Dask on a single machine, or on a cluster with thousands of cores without essentially affecting the speed and performance of your code. Dask uses the multi-core CPUs within a single system optimally to process hundreds of terabytes of data without the need for additional hardware. Similarly, for moderate to large datasets spanning 100+ gigabytes which often don’t fit into a single storage device, the computing power of the clusters can be coupled with Dask for effective analytics. Supports complex applications: Many companies tend to tackle complex computations by introducing custom codes that run on popular Big Data tools such as Hadoop MapReduce and Apache Spark. However, with the help of the dynamic task schedule feature of Dask, it is now possible to run and process complex applications without introducing any additional code. Dask is solely responsible for the smooth handling of various tasks such as network communication, load balancing and diagnostics, among the others. Clear, responsive, real-time feedback: One of the most important features of Dask is its user-friendliness. Dask provides a real-time dashboard that highlights the key metrics of the processing task undertaken by the user - such as the current progress of your project, memory consumption and more. It also offers an in-built IPython kernel that allows the user to investigate the ongoing computation with just a terminal. How Dask compares with Apache Spark Apache Spark is one of the most popular and widely used Big Data tools for distributed data processing and analytics. Dask and Apache Spark have many features in common, prompting us and many other developers to ask the question - which tool is better? While Spark has been around for quite some and has many standard, stable features over years of development, Dask is quite new and is still being improved as a tool. We summarize the important differences between Dask and Apache Spark in the table below: CriteriaApache SparkDaskPrimary languageScalaPythonScaleSupports a single node to thousands of nodes in the clusterSupports a single node to thousands of nodes in the clusterEcosystemAll-in-one self-sufficient ecosystemIntegration with popular libraries within the Python ecosystemFlexibilityLowHighStream processingBuilt-in module called Spark Streaming presentReal-time interface which is pretty low-level, requires more work than Apache SparkGraph processingPossible with GraphX moduleNot possibleMachine learningUses the Spark MLlib moduleIntegrates with scikit-learn and XGBoostPopularityVery high, commonly used tool in the Big Data ecosystemFairly new tool but has already found its place in the pandas, scikit-learn and Jupyter stack   You can read a detailed comparison of Apache Spark and Dask on the official Dask documentation page. What we can expect from Dask As we saw from the comparison above, it is fairly easy to port an existing Python project using several high-profile Python libraries such as NumPy, scikit-learn and more. Python developers and data scientists will appreciate the high flexibility and complex computational capabilities offered by Dask. The limited stream processing and graph processing features are big areas of improvement, but we can expect some developments in this domain in the near future. Even though Dask is still relatively new, it looks very promising due to its close affinity with the Python ecosystem. With Python’s clout rising, many people would prefer a Python-based data processing tool which works at scale, without having to switch to an external Big Data framework. Dask may well be the superhero to come to the developers’ rescue, in such cases. You can learn more about the latest developments in Dask on their official GitHub page. Read more Is Apache Spark today’s Hadoop? Apache Spark 2.3 now has native Kubernetes support! Should you move to Python 3? 7 Python experts’ opinions
Read more
  • 0
  • 0
  • 3924
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-common-data-science-terms
Aarthi Kumaraswamy
16 May 2018
27 min read
Save for later

30 common data science terms explained

Aarthi Kumaraswamy
16 May 2018
27 min read
Let’s begin at the beginning. What do terms like statistical population, statistical comparison, statistical inference mean? What good is munging, coding, booting, regularization etc. On a scale of 1 to 30 (1 being the lowest and 30, the highest), rate yourself as a data scientist. No matter what you have scored yourself, we hope to have improved that score at least by a little, by the end of this post. Let’s start with a basic question: What is data science? [box type="shadow" align="" class="" width=""]The following is an excerpt from the book, Statistics for Data Science written by James D. Miller and published by Packt Publishing.[/box] The idea of how data science is defined is a matter of opinion. I personally like the explanation that data science is a progression or, even better, an evolution of thought or steps, as shown in the following figure: Although a progression or evolution implies a sequential journey, in practice, this is an extremely fluid process; each of the phases may inspire the data scientist to reverse and repeat one or more of the phases until they are satisfied. In other words, all or some phases of the process may be repeated until the data scientist determines that the desired outcome is reached. Depending on your sources and individual beliefs, you may say the following: Statistics is data science, and data science is statistics. Based upon personal experience, research, and various industry experts' advice, someone delving into the art of data science should take every opportunity to understand and gain experience as well as proficiency with the following list of common data science terms: Statistical population Probability False positives Statistical inference Regression Fitting Categorical data Classification Clustering Statistical comparison Coding Distributions Data mining Decision trees Machine learning Munging and wrangling Visualization D3 Regularization Assessment Cross-validation Neural networks Boosting Lift Mode Outlier Predictive modeling Big data Confidence interval Writing Statistical population You can perhaps think of a statistical population as a recordset (or a set of records). This set or group of records will be of similar items or events that are of interest to the data scientist for some experiment. For a data developer, a population of data may be a recordset of all sales transactions for a month, and the interest might be reporting to the senior management of an organization which products are the fastest sellers and at which time of the year. For a data scientist, a population may be a recordset of all emergency room admissions during a month, and the area of interest might be to determine the statistical demographics for emergency room use. [box type="note" align="" class="" width=""]Typically, the terms statistical population and statistical model are or can be used interchangeably. Once again, data scientists continue to evolve with their alignment on their use of common terms. [/box] Another key point concerning statistical populations is that the recordset may be a group of (actually) existing objects or a hypothetical group of objects. Using the preceding example, you might draw a comparison of actual objects as those actual sales transactions recorded for the month while the hypothetical objects as sales transactions are expected, forecast, or presumed (based upon observations or experienced assumptions or other logic) to occur during a month. Finally, through the use of statistical inference, the data scientist can select a portion or subset of the recordset (or population) with the intention that it will represent the total population for a particular area of interest. This subset is known as a statistical sample. If a sample of a population is chosen accurately, characteristics of the entire population (that the sample is drawn from) can be estimated from the corresponding characteristics of the sample. Probability Probability is concerned with the laws governing random events.                                           -www.britannica.com When thinking of probability, you think of possible upcoming events and the likelihood of them actually occurring. This compares to a statistical thought process that involves analyzing the frequency of past events in an attempt to explain or make sense of the observations. In addition, the data scientist will associate various individual events, studying the relationship of these events. How these different events relate to each other governs the methods and rules that will need to be followed when we're studying their probabilities. [box type="note" align="" class="" width=""]A probability distribution is a table that is used to show the probabilities of various outcomes in a sample population or recordset. [/box] False positives The idea of false positives is a very important statistical (data science) concept. A false positive is a mistake or an errored result. That is, it is a scenario where the results of a process or experiment indicate a fulfilled or true condition when, in fact, the condition is not true (not fulfilled). This situation is also referred to by some data scientists as a false alarm and is most easily understood by considering the idea of a recordset or statistical population (which we discussed earlier in this section) that is determined not only by the accuracy of the processing but by the characteristics of the sampled population. In other words, the data scientist has made errors during the statistical process, or the recordset is a population that does not have an appropriate sample (or characteristics) for what is being investigated. Statistical inference What developer at some point in his or her career, had to create a sample or test data? For example, I've often created a simple script to generate a random number (based upon the number of possible options or choices) and then used that number as the selected option (in my test recordset). This might work well for data development, but with statistics and data science, this is not sufficient. To create sample data (or a sample population), the data scientist will use a process called statistical inference, which is the process of deducing options of an underlying distribution through analysis of the data you have or are trying to generate for. The process is sometimes called inferential statistical analysis and includes testing various hypotheses and deriving estimates. When the data scientist determines that a recordset (or population) should be larger than it actually is, it is assumed that the recordset is a sample from a larger population, and the data scientist will then utilize statistical inference to make up the difference. [box type="note" align="" class="" width=""]The data or recordset in use is referred to by the data scientist as the observed data. Inferential statistics can be contrasted with descriptive statistics, which is only concerned with the properties of the observed data and does not assume that the recordset came from a larger population. [/box] Regression Regression is a process or method (selected by the data scientist as the best fit technique for the experiment at hand) used for determining the relationships among variables. If you're a programmer, you have a certain understanding of what a variable is, but in statistics, we use the term differently. Variables are determined to be either dependent or independent. An independent variable (also known as a predictor) is the one that is manipulated by the data scientist in an effort to determine its relationship with a dependent variable. A dependent variable is a variable that the data scientist is measuring. [box type="note" align="" class="" width=""]It is not uncommon to have more than one independent variable in a data science progression or experiment. [/box] More precisely, regression is the process that helps the data scientist comprehend how the typical value of the dependent variable (or criterion variable) changes when any one or more of the independent variables is varied while the other independent variables are held fixed. Fitting Fitting is the process of measuring how well a statistical model or process describes a data scientist's observations pertaining to a recordset or experiment. These measures will attempt to point out the discrepancy between observed values and probable values. The probable values of a model or process are known as a distribution or a probability distribution. Therefore, a probability distribution fitting (or distribution fitting) is when the data scientist fits a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. The object of a data scientist performing a distribution fitting is to predict the probability or to forecast the frequency of, the occurrence of the phenomenon at a certain interval. [box type="note" align="" class="" width=""]One of the most common uses of fitting is to test whether two samples are drawn from identical distributions.[/box] There are numerous probability distributions a data scientist can select from. Some will fit better to the observed frequency of the data than others will. The distribution giving a close fit is supposed to lead to good predictions; therefore, the data scientist needs to select a distribution that suits the data well. Categorical data Earlier, we explained how variables in your data can be either independent or dependent. Another type of variable definition is a categorical variable. This type of variable is one that can take on one of a limited, and typically fixed, number of possible values, thus assigning each individual to a particular category. Often, the collected data's meaning is unclear. Categorical data is a method that a data scientist can use to put meaning to the data. For example, if a numeric variable is collected (let's say the values found are 4, 10, and 12), the meaning of the variable becomes clear if the values are categorized. Let's suppose that based upon an analysis of how the data was collected, we can group (or categorize) the data by indicating that this data describes university students, and there is the following number of players: 4 tennis players 10 soccer players 12 football players Now, because we grouped the data into categories, the meaning becomes clear. Some other examples of categorized data might be individual pet preferences (grouped by the type of pet), or vehicle ownership (grouped by the style of a car owned), and so on. So, categorical data, as the name suggests, is data grouped into some sort of category or multiple categories. Some data scientists refer to categories as sub-populations of data. [box type="note" align="" class="" width=""]Categorical data can also be data that is collected as a yes or no answer. For example, hospital admittance data may indicate that patients either smoke or do not smoke. [/box] Classification Statistical classification of data is the process of identifying which category (discussed in the previous section) a data point, observation, or variable should be grouped into. The data science process that carries out a classification process is known as a classifier. Read this post: Classification using Convolutional Neural Networks [box type="note" align="" class="" width=""]Determining whether a book is fiction or non-fiction is a simple example classification. An analysis of data about restaurants might lead to the classification of them among several genres. [/box] Clustering Clustering is the process of dividing up the data occurrences into groups or homogeneous subsets of the dataset, not a predetermined set of groups as in classification (described in the preceding section) but groups identified by the execution of the data science process based upon similarities that it found among the occurrences. Objects in the same group (a group is also referred to as a cluster) are found to be more analogous (in some sense or another) to each other than to those objects found in other groups (or found in other clusters). The process of clustering is found to be very common in exploratory data mining and is also a common technique for statistical data analysis. Statistical comparison Simply put, when you hear the term statistical comparison, one is usually referring to the act of a data scientist performing a process of analysis to view the similarities or variances of two or more groups or populations (or recordsets). As a data developer, one might be familiar with various utilities such as FC Compare, UltraCompare, or WinDiff, which aim to provide the developer with a line-by-line comparison of the contents of two or more (even binary) files. In statistics (data science), this process of comparing is a statistical technique to compare populations or recordsets. In this method, a data scientist will conduct what is called an Analysis of Variance (ANOVA), compare categorical variables (within the recordsets), and so on. [box type="note" align="" class="" width=""]ANOVA is an assortment of statistical methods that are used to analyze the differences among group means and their associated procedures (such as variations among and between groups, populations, or recordsets). This method eventually evolved into the Six Sigma dataset comparisons. [/box] Coding Coding or statistical coding is again a process that a data scientist will use to prepare data for analysis. In this process, both quantitative data values (such as income or years of education) and qualitative data (such as race or gender) are categorized or coded in a consistent way. Coding is performed by a data scientist for various reasons such as follows: More effective for running statistical models Computers understand the variables Accountability--so the data scientist can run models blind, or without knowing what variables stand for, to reduce programming/author bias [box type="shadow" align="" class="" width=""]You can imagine the process of coding as the means to transform data into a form required for a system or application. [/box] Distributions The distribution of a statistical recordset (or of a population) is a visualization showing all the possible values (or sometimes referred to as intervals) of the data and how often they occur. When a distribution of categorical data (which we defined earlier in this chapter) is created by a data scientist, it attempts to show the number or percentage of individuals in each group or category. Linking an earlier defined term with this one, a probability distribution, stated in simple terms, can be thought of as a visualization showing the probability of occurrence of different possible outcomes in an experiment. Data mining With data mining, one is usually more absorbed in the data relationships (or the potential relationships between points of data, sometimes referred to as variables) and cognitive analysis. To further define this term, we can say that data mining is sometimes more simply referred to as knowledge discovery or even just discovery, based upon processing through or analyzing data from new or different viewpoints and summarizing it into valuable insights that can be used to increase revenue, cuts costs, or both. Using software dedicated to data mining is just one of several analytical approaches to data mining. Although there are tools dedicated to this purpose (such as IBM Cognos BI and Planning Analytics, Tableau, SAS, and so on.), data mining is all about the analysis process finding correlations or patterns among dozens of fields in the data and that can be effectively accomplished using tools such as MS Excel or any number of open source technologies. [box type="note" align="" class="" width=""]A common technique to data mining is through the creation of custom scripts using tools such as R or Python. In this way, the data scientist has the ability to customize the logic and processing to their exact project needs. [/box] Decision trees A statistical decision tree uses a diagram that looks like a tree. This structure attempts to represent optional decision paths and a predicted outcome for each path selected. A data scientist will use a decision tree to support, track, and model decision making and their possible consequences, including chance event outcomes, resource costs, and utility. It is a common way to display the logic of a data science process. Machine learning Machine learning is one of the most intriguing and exciting areas of data science. It conjures all forms of images around artificial intelligence which includes Neural Networks, Support Vector Machines (SVMs), and so on. Fundamentally, we can describe the term machine learning as a method of training a computer to make or improve predictions or behaviors based on data or, specifically, relationships within that data. Continuing, machine learning is a process by which predictions are made based upon recognized patterns identified within data, and additionally, it is the ability to continuously learn from the data's patterns, therefore continuingly making better predictions. It is not uncommon for someone to mistake the process of machine learning for data mining, but data mining focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can be used to learn and establish baseline behavioral profiles for various entities and then to find meaningful anomalies. Here is the exciting part: the process of machine learning (using data relationships to make predictions) is known as predictive analytics. Predictive analytics allow the data scientists to produce reliable, repeatable decisions and results and uncover hidden insights through learning from historical relationships and trends in the data. Munging and wrangling The terms munging and wrangling are buzzwords or jargon meant to describe one's efforts to affect the format of data, recordset, or file in some way in an effort to prepare the data for continued or otherwise processing and/or evaluations. With data development, you are most likely familiar with the idea of Extract, Transform, and Load (ETL). In somewhat the same way, a data developer may mung or wrangle data during the transformation steps within an ETL process. Common munging and wrangling may include removing punctuation or HTML tags, data parsing, filtering, all sorts of transforming, mapping, and tying together systems and interfaces that were not specifically designed to interoperate. Munging can also describe the processing or filtering of raw data into another form, allowing for more convenient consumption of the data elsewhere. Munging and wrangling might be performed multiple times within a data science process and/or at different steps in the evolving process. Sometimes, data scientists use munging to include various data visualization, data aggregation, training a statistical model, as well as much other potential work. To this point, munging and wrangling may follow a flow beginning with extracting the data in a raw form, performing the munging using various logic, and lastly, placing the resulting content into a structure for use. Although there are many valid options for munging and wrangling data, preprocessing and manipulation, a tool that is popular with many data scientists today is a product named Trifecta, which claims that it is the number one (data) wrangling solution in many industries. [box type="note" align="" class="" width=""]Trifecta can be downloaded for your personal evaluation from https://www.trifacta.com/. Check it out! [/box] Visualization The main point (although there are other goals and objectives) when leveraging a data visualization technique is to make something complex appear simple. You can think of visualization as any technique for creating a graphic (or similar) to communicate a message. Other motives for using data visualization include the following: To explain the data or put the data in context (which is to highlight demographic statistics) To solve a specific problem (for example, identifying problem areas within a particular business model) To explore the data to reach a better understanding or add clarity (such as what periods of time do this data span?) To highlight or illustrate otherwise invisible data (such as isolating outliers residing in the data) To predict, such as potential sales volumes (perhaps based upon seasonality sales statistics) And others Statistical visualization is used in almost every step in the data science process, within the obvious steps such as exploring and visualizing, analyzing and learning, but can also be leveraged during collecting, processing, and the end game of using the identified insights. D3 D3 or D3.js, is essentially an open source JavaScript library designed with the intention of visualizing data using today's web standards. D3 helps put life into your data, utilizing Scalable Vector Graphics (SVG), Canvas, and standard HTML. D3 combines powerful visualization and interaction techniques with a data-driven approach to DOM manipulation, providing data scientists with the full capabilities of modern browsers and the freedom to design the right visual interface that best depicts the objective or assumption. In contrast to many other libraries, D3.js allows inordinate control over the visualization of data. D3 is embedded within an HTML webpage and uses pre-built JavaScript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects, and so on. Regularization Regularization is one possible approach that a data scientist may use for improving the results generated from a statistical model or data science process, such as when addressing a case of overfitting in statistics and data science. [box type="note" align="" class="" width=""]We defined fitting earlier (fitting describes how well a statistical model or process describes a data scientist's observations). Overfitting is a scenario where a statistical model or process seems to fit too well or appears to be too close to the actual data.[/box] Overfitting usually occurs with an overly simple model. This means that you may have only two variables and are drawing conclusions based on the two. For example, using our previously mentioned example of daffodil sales, one might generate a model with temperature as an independent variable and sales as a dependent one. You may see the model fail since it is not as simple as concluding that warmer temperatures will always generate more sales. In this example, there is a tendency to add more data to the process or model in hopes of achieving a better result. The idea sounds reasonable. For example, you have information such as average rainfall, pollen count, fertilizer sales, and so on; could these data points be added as explanatory variables? [box type="note" align="" class="" width=""]An explanatory variable is a type of independent variable with a subtle difference. When a variable is independent, it is not affected at all by any other variables. When a variable isn't independent for certain, it's an explanatory variable. [/box] Continuing to add more and more data to your model will have an effect but will probably cause overfitting, resulting in poor predictions since it will closely resemble the data, which is mostly just background noise. To overcome this situation, a data scientist can use regularization, introducing a tuning parameter (additional factors such as a data points mean value or a minimum or maximum limitation, which gives you the ability to change the complexity or smoothness of your model) into the data science process to solve an ill-posed problem or to prevent overfitting. Assessment When a data scientist evaluates a model or data science process for performance, this is referred to as assessment. Performance can be defined in several ways, including the model's growth of learning or the model's ability to improve (with) learning (to obtain a better score) with additional experience (for example, more rounds of training with additional samples of data) or accuracy of its results. One popular method of assessing a model or processes performance is called bootstrap sampling. This method examines performance on certain subsets of data, repeatedly generating results that can be used to calculate an estimate of accuracy (performance). The bootstrap sampling method takes a random sample of data, splits it into three files--a training file, a testing file, and a validation file. The model or process logic is developed based on the data in the training file and then evaluated (or tested) using the testing file. This tune and then test process is repeated until the data scientist is comfortable with the results of the tests. At that point, the model or process is again tested, this time using the validation file, and the results should provide a true indication of how it will perform. [box type="note" align="" class="" width=""]You can imagine using the bootstrap sampling method to develop program logic by analyzing test data to determine logic flows and then running (or testing) your logic against the test data file. Once you are satisfied that your logic handles all of the conditions and exceptions found in your testing data, you can run a final test on a new, never-before-seen data file for a final validation test. [/box] Cross-validation Cross-validation is a method for assessing a data science process performance. Mainly used with predictive modeling to estimate how accurately a model might perform in practice, one might see cross-validation used to check how a model will potentially generalize, in other words, how the model can apply what it infers from samples to an entire population (or recordset). With cross-validation, you identify a (known) dataset as your validation dataset on which training is run along with a dataset of unknown data (or first seen data) against which the model will be tested (this is known as your testing dataset). The objective is to ensure that problems such as overfitting (allowing non-inclusive information to influence results) are controlled and also provide an insight into how the model will generalize a real problem or on a real data file. The cross-validation process will consist of separating data into samples of similar subsets, performing the analysis on one subset (called the training set) and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Typically, a data scientist will use a models stability to determine the actual number of rounds of cross-validation that should be performed. Neural networks Neural networks are also called artificial neural networks (ANNs), and the objective is to solve problems in the same way that the human brain would. Google will provide the following explanation of ANN as stated in Neural Network Primer: Part I, by Maureen Caudill, AI Expert, Feb. 1989: [box type="note" align="" class="" width=""]A computing system made up of several simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. [/box] To oversimplify the idea of neural networks, recall the concept of software encapsulation, and consider a computer program with an input layer, a processing layer, and an output layer. With this thought in mind, understand that neural networks are also organized in a network of these layers, usually with more than a single processing layer. Patterns are presented to the network by way of the input layer, which then communicates to one (or more) of the processing layers (where the actual processing is done). The processing layers then link to an output layer where the result is presented. Most neural networks will also contain some form of learning rule that modifies the weights of the connections (in other words, the network learns which processing nodes perform better and gives them a heavier weight) per the input patterns that it is presented with. In this way (in a sense), neural networks learn by example as a child learns to recognize a cat from being exposed to examples of cats. Boosting In a manner of speaking, boosting is a process generally accepted in data science for improving the accuracy of a weak learning data science process. [box type="note" align="" class="" width=""]Data science processes defined as weak learners are those that produce results that are only slightly better than if you would randomly guess the outcome. Weak learners are basically thresholds or a 1-level decision tree. [/box] Specifically, boosting is aimed at reducing bias and variance in supervised learning. What do we mean by bias and variance? Before going on further about boosting, let's take note of what we mean by bias and variance. Data scientists describe bias as a level of favoritism that is present in the data collection process, resulting in uneven, disingenuous results and can occur in a variety of different ways. A sampling method is called biased if it systematically favors some outcomes over others. A variance may be defined (by a data scientist) simply as the distance from a variable mean (or how far from the average a result is). The boosting method can be described as a data scientist repeatedly running through a data science process (that has been identified as a weak learning process), with each iteration running on different and random examples of data sampled from the original population recordset. All the results (or classifiers or residue) produced by each run are then combined into a single merged result (that is a gradient). This concept of using a random subset of the original recordset for each iteration originates from bootstrap sampling in bagging and has a similar variance-reducing effect on the combined model. In addition, some data scientists consider boosting a means to convert weak learners into strong ones; in fact, to some, the process of boosting simply means turning a weak learner into a strong learner. Lift In data science, the term lift compares the frequency of an observed pattern within a recordset or population with how frequently you might expect to see that same pattern occur within the data by chance or randomly. If the lift is very low, then typically, a data scientist will expect that there is a very good probability that the pattern identified is occurring just by chance. The larger the lift, the more likely it is that the pattern is real. Mode In statistics and data science, when a data scientist uses the term mode, he or she refers to the value that occurs most often within a sample of data. Mode is not calculated but is determined manually or through processing of the data. Outlier Outliers can be defined as follows: A data point that is way out of keeping with the others That piece of data that doesn't fit Either a very high value or a very low value Unusual observations within the data An observation point that is distant from all others Predictive modeling The development of statistical models and/or data science processes to predict future events is called predictive modeling. Big Data Again, we have some variation of the definition of big data. A large assemblage of data, data sets that are so large or complex that traditional data processing applications are inadequate, and data about every aspect of our lives have all been used to define or refer to big data. In 2001, then Gartner analyst Doug Laney introduced the 3V's concept. The 3V's, as per Laney, are volume, variety, and velocity. The V's make up the dimensionality of big data: volume (or the measurable amount of data), variety (meaning the number of types of data), and velocity (referring to the speed of processing or dealing with that data). Confidence interval The confidence interval is a range of values that a data scientist will specify around an estimate to indicate their margin of error, combined with a probability that a value will fall in that range. In other words, confidence intervals are good estimates of the unknown population parameter. Writing Although visualizations grab much more of the limelight when it comes to presenting the output or results of a data science process or predictive model, writing skills are still not only an important part of how a data scientist communicates but still considered an essential skill for all data scientists to be successful. Did we miss any of your favorite terms? Now that you are at the end of this post, we ask you again: On a scale of 1 to 30 (1 being the lowest and 30, the highest), how do you rate yourself as a data scientist? Why You Need to Know Statistics To Be a Good Data Scientist [interview] How data scientists test hypotheses and probability 6 Key Areas to focus on while transitioning to a Data Scientist role Soft skills every data scientist should teach their child
Read more
  • 0
  • 0
  • 7604

article-image-analyzing-the-chief-data-officer-role
Aaron Lazar
13 Apr 2018
9 min read
Save for later

GDPR is pushing the Chief Data Officer role center stage

Aaron Lazar
13 Apr 2018
9 min read
Gartner predicted that by 2020 90% of large organizations in regulated industries will have a Chief Data Officer role. With the recent heat around Facebook, Mark Zuckerberg, and the fast-approaching GDPR compliance deadline, it’s quite likely that 2018 will be the year of the Chief Data Officer. This article was first published in October, 2017 and has been updated to keep up with the latest trends in GDPR. In 2014 around 400 CEOs and top business execs were asked how they recognize data as a corporate asset. They responded with a mixed set of reactions and viewed the worth of data in their organization in varied ways. Now, in 2018 these reactions have drastically changed – more and more organizations have realized the importance of Data-as-an-Asset. More importantly, the European Union (EU) has made it mandatory that General Data Protection Regulation (GDPR) compliance be sought by the 25th of May, 2018. The primary reason for creating the Chief Data Officer role was to connect the ends between functional management and IT teams in an organization. But now, it looks like the CDO will also be primarily focusing on setting up and driving GDPR compliance, to avoid a fine up to €20 million or 4% of the annual global turnover of the previous year, whichever is higher. We’re going to spend a few minutes breaking down the Chief Data Officer role for you, revealing several interesting insights along the way. Let’s start with the obvious question. What might the Chief Data Officer’s responsibilities be? Like other C-suite execs, a Chief Data Officer is expected to have a well-blended mix of technical know-how and business acumen. Their role is very diverse and sometimes comes across as a pain to define the scope. Here are some of the key responsibilities of a Chief Data Officer: Data Policies and GDPR Compliance Data security is one of the most important elements that any business must consider. It needs to comply with regulatory standards and requirements of the country where it operates. A Chief Data Officer is be responsible for ensuring the compliance of policies across all branches of business and the associated compliance requirement taxonomies, on a global level. What is GDPR? If you’re thinking there were no data protection laws before the General Data Protection Regulation 2016/679, that’s not true. The major difference, however, is that GDPR focuses more on customer data privacy and protection. GDPR requirements will change the way organizations store, process and protect their customers’ personal data. They will need solutions for assessing, implementing and maintaining GDPR compliance, and that’s where Chief Data Officers fit in. Using data to gain a competitive edge Chief Data Officers need to have sound knowledge of the business’ customers, markets where they operate, and strong analytical skills to leverage the right data, at the right time, at the right place. This would eventually give the business an edge over its competitors in the market. For example, a ferry service could use data to identify the rates that customers would be willing to pay at a certain time of the day. Setting in motion the best practices of data governance Organizations span across the globe these days and often employees from different parts of the world work on the same data. This can often result in data moving through unconnected systems, and ending up as inefficient or disjointed pieces of business information. A Chief Data Officer needs to ensure that this information is aggregated and maintained in such a way that clear information ownership across the organization is established. Architecting future-proof data solutions A Chief Data Officer often acts like a Data Architect. They will sometimes take on responsibility for planning, designing, and building Big Data systems and ensuring successful integration with other systems in an organization. Designing systems that can provide answers to the user’s problems now and in the future is vital. Chief Data Officers are often found asking themselves key questions like how to generate data with maximum reusability, while also making sure it’s as accurate and relevant as possible. Defining Information Management tools Different business units across the globe tend to use different tools, technologies, and systems to work on, store, and share information on an enterprise level. This greatly affects a company’s ability to access and leverage data for effective decision making and various other duties. A Chief Data Officer is responsible for establishing data-oriented standards across the business and for driving all arms of the business to comply with the standards and embrace change, to ensure the integrity of data. Spotting new opportunities A Chief Data Officer is responsible for spotting new opportunities where the business can venture into through careful analysis of data and past records. For example, a motor company would leverage certain sales information to make an informed decision on what age group to target with their new SUV in-the-making, to maximize sales. These are just the tip of the iceberg when it comes to a Chief Data Officer’s responsibilities. Responsibilities go hand-in-hand with skill and traits required to execute those responsibilities. Below are some key capabilities sought after in a CDO. Key Chief Data Officer Skills The right person for the job is expected to possess impeccable leadership and C-Suite level communication skills, as well as strong business acumen. They are expected to have strong knowledge of GDPR software tools and solutions to enable the organizations hiring them to swiftly transform and adopt the new regulations. They are also expected to possess knowledge of IT architecture, including a familiarity with leading architectural standards such as TOGAF or the Zachman framework. They need to be experienced in driving data governance as well as data quality and integrity, while also possessing a strong knowledge of data analytics, visualization, and storytelling. Familiar with Big Data solutions like Hadoop, MapReduce, and HBase is a plus. How much do Chief Data Officers earn?  Now, let’s take a look at what kind of compensation a Chief Data Officer is likely to be offered. To tell you the truth, the answer to this question is still a bit hazy, but it’s sure to pick up speed with the recent developments in the regulatory and legal areas related to data. About a year ago, a blog post from careeraddict revealed that the salary for a CDO in the US was around $112,000 annually. A job listing seen on Indeed quoted $200,000 as the annual salary. Indeed shows 7 jobs posted for a CDO in the last 15 days. We took the estimated salaries of CDOs and compared them with those of CIOs and CTOs in the same company. It turns out that most were on par, with a few CDO compensations falling slightly short of the CIO and CTO salary. These are just basic salary figures. Bonuses add on the side, amounting up to 50% in some cases. However, please note that these salary figures vary heavily based on the type of organization and the industry. Do businesses even need a Chief Data Officer? One might argue that some of the skills expected of a Chief Data Officer would also be held by the CIO or the Chief Digital Officer of the organization or the Data Protection Officer (if they have one). Then, why have a Chief Data Officer at all and incur an extra significant cost to the company? With the rapid change in tech and the rate at which data is generated, used, and discarded, most data pointers, point in the direction of having a separate Chief Data Officer, working alongside the CIO. It’s critical to have a clearly defined need for both roles to co-exist. Blurring the boundaries of the two roles can be detrimental and organizations must, therefore, be painstakingly mindful of the defined KRAs. The organisation should clearly define the two roles to keep the business structure running smoothly. A Chief Data Officer’s main focus will be on the latest data-centric technological innovations, their compliance to the new standards while also boosting customer engagement, privacy and in turn, loyalty and the business’s competitive advantage. The CIO, on the other hand, focuses on improving the bottom line by owning business productivity metrics, cost-cutting initiatives, making IT investments etc. – i.e., an inward facing data-management and architecture role. The CIO is the person who is therefore responsible for leading digital initiatives at a board level. In addition to managing data and governing information, if the CIO’s responsibilities were to also include implementing analytics in fresh ways to generate value for the business, it is going to be a tall order. To put it simply, it is more practical for the CIO to own the systems and the CDO to oversee all the bits and bytes that flow through these systems. Moreover, in several cases, the Chief Data Officer will act as a liaison between the business and IT. Thus, Chief Data Officers and CIOs both need to work together and support each other for a better functioning business. The bottom line: a Chief Data Officer is essential For an organization dealing with a lot of data, a Chief Data Officer is a must. Failure to have one on board can result in being fined €10 million euros or 2% of the organization’s worldwide turnover (depending on which is higher). Here are the criteria for an organization to have a dedicated personnel managing Data Protection. The organization’s core activities should: Have data processing operations which require regular and systematic monitoring of data subjects on a large scale or monitoring of individuals Be processing a large scale of special categories of data (i.e. sensitive data such as health, religion, race, sexual orientation etc.) Have data processing carried out by a public authority or a body processing personal data, except for courts operating in their judicial capacity Apart from this mandate, a Chief Data Officer can add immense value by aligning data-driven insights with its vision and goals. A CDO can bridge the gap between the CMO and the CIO, by focusing on meeting customer requirements through data-driven products. For those in data and insights centric roles such as data scientists, data engineers, data analysts and others, the CDO is a natural destination for their career progression journey. The Chief Data Officer role is highly attractive in terms of the scope of responsibilities, the capabilities and of course, the pay. Certification courses like this one are popping up to help individuals shape themselves for the role. All-in-all, this new C-suite position in most organizations, is the perfect pivot between old and new, bridging silos, and making a future where data privacy is intact.
Read more
  • 0
  • 0
  • 5981

article-image-why-data-science-needs-great-communicators
Erik Kappelman
16 Jan 2018
4 min read
Save for later

Why data science needs great communicators

Erik Kappelman
16 Jan 2018
4 min read
One of the biggest problems facing data science (and many other technical industries) today is communication. This is true on both an individual level, but at a much bigger organizational and cultural level. On the one hand, we can all be better communicators, but at the same time organizations and businesses can do a lot more to facilitate knowledge and information sharing. At an individual level, it’s important to recognize that some people find communication very difficult. Obviously it’s a cliché that many of these people find themselves in technical industries, and while we shouldn’t get stuck on stereotypes, there is certainly an element of truth in it. The reasons why this might be the case is incredibly complex, but it may be true that part of the problem is how technology has been viewed within institutions and other organizations. This is the sort of attitude that says “those smart people just have bad social skills. We should let them do what they're good at and leave them alone.” There are lots of problems with this and it isn’t doing anyone any favors, from the people that struggle with communication to the organizations who encourage this attitude. Statistics and communicating insights Let’s take a field like statistics. There is a notion that you do not need to be good at communicating to be good at statistics; it is often viewed as a primarily numerical and technical skill. However, when you think about what statistics really is, it becomes clear that that is nonsensical. The primary purpose of the field is to tease out information and insights from noisy data and then communicate those insights. If you don’t do that you’re not doing statistics. Some forms of communication are inherent to statistical research; graphs and charts communicate the meaning of data and most statisticians or data scientists have a well worn skill of chart making. But there’s more than just charts – great visualizations, great presentations can all be the work of talented statisticians.  Of course, there are some data-related roles where communication is less important. If you’re working on data processing and storage, for example, being a great communicator may not be quite as valuable. But consider this: if you can’t properly discuss and present why you’re doing what you’re doing to the people that you work with and the people that matter in your organization you’re immediately putting up a barrier to success. The data explosion makes communication even more important There is an even bigger reason data science needs great communicators and it has nothing to do with individual success. We have entered what I like to call the Data Century. Computing power and tools using computers, like the Internet, hit a sweet spot somewhere around the new millennium and the data and analysis now available to the world is unprecedented. Who knows what kind of answers this mass of data holds? Data scientists are at the frontier of the greatest human exploration since the settling of the New World. This exploration is faced inward, as we try to understand how and why human beings do various things by examining the ever growing piles of data. If data scientists cannot relay their findings, we all miss out on this wonderful time of exploration and discovery. People need data scientists to tell them about the whole new world of data that we are just entering. It would be a real shame if the data scientists didn’t know how. Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for the Department of Transportation as a transportation demand modeler.
Read more
  • 0
  • 0
  • 4562

article-image-5-data-science-tools-matter-2018
Richard Gall
12 Dec 2017
3 min read
Save for later

5 data science tools that will matter in 2018

Richard Gall
12 Dec 2017
3 min read
We know your time is valuable. That's why what matters is important. We've written about the trends and issues that are going to matter in data science, but here you can find 5 data science tools that you need to pay attention to in 2018. Read our 5 things that matter in data science in 2018 here. 1. TensorFlow Google's TensorFlow has been one of the biggest hits of 2017 when it comes to libraries. It’s arguably done a lot to make machine learning more accessible than ever before. That means more people actually building machine learning and deep learning algorithms, and the technology moving beyond the domain of data professionals and into other fields. So, if TensorFlow has passed you by we recommend you spend some time exploring it. It might just give your skill set the boost you’re looking for. Explore TensorFlow content here. 2.Jupyter Jupyter isn’t a new tool, sure. But it’s so crucial to the way data science is done that it’s importance can’t be understated. And as pressure is placed on data scientists and analysts to communicate and share data in ways that empower stakeholders in a diverse range of roles and departments. It’s also worth mentioning its relationship with Python - we’ve seen Python go from strength to strength throughout 2017, and showing no signs of letting up; the close relationship between the two will only serve to make Jupyter more popular across the data science world. Discover Jupyter eBooks and videos here. 3. Keras In a year when deep learning has captured the imagination, it makes sense to include both libraries helping to power it. It’s a close call between Keras and TensorFlow which deep learning framework is ‘better’ - ultimately, like everything, it’s about what you’re trying to do. This post explores the difference between Keras and TensorFlow very well - the conclusion is ultimately that while TensorFlow offers more ‘control’, Keras is the library you want if you simply need to get up and running. Both libraries have had a huge impact in 2017, and we’re only going to be seeing more of them in 2018. Learn Keras. Read Deep Learning with Keras. 4. Auto SkLearn Automated machine learning is going to become incredibly important in 2018. As pressure mounts on engineers and analysts to do more with less, tools like Auto SKLearn will be vital in reducing some of the ‘manual labour’ of algorithm selection and tuning. 5. Dask This one might be a little unexpected. We know just how popular Apache Spark is when it comes to distributed and parallel computing, but Dask represents an interesting competitor that’s worth watching throughout 2018. It’s high-level API integrates exceptionally well with Python libraries like NumPy and pandas; it’s also much more lightweight than Spark, so it could be a good option if you want to avoid building out a weighty big data tech stack. Explore Dask in the latest edition of Python High Performance.
Read more
  • 0
  • 0
  • 3238
article-image-5-things-that-matter-data-science-2018
Richard Gall
11 Dec 2017
4 min read
Save for later

5 things that will matter in data science in 2018

Richard Gall
11 Dec 2017
4 min read
The world of data science is now starting to change quickly. This was arguably the year when discussions around AI and automation started to escalate, taking on more and more importance in the public sphere. But as interesting as all that is, there are nevertheless real people - like you - actually working with data not to rig elections or steal someone’s jobs but simply to make things better. Arguably, data science and analysis has never been under the spotlight to the extent it is today. Whereas a decade ago there was a whole lotta hope stored in the big data revolution, today there’s anxiety that we’re not doing enough with data, that we don’t have the right data. That makes it a challenging but important time to be working in the world of data. With that in mind, here are our 5 things that will matter in data science in 2018… Find out what 5 data science tools we think will matter most in 2018 here. 1. The ethical considerations in machine learning and artificial intelligence This is huge, but it can’t be ignored. At its heart, this is important because it highlights that there’s human agency at the heart of modern data science, that algorithms are things created and designed by the engineers behind them. But even more important than that, these ethical considerations will be important in 2018 because it will end up defining everyone’s relationship to data for decades to come. And yes, although legislative bodies may play a part in that, it’s also up to people actually working with data to contribute to that discussion about what data does, who uses it and why. That might sound like a lot of responsibility, but it makes things pretty exciting, no? 2. Greater alignment between data projects and business goals This has long been a challenge for just about every business and, indeed, anyone who works in data - architect, analyst or scientist. But as the data hype curve flattens out with more organizations taking advantage of the opportunities it offers, with budgets getting tighter and expectations higher than they have ever been, ensuring that data programs are delivering real value will be crucial in 2018. That means there will be more pressure on data pros to deliver; sharpening your commercial instincts will be essential, and could be your route to the next step in your career. 3. Automated machine learning If budgets are getting tighter and management expectations are higher than ever, the emergence of automated machine learning will be a godsend for 2018. Automated machine learning isn’t a threat to anyone’s job - it’s simply a way of making the steps of algorithm selection and optimization much faster. If you’ve ever lamented the time you’ve spent tweaking an algorithm only for it not to work as you wanted it to, only to move to a further iteration to find a similar problem, automated machine learning will automate away all those iterations. What this means is that you’ll be able to spend more time on value-adding activities that will never be automated away. And in turn this will make you a more valuable data scientist. 4. Taking advantage of cloud Cloud has been a big trend for some years now. But as a word on it’s own it’s always felt a bit abstract and amorphous. However, it’s once you start to see how it can be put into practice that you begin to see how potentially transformative it might be. In the case of machine learning, cloud becomes a vital solution in the battle for resources - it makes machine learning at scale more accessible to more people. The key tool here is Google’s cloud machine learning engine - it’s been built to make building machine learning models as straightforward as possible. When you look at this alongside automated machine learning, it’s possible to suggest that the data science skill set might change somewhat throughout 2018… 5. Better self-service BI 2018 is the year when all employees will need to be empowered by data. The idea that a specific team handles everything relating to data will end; using data will be crucial to a range of different stakeholders. This doesn’t mean the end of the data scientist - as said earlier, no one is going to be losing their jobs. But it does mean that self-service BI tools are going to take on greater importance than ever before in 2018. That means data scientists may have to start thinking more like data architects (especially if there’s no data architect in their organization), and taking into consideration how they make their work accessible and meaningful for stakeholders all around their organization.
Read more
  • 0
  • 0
  • 2980

article-image-whats-difference-between-data-scientist-and-data-analyst
Erik Kappelman
10 Oct 2017
5 min read
Save for later

What's the difference between a data scientist and a data analyst

Erik Kappelman
10 Oct 2017
5 min read
It sounds like a fairly pedantic question to ask what the difference between a data scientist and data analyst is. But it isn't - in fact, it's a great question that illustrates the way data-related roles have evolved in businesses today. It's pretty easy to confuse the two job roles - there's certainly a lot of misunderstanding on the difference between a data scientist and a data analyst even within a managerial environment. Comparing data analysts and data scientists Data analysts are going to be dealing with data that you might remember from your statistics classes. This data might come from survey results, lab experiments of various sorts, longitudinal studies, or another form of social observation. Data may also come from observation of natural or created phenomenons, but the data’s form would still be similar. Data scientists on the other hand, are going to looking at things like metadata from billions of phone calls, data used to forecast Bitcoin prices that have been scraped from various places around the Internet, or maybe data related to Internet searches before and after some important event. So their data is often different, but is that all? The tools and skillset required for each is actually quite different as well. Data science is much more entwined with the field of computer science than data analysis. A good data analyst should have working knowledge of how computers, networks, and the Internet function, but they don’t need to be an expert in any of these things. Data analyst really just need to know a good scripting language that is used to handle data, like Python or R, and maybe a more mathematically advanced tool like MatLab or Mathematica for more advanced modeling procedures. A data analyst could have a fruitful career knowing only about that much in the realm of technology. Data scientists, however, need to know a lot about how networks and the Internet work. Most data scientists will need to have mastered HTTP, HTML, XML and SQL as well as scripting languages like Ruby or Python, and also object-oriented languages like Java or C. This is because data scientists spend a lot more time capturing, manipulating, storing and moving around data than a data analyst would. These tasks require a different skillset. Data analysts and data scientists have different forms of conceptual understanding There will also likely be a difference in the conceptual understanding of a data analyst versus a data scientist. If you were to ask both a data scientist and a data analyst to derive and twice differentiate the log likelihood function of the binomial logistic regression model, it is more likely the data analyst would be able to do it. I would expect data analysts to have a better theoretical understanding of statistics than a data scientist. This is because data scientists don’t really need much theoretical understanding in order to be effective. A data scientist would be better served by learning more about capturing data and analyzing streams of data than theoretical statistics. Differences are not limited to knowledge or skillset, how data scientists and data analysts approach their work is also different. Data analysts generally know what they are looking for as they begin their analysis. By this I mean, a data analyst may be given the results of a study of a new drug, and the researcher may ask the analyst to explore and hopefully quantify the impact of a new drug. A data analyst would have no problem performing this task. A data scientist on the other hand, could be given the task of analyzing locations of phone calls and finding any patterns that might exist. For the data scientist, the goal is often less defined than it is for a data analyst. In fact, I think this is the crux of the entire difference. Data scientists perform far more exploratory data analysis than their data analyst cousins. This difference in approach really explains the difference in skill sets. Data scientists have skill sets that are primarily geared toward extracting, storing and finding uses for data. The skill set to perform these tasks is the skill set of a data scientist. Data analysts primarily analyze data and their skill set reflects this. Just to add one more little wrinkle, while calling a data scientist a data analyst is basically correct, calling a data analyst a data scientist is probably not correct. This is because the data scientist is going to have a handle on more of the skills required of a data analyst than a data analyst would of a data scientist. This is another reason there is so much confusion around this subject. Clearing up the difference between a data scientist and data analyst So now, hopefully, you can tell the difference between a data scientist and a data analyst. I don’t believe either field is superior to the other. If you are choosing between which field you would like to pursue, what’s important is that you choose the field that best compliments your skill set. Luckily it's hard to go wrong because both data scientists and analysts usually have interesting and rewarding careers.
Read more
  • 0
  • 0
  • 3401

article-image-top-5-misconceptions-about-data-science
Erik Kappelman
02 Oct 2017
6 min read
Save for later

Top 5 misconceptions about data science

Erik Kappelman
02 Oct 2017
6 min read
Data science is a well-defined, serious field of study and work. But the term ‘data science’ has become a bit of a buzzword. Yes, 'data scientists’ have become increasingly important to many different types of organizations, but it has also become a trend term in tech recruitment. The fact that these words are thrown around so casually has led to a lot of confusion about what data science and data scientists actually is and are. I would formerly include myself in this group. When I first heard the word data scientist, I assumed that data science was actually just statistics in a fancy hat. Turns out I was quite wrong. So here are the top 5 misconceptions about data science. Data science is statistics and vice versa I fell prey to this particular misconception myself. What I have come to find out is that statistical methods are used in data science, but conflating the two is really inaccurate. This would be somewhat like saying psychology is statistics because research psychologists use statistical tools in studies and experiments. So what's the difference? I am of the mind that the primary difference lies in the level of understanding of computing required to succeed in each discipline. While many statisticians have an excellent understanding of things like database design, one could be a statistician and actually know nothing about database design. To succeed as a statistician, all the way up to the doctoral level, you really only need to master basic modeling tools like R, Python, and MatLab. A data scientist needs to be able to mine data from the Internet, create machine learning algorithms, design, build and query databases and so on. Data science is really computer science This is the other half of the first misconception. While it is tempting to lump data science in with computer science, the two are quite different. For one thing, computer science is technically a field of mathematics focused on algorithms and optimization, and data science is definitely not that. Data science requires many skills that overlap with those of computer scientists, but data scientists aren’t going to need to know anything about computer hardware, kernels, and the like. A data scientist ought to have some understanding of network protocols, but even here, the level of understanding required for data science is nothing like the understanding held by the average computer scientist. Data scientists are here to replace statisticians In this case, nothing could be further from the truth. One way to keep this straight is that statisticians are in the business of researching existing statistical tools as well as trying to develop new statistical tools. These tools are then turned around and used by data scientists and many others. Data scientists are usually more focused on applied solutions to real problems and less interested in what many might regard as pure research. Data science is primarily focused on big data This is an understandable misconception. Just so we’re clear, Wikipedia defines big data as “a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.” Then big data is really just the study of how to deal with, well, big datasets. Data science absolutely has a lot to contribute in this area. Data scientists usually have skills that work really well when it comes to analyzing big data. Skills related to databases, machine learning, and how data is transferred around a local network or the internet, are skills most data scientists have, and are very helpful when dealing with big data. But data science is actually very broad in scope. big data is a hot topic right now and receiving a lot of attention. Research into the field is receiving a lot private and public funding. In any situation like this, many different types of people working in a diverse range of areas are going to try to get in on the action. As a result, talking up data science's connection to big data makes sense if you're a data scientist - it's really about effective marketing. So, you might work with big data if you're a data scientist - but data science is also much, much more than just big data. Data scientists can easily find a job I thought I would include this one to add a different perspective. While there are many more misconceptions about what data science is or what data scientists do, I think this is actually a really damaging misconception and should be discussed. I hear a lot of complaints these days from people with some skill set that is sought after not being able to find gainful employment. Data science is like any other field, and there is always going to be a whole bunch of people that are better at it than you. Don’t become a data scientist because you’re sure to get a job - you’re not. The industries related to data science are absolutely growing right now, and will continue to do so for the foreseeable future. But that doesn’t mean people who can call themselves data scientists just automatically get jobs. You have to have the talent, but you also need to network and do all the same things you need to do to get on in any other industry. The point is, it's not easy to get a job no matter what your field is; study and practice data science because it's awesome, don’t do it because you heard it’s a sure way to get a job. Misconceptions abound, but data science is a wonderful field of research, study, and practice. If you are interested in pursuing a career or degree related to data science, I encourage you to do so, however, make sure you have the right idea about what you’re getting yourself into. Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for theDepartment of Transportation as a transportation demand modeler.
Read more
  • 0
  • 0
  • 3111
article-image-data-science-getting-easier
Erik Kappelman
10 Sep 2017
5 min read
Save for later

Is data science getting easier?

Erik Kappelman
10 Sep 2017
5 min read
The answer is yes, and no. This is a question that could've easily been applied to textile manufacturing in the 1890s, and could've received a similar answer. By this I mean, textile manufacturing improved leaps and bounds throughout the industrial revolution, however, despite their productivity, textile mills were some of the most dangerous places to work. Before I further explain my answer, let’s agree on a definition for data science. Wikipedia defines data science as, “an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured.” I see this as the process of acquiring, managing, and analyzing data. Advances in data science First, let's discuss why data science is definitely getting easier. Advances in technology and data collection have made data science easier. For one thing, data science as we know it wasn’t even possible 40 years ago, but due to advanced technology we can now analyze, gather, and manage data in completely new ways. Scripting languages like R and Python have mostly replaced more convoluted languages like Haskell and Fortran in the realm of data analysis. Tools like Hadoop bring together a lot of different functionality to expedite every element of data science. Smartphones and wearable tech collect data more effectively and efficiently than older data collection methods, which gives data scientists more data of higher quality to work with. Perhaps most importantly, the utility of data science has become more and more recognized throughout the broader world. This helps provide data scientists the support they need to be truly effective. These are just some of the reasons why data science is getting easier. Unintended consequences While many of these tools make data science easier in some respects, there are also some unintended consequences that actually might make data science harder. Improved data collection has been a boon for the data science industry, but using the data that is streaming in is similar to drinking out of a firehose. Data scientists are continually required to come up with more complicated ways of taking data in, because the stream of data has become incredibly strong. While R and Python are definitely easier to learn than older alternatives, neither language is usually accused of being parsimonious. What a skilled Haskell programming might be able to do in 100 lines, might take a less skilled Python scripter 500 lines. Hadoop, and tools like it, simplify the data science process, but it seems like there are 50 new tools like Hadoop a day. While these tools are powerful and useful, sometimes data scientists spend more time learning about tools and less time doing data science, just to keep up with the industry’s landscape. So, like many other fields related to computer science and programming, new tech is simultaneously making things easier and harder. Golden age of data science Let me rephrase the title question in an effort to provide even more illumination: is now the best time to be a data scientist or to become one? The answer to this question is a resounding yes. While all of the current drawbacks I brought up remain true, I believe that we are in a golden age of data science, for all of the reasons already mentioned, and more. We have more data than ever before and our data collection abilities are improving at an exponential rate. The current situation has gone so far as to create the necessity for a whole new field of data analysis, Big Data. Data science is one of the most vast and quickly expanding human frontiers at present. Part of the reason for this is what data science can be used for. Data science can effectively answer questions that were previously unanswered. Of course this makes for an attractive field of study from a research standpoint. One final note on whether or not data science is getting easier. If you are a person who actually creates new methods or techniques in data science, especially if you need to support these methods and techniques with formal mathematical and scientific reasoning, data science is definitely not getting easier for you. As I just mentioned, Big Data is a whole new field of data science created to deal with new problems caused by the efficacy of new data collection techniques. If you are a researcher or academic, all of this means a lot of work. Bootstrapped standard errors were used in data analysis before a formal proof of their legitimacy was created. Data science techniques might move at the speed of light, but formalizing and proving these techniques can literally take lifetimes. So if you are a researcher or academic, things will only get harder. If you are more of a practical data scientist, it may be slightly easier for now, but there’s always something! About the Author Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for theDepartment of Transportation as a transportation demand modeler.
Read more
  • 0
  • 0
  • 3373

article-image-containerized-data-science-docker
Darwin Corn
03 Jul 2016
4 min read
Save for later

Containerized Data Science with Docker

Darwin Corn
03 Jul 2016
4 min read
So, you're itching to begin your journey into data science but you aren't sure where to start. Well, I'm glad you’ve found this post since I will give the details in a step-by-step fashion as to how I circumvented the unnecessarily large technological barrier to entry and got my feet wet, so to speak. Containerization in general and Docker in particular have taken the IT world by storm in the last couple of years by making LXC containers more than just VM alternatives for the enterprising sysadmin. Even if you're coming at this post from a world devoid of IT, the odds are good that you've heard of Docker and their cute whale mascot. Of course, now that Microsoft is on board, the containerization bandwagon and a consortium of bickering stakeholders have formed, so you know that container tech is here to stay. I know, FreeBSD has had the concept of 'jails' for almost two decades now. But thanks to Docker, container tech is now usable across the big three of Linux, Windows and Mac (if a bit hack-y in the case of the latter two), and today we're going to use its positives in an exploration into the world of data science. Now that I have your interest piqued, you're wondering where the two intersect. Well, if you're like me, you've looked at the footprint of R-studio and the nightmare maze of dependencies of IPython and “noped” right out of there. Thanks to containers, these problems are solved! With Docker, you can limit the amount of memory available to the container, and the way containers are constructed ensures that you never have to deal with troubleshooting broken dependencies on update ever again. So let's install Docker, which is as straightforward as using your package manager in Linux, or downloading Docker Toolbox if you're using a Mac or Windows PC, and running the installer. The instructions that follow will be tailored to a Linux installation, but are easily adapted to Windows or Mac as well. On those two platforms, you can even bypass these CLI commands and use Kitematic, or so I hear. Now that you have Docker installed, let's look at some use cases for how to use it to facilitate our journey into data science. First, we are going to pull the Jupyter Notebook container so that you can work with that language-agnostic tool. # docker run --rm -it -p 8888:8888 -v "$(pwd):/notebooks" jupyter/notebook The -v "$(pwd):/notebooks" flag will mount the current directory to the /notebooks directory in the container, allowing you to save your work outside the container. This will be important because you’ll be using the container as a temporary working environment. The --rm flag ensures that the container is destroyed when it exits. If you rerun that command to get back to work after turning off your computer for instance, the container will be replaced with an entirely new one. That flag allows it access to the folder on the local filesystem, ensuring that your work survives the casually disposable nature of development containers. Now go ahead and navigate to http://localhost:8888, and let's get to work. You did bring a dataset to analyze in a notebook, right? The actual nuts and bolts of data science are beyond the scope of this post, but for a quick intro to data and learning materials, I've found Kaggle to be a great resource. While we're at it, you should look at that other issue I mentioned previously—that of the application footprint. Recently a friend of mine convinced me to use R, and I was enjoying working with the language until I got my hands on some real data and immediately felt the pain of an application not designed for endpoint use. I ran a regression and it locked up my computer for minutes! Fortunately, you can use a container to isolate it and only feed it limited resources to keep the rest of the computer happy. # docker run -m 1g -ti --rm r-base This command will drop you into an interactive R CLI that should keep even the leanest of modern computers humming along without a hiccup. Of course, you can also use the -c and --blkio-weight flags to restrict access to the CPU and HDD resources respectively, if limiting it to the GB of RAM wasn't enough. So, a program installation and a command or two (or a couple of clicks in the Kitematic GUI), and we're off and running using data science with none of the typical headaches. About the Author Darwin Corn is a systems analyst for the Consumer Direct Care Network. He is a mid-level professional with diverse experience in the information technology world.
Read more
  • 0
  • 0
  • 3231