Tech Guides - Data Analysis

Top 7 libraries for geospatial analysis

Aarthi Kumaraswamy
22 May 2018
12 min read
The term geospatial refers to information that is located on the earth's surface. This can include, for example, the position of a cellphone tower, the shape of a road, or the outline of a country. Geospatial data often associates some piece of information with a particular location. Geospatial development is the process of writing computer programs that can access, manipulate, and display this type of information.

Internally, geospatial data is represented as a series of coordinates, often in the form of latitude and longitude values. Additional attributes, such as temperature, soil type, height, or the name of a landmark, are also often present. There can be many thousands (or even millions) of data points for a single set of geospatial data. In addition to the prosaic tasks of importing geospatial data from various external file formats and translating data from one projection to another, geospatial data can also be manipulated to solve various interesting problems. Obvious examples include calculating the distance between two points, calculating the length of a road, or finding all data points within a given radius of a selected point. We use libraries to solve all of these problems and more. Today we will look at the major libraries used to process and analyze geospatial data:

GDAL/OGR
GEOS
Shapely
Fiona
Python Shapefile Library (pyshp)
pyproj
Rasterio
GeoPandas

This is an excerpt from the book Mastering Geospatial Analysis with Python by Paul Crickard, Eric van Rees, and Silas Toms.

Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library

The Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library are two separate libraries that are generally downloaded together as GDAL. This means that installing the GDAL package also gives access to OGR functionality. GDAL is covered first because the other packages were written after it, so chronologically it comes first. As you will notice, some of the packages covered in this post extend GDAL's functionality or use it under the hood.

GDAL was created in the 1990s by Frank Warmerdam and saw its first release in June 2000. Later, development of GDAL was transferred to the Open Source Geospatial Foundation (OSGeo). Technically, GDAL is a little different from your average Python package, as the GDAL package itself was written in C and C++, meaning that in order to use it in Python you need to compile GDAL and its associated Python bindings. However, using conda and Anaconda makes it relatively easy to get started quickly. Because it was written in C and C++, the online GDAL documentation is written for the C++ version of the libraries. For Python developers this can be challenging, but many functions are documented and can be consulted with the built-in pydoc utility or by using the help function within Python.

Because of its history, working with GDAL in Python also feels a lot like working in C++ rather than pure Python. For example, OGR's naming convention differs from Python's: functions use uppercase names rather than lowercase. These differences explain the choice behind some of the other Python libraries, such as Rasterio and Shapely (also covered in this post), which have been written from a Python developer's perspective but offer the same GDAL functionality.

GDAL is a massive and widely used library for raster data. It supports the reading and writing of many raster file formats, with the latest version supporting up to 200 different file formats. Because of this, it is indispensable for geospatial data management and analysis. Used together with other Python libraries, GDAL enables some powerful remote sensing functionality. It is also an industry standard and is present in commercial and open source GIS software.

The OGR library is used to read and write vector-format geospatial data, supporting many different formats. OGR uses a consistent model to be able to manage many different vector data formats. You can use OGR to do vector reprojection, vector data format conversion, vector attribute data filtering, and more. The GDAL/OGR libraries are not only useful for Python programmers but are also used by many GIS vendors and open source projects. The latest GDAL version at the time of writing is 2.2.4, which was released in March 2018.
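
To make GDAL/OGR's C++-flavoured interface concrete, here is a minimal, hedged sketch of opening a shapefile with OGR and a GeoTIFF with GDAL. The file names are placeholders, and the snippet assumes the osgeo Python bindings are installed; note the uppercase, C++-style method names mentioned above.

```python
# Minimal sketch of GDAL/OGR usage; "countries.shp" and "elevation.tif"
# are placeholder file names, not files shipped with the library.
from osgeo import gdal, ogr

print(gdal.VersionInfo())               # report the installed GDAL version

# OGR for vector data: open a shapefile and inspect its first layer
vector_ds = ogr.Open("countries.shp")
layer = vector_ds.GetLayer(0)
print(layer.GetFeatureCount(), layer.GetGeomType())

# GDAL for raster data: open a GeoTIFF and read one band into a NumPy array
raster_ds = gdal.Open("elevation.tif")
band = raster_ds.GetRasterBand(1)
values = band.ReadAsArray()             # returns a NumPy array of pixel values
print(raster_ds.RasterXSize, raster_ds.RasterYSize, values.mean())
```
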
GEOS

The Geometry Engine Open Source (GEOS) is the C/C++ port of a subset of the Java Topology Suite (JTS) and selected functions. GEOS aims to contain the complete functionality of JTS in C++. It can be compiled on many platforms, including for Python. As you will see later on, the Shapely library uses functions from the GEOS library. In fact, there are many applications using GEOS, including PostGIS and QGIS. GeoDjango also uses GEOS, as well as GDAL, among other geospatial libraries. GEOS can also be compiled with GDAL, giving OGR all of its capabilities.

The JTS is an open source geospatial computational geometry library written in Java. It provides various functionalities, including a geometry model, geometric functions, spatial structures and algorithms, and I/O capabilities. Using GEOS, you have access to the following capabilities: geospatial functions (such as within and contains), geospatial operations (union, intersection, and many more), spatial indexing, Open Geospatial Consortium (OGC) well-known text (WKT) and well-known binary (WKB) input/output, the C and C++ APIs, and thread safety.

Shapely

Shapely is a Python package for the manipulation and analysis of planar features, using functions from the GEOS library (the engine of PostGIS) and a port of the JTS. Shapely is not concerned with data formats or coordinate systems, but it can be readily integrated with packages that are. Shapely only deals with analyzing geometries and offers no capabilities for reading and writing geospatial files. It was developed by Sean Gillies, who was also the person behind Fiona and Rasterio.

Shapely supports eight fundamental geometry types that are implemented as classes in the shapely.geometry module: points, multipoints, linestrings, multilinestrings, linearrings, multipolygons, polygons, and geometrycollections. Apart from representing these geometries, Shapely can be used to manipulate and analyze geometries through a number of methods and attributes.

Shapely has largely the same classes and functions as OGR for dealing with geometries. The difference between Shapely and OGR is that Shapely has a more Pythonic and very intuitive interface, is better optimized, and has well-developed documentation. With Shapely, you're writing pure Python, whereas with GEOS, you're effectively writing C++ in Python. For data munging, a term used for data management and analysis, you're better off writing in pure Python rather than C++, which explains why these libraries were created.

For more information on Shapely, consult the documentation. That page also has detailed information on installing Shapely on different platforms and on building Shapely from source for compatibility with other modules that depend on GEOS; installing Shapely may require you to upgrade NumPy and GEOS if they are already installed.
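
Below is a small illustrative sketch of Shapely's geometry classes and GEOS-backed operations (predicates such as contains, operations such as intersection and buffer). The coordinates are invented for demonstration.

```python
# Sketch of Shapely geometry types and operations with made-up coordinates.
from shapely.geometry import Point, Polygon, LineString

road = LineString([(0, 0), (2, 2), (5, 2)])
park = Polygon([(1, 0), (4, 0), (4, 3), (1, 3)])
tower = Point(3, 1)

print(road.length)               # length of the line
print(park.contains(tower))      # geospatial predicate: True
print(road.intersection(park))   # geometric operation returning a new geometry
print(tower.buffer(1.5).area)    # area of a 1.5-unit buffer around the point
```
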
Fiona

Fiona is the API of OGR. It can be used for reading and writing data formats. The main reason for using it instead of OGR is that it is closer to Python than OGR, as well as more dependable and less error-prone. It makes use of two markup languages, WKT and WKB, for representing spatial information with regard to vector data. As such, it combines well with other Python libraries such as Shapely: you would use Fiona for input and output, and Shapely for creating and manipulating geospatial data.

While Fiona is Python compatible and our recommendation, users should also be aware of some of its disadvantages. It is more dependable than OGR because it uses Python objects for copying vector data instead of C pointers, but this also means that it uses more memory, which affects performance.

Python Shapefile Library (pyshp)

The Python Shapefile Library (pyshp) is a pure Python library used to read and write shapefiles. The pyshp library's sole purpose is to work with shapefiles, and it only uses the Python standard library. You cannot use it for geometric operations. If you're only working with shapefiles, this one-file-only library is simpler than using GDAL.

pyproj

pyproj is a Python package that performs cartographic transformations and geodetic computations. It is a Cython wrapper that provides Python interfaces to PROJ.4 functions, meaning you can access an existing library of C code from Python. PROJ.4 is a projection library that transforms data among many coordinate systems and is also available through GDAL and OGR. The reason that PROJ.4 is still popular and widely used is two-fold: firstly, because it supports so many different coordinate systems, and secondly, because of the routes it provides to them: Rasterio and GeoPandas, two Python libraries covered next, both use pyproj and thus PROJ.4 functionality under the hood. The difference between using PROJ.4 separately rather than through a package such as GDAL is that it enables you to re-project individual points, whereas packages using PROJ.4 do not offer this functionality.

The pyproj package offers two classes: the Proj class and the Geod class. The Proj class performs cartographic computations, while the Geod class performs geodetic computations.
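
The following sketch shows the two pyproj classes described above. The coordinates are illustrative (roughly Berlin and Paris), and the UTM zone is an arbitrary choice for the example.

```python
# Sketch of pyproj's Proj (cartographic) and Geod (geodetic) classes.
from pyproj import Proj, Geod

# Proj: project a lon/lat pair into UTM coordinates (zone chosen for Berlin)
utm = Proj(proj="utm", zone=33, ellps="WGS84")
x, y = utm(13.4050, 52.5200)         # longitude first, then latitude
print(x, y)

# Geod: geodesic distance between two points on the WGS84 ellipsoid
geod = Geod(ellps="WGS84")
_, _, distance_m = geod.inv(13.4050, 52.5200, 2.3522, 48.8566)  # Berlin -> Paris
print(distance_m / 1000)             # roughly 880 km
```
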
Rasterio

Rasterio is a GDAL- and NumPy-based Python library for raster data, written with the Python developer in mind rather than C, using Python language types, protocols, and idioms. Rasterio aims to make GIS data more accessible to Python programmers and helps GIS analysts learn important Python standards. Rasterio relies on concepts of Python rather than GIS.

Rasterio is an open source project from the satellite team of Mapbox, a provider of custom online maps for websites and applications. The name of this library should be pronounced raster-i-o rather than ras-te-rio. Rasterio came into being as a result of a project called the Mapbox Cloudless Atlas, which aimed to create a pretty-looking basemap from satellite imagery. One of the software requirements was to use open source software and a high-level language with handy multi-dimensional array syntax. Although GDAL offers proven algorithms and drivers, developing with GDAL's Python bindings feels a lot like C++. Therefore, Rasterio was designed to be a Python package at the top, with extension modules (using Cython) in the middle, and a GDAL shared library at the bottom. Other requirements for the raster library were the ability to read and write NumPy ndarrays to and from data files, and the use of Python types, protocols, and idioms instead of C or C++, to free programmers from having to code in two languages.

For georeferencing, Rasterio follows the lead of pyproj. There are a couple of capabilities added on top of reading and writing, one of them being a features module. Reprojection of geospatial data can be done with the rasterio.warp module. Rasterio's project homepage can be found on GitHub.

GeoPandas

GeoPandas is a Python library for working with vector data. It is based on the pandas library, which is part of the SciPy stack. SciPy is a popular library for data inspection and analysis, but unfortunately, it cannot read spatial data. GeoPandas was created to fill this gap, taking pandas data objects as a starting point. The library also adds functionality from geographical Python packages.

GeoPandas offers two data objects: a GeoSeries object, based on a pandas Series object, and a GeoDataFrame, based on a pandas DataFrame object but adding a geometry column for each row. Both GeoSeries and GeoDataFrame objects can be used for spatial data processing, similar to spatial databases. Read and write functionality is provided for almost every vector data format. Also, because both objects are subclasses of pandas data objects, you can use the same properties to select or subset data, for example .loc or .iloc.

GeoPandas is a library that employs the capabilities of newer tools, such as Jupyter Notebooks, particularly well. GDAL enables you to interact with data records inside vector and raster datasets through Python code, whereas GeoPandas takes a more visual approach by loading all records into a GeoDataFrame so that you can see them all together on your screen. The same goes for plotting data. These functionalities were lacking in Python 2, when developers were dependent on IDEs without the extensive data visualization capabilities that are now available with Jupyter Notebooks.

We've provided an overview of the most important open source packages for processing and analyzing geospatial data. The question then becomes when to use a certain package and why. GDAL, OGR, and GEOS are indispensable for geospatial processing and analysis, but they were not written in Python, so they require Python bindings for Python developers. Fiona, Shapely, and pyproj were written to solve these problems, as was the newer Rasterio library. For a more Pythonic approach, these newer packages are preferable to the older C++ packages with Python bindings (although they're used under the hood). Now that you have an idea of what options are available for a certain use case and why one package is preferable over another, here's something you should always remember: as is often the way in programming, there might be multiple solutions for one particular problem. For example, when dealing with shapefiles, you could use pyshp, GDAL, Shapely, or GeoPandas, depending on your preference and the problem at hand.

Read next:
Introduction to Data Analysis and Libraries
15 Useful Python Libraries to make your Data Science tasks Easier
"Pandas is an effective tool to explore and analyze data": An interview with Theodore Petrou
Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data

Alteryx vs. Tableau: Choosing the right data analytics tool for your business

Guest Contributor
04 Mar 2019
6 min read
Data visualization is commonly used in the modern world, where most business decisions are made by analyzing data. One of its most significant benefits is that it enables us to visually access huge amounts of data through easily understandable visuals. Data visualization is used in many areas, and popular tools include Tableau, Alteryx, Infogram, ChartBlocks, Datawrapper, Plotly, Visual.ly, and others. Tableau and Alteryx are industry-standard tools that have dominated the data analytics market for a few years now and are still going strong without any serious competition. In this article, we will look at the core differences between Alteryx and Tableau, which will help in deciding which tool to use for which purpose.

Tableau is one of the top-rated tools that helps analysts carry out business intelligence and data visualization activities. Using Tableau, users can generate compelling dashboards and stunning data visualizations. Tableau's interactive user interface helps users quickly generate reports where they can drill down into the information to a granular level.

Alteryx is a powerful tool widely used in data analytics that also provides meaningful insights to executive-level personnel. With its user-friendly interface, users can extract, transform, and load data within the Alteryx tool.

Why use Alteryx with Tableau?

Using Alteryx with Tableau is a powerful combination when it comes to making value-added, data-driven decisions. With Alteryx, businesses can manipulate their data and provide input to the Tableau platform, which in turn can showcase strong data visualizations. This helps businesses take appropriate actions that are backed up by data analysis. Alteryx and Tableau are widely used within organizations where decisions are made based on the insights obtained from data analysis. When it comes to data handling, Alteryx is a powerful ETL platform where data can be analyzed in different formats. When it comes to data representation, Tableau is a perfect match, and reports built in Tableau can be shared across team members.

Nowadays, most businesses want to see real-time data and understand business trends. The combination of Alteryx and Tableau allows data analysts to analyze data and generate meaningful insights for users on the fly. Data analysis is executed within Alteryx, where the raw data is handled, and the data representation or visualization is done in Tableau, so the two tools go hand in hand.

Tableau vs Alteryx

The list below contrasts the two tools point by point.

1. Alteryx is known as a smart data analytics platform; Tableau is known for its data visualization capabilities.
2. Alteryx can connect with different data sources and synthesize the raw data, supporting a standard ETL process; Tableau can connect with different data sources and provide data visualizations within minutes of gathering the data.
3. Alteryx helps with data analysis; Tableau helps with building appealing graphs.
4. Alteryx's GUI is adequate and widely accepted; Tableau's GUI is one of its best features, where graphs can be easily built using drag-and-drop options.
5. Alteryx requires technical knowledge, because it involves data source integration and data blending; Tableau does not, because the data arrives already polished and the user only has to build graphs and visualizations.
6. Once the data blending activity is completed in Alteryx, users can share a file that can be consumed by Tableau; once graphs are prepared in Tableau, reports can be easily shared among team members without any hassle.
7. Alteryx offers a lot of flexibility for data blending; Tableau offers flexibility for data visualization.
8. Alteryx lets users do spatial and predictive analysis; in Tableau this is possible only by representing the data in an appropriate format.
9. Alteryx is one of the best tools for data preparation; preparing data in Tableau is not feasible by comparison.
10. Data representation cannot be done accurately in Alteryx; Tableau is a wonderful tool for data representation.
11. Alteryx is licensed with annual fees; Tableau also has an option to pay monthly.
12. Alteryx has a drag-and-drop interface where the user can develop a workflow easily; Tableau has a drag-and-drop interface where the user can build a visualization in no time.

Alteryx and Tableau Integration

As discussed earlier, these two tools have their own advantages and disadvantages, but when integrated together they can do wonders with the data. The integration between Tableau and Alteryx makes the task of visualizing Alteryx's output quite simple. The data is first loaded into Alteryx and then extracted in the form of .tde files (Tableau Data Extract files). These .tde files are consumed by Tableau for the data visualization part. On a regular basis, new .tde files are generated from Alteryx and replace the old ones. Thus, by integrating Alteryx and Tableau, we can:

Cleanse, combine, and collect all the relevant data sources and enrich them with the help of third-party data, everything in one workflow.
Give analytical context to the data by providing predictive, location-based, and deep spatial analytics.
Publish the analytic workflows' results to Tableau for intuitive, rich visualizations that help in making decisions more quickly.

Tableau and Alteryx do not require any advanced skill set, as both tools have simple drag-and-drop interfaces. You can create a workflow in Alteryx that processes data sequentially; in a similar way, Tableau enables you to build charts by dragging the fields to be used into specified areas. Companies that have a lot of data to analyze, and that can spend significant amounts of money on analytics, can use these two tools together, and there are no significant challenges in integrating Tableau and Alteryx.

Conclusion

When Tableau and Alteryx are used together, businesses benefit because senior management can make decisions based on the data insights these tools provide. The two tools complement each other and provide a high-quality service to businesses.

Author Bio

Savaram Ravindra is a Senior Content Contributor at Mindmajix.com. His passion lies in writing articles on different niches, which include some of the most innovative and emerging software technologies, digital marketing, businesses, and so on. By being a guest blogger, he helps his company acquire quality traffic to its website and build its domain name and search engine authority. Before devoting his work full time to the writing profession, he was a programmer analyst at Cognizant Technology Solutions. Follow him on LinkedIn and Twitter.

Read next:
How to share insights using Alteryx Server
How to do data storytelling well with Tableau [Video]
A tale of two tools: Tableau and Power BI

What does a data science team look like?

Fatema Patrawala
21 Nov 2019
11 min read
Until a couple of years ago, people barely knew the term 'data science', which has now evolved into an extremely popular career field. The Harvard Business Review dubbed the data scientist role the sexiest job of the 21st century, and expert professionals jumped on the 'data is the new oil' bandwagon. As per the Figure Eight Report 2018, which takes the pulse of the data science community in the US, a lot has changed rapidly in the data science field over the years. For the 2018 report, they surveyed approximately 240 data scientists and found that machine learning projects have multiplied and that more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs, and the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, it is evident that data science teams are evolving across organizations.

The data science team is responsible for delivering complex projects where systems analysis, software engineering, data engineering, and data science are used to deliver the final solution. To achieve all of this, the team does not only have a data scientist or a data analyst but also includes other roles like business analyst, data engineer or architect, and chief data officer. In this post, we will differentiate and discuss the various job roles within a data science team, the skill sets required, and the compensation for each one of them. For an in-depth understanding of data science teams, read the book Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps.

Now let's get into understanding individual data science roles and functions, but before that, let's take a look at the structure of the team. There are three basic team structures to match different stages of AI/ML adoption.

IT-centric team structure

At times, hiring a data science team is not an option for companies, and they have to leverage in-house talent. In such situations, they take advantage of the fully functional in-house IT department. The IT team manages functions like data preparation, training models, creating user interfaces, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy them. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold. The documentation helps in fast training and early deployment of models even without expert data scientists on board.

Integrated team structure

Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resources is the most viable option for constant and scalable machine learning operations. Unlike the IT-centric approach, the integrated method requires having an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques. Additionally, the team leverages a deeper understanding of machine learning tools and libraries, like TensorFlow or Theano, which are aimed specifically at researchers and data science experts.

Specialized data science team

Companies can also have an independent data science department to build all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. It doesn't necessarily mean that all team members should have a data science background, but they should have a technology background with certain service management skills. A specialized structure model aids in addressing complex data science tasks that include research, the use of multiple ML models tailored to various aspects of decision-making, or multiple ML-backed services. Today's most successful Silicon Valley tech companies operate with specialized data science teams, custom-built and wired for specific tasks to achieve different business goals. For example, the team structure at Airbnb is one of the most interesting use cases. Martin Daniel, a data scientist at Airbnb, explains in this talk how the team emphasizes an experimentation-centric culture and applies machine learning rigorously to address unique product challenges.

Job roles and responsibilities within a data science team

As discussed earlier, there are many roles within a data science team. As per Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis: individuals involved in Type A work are statisticians who make sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, and so on. Type B stands for building: these individuals use data in production. They're good software engineers with strong programming knowledge and a statistics background, and they build recommendation systems, personalization use cases, and the like. It is rare that one expert fits neatly into a single category, but understanding these data science functions can help make sense of the roles described below.

Chief data officer / Chief analytics officer

The chief data officer (CDO) role has been taking organizations by storm. The NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization had appointed a chief data officer. The role of the chief data officer involves overseeing a range of data-related functions that may include data management, ensuring data quality, and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, they are often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and how a company should treat its data, which makes sense, as analytics provides insights and value to the data. Hence, with a CDO+CAO combination, companies can take advantage of a good data strategy and proper data management without losing quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually.

Skill sets required: Data science and analytics, programming skills, domain expertise, and leadership and visionary abilities.

Data analyst

The data analyst role implies proper data collection and interpretation activities. The person in this job role will ensure that collected data is relevant and exhaustive while also interpreting the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States.

Skill sets required: Programming languages like R, Python, JavaScript, C/C++, and SQL. In addition, critical thinking, data visualization, and presentation skills are good to have.

Data scientist

Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore what problems need to be solved. A data scientist is an individual who develops machine learning models to make predictions and is well versed in algorithm development and computer science. This person will also know the complete lifecycle of model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data and using various types of analytics and reporting tools to detect patterns, trends, and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709.

Skill sets required: A data scientist will require knowledge of big data platforms and tools like Seahorse (powered by Apache Spark), JupyterLab, TensorFlow, and MapReduce; programming languages that include SQL, Python, Scala, and Perl; and statistical computing languages such as R. They should also have cloud computing capabilities and knowledge of various cloud platforms like AWS and Microsoft Azure. You can also read this post on how to ace a data science interview to know more.

Machine learning engineer

At times a data scientist is confused with a machine learning engineer, but the machine learning engineer is a distinct role that involves different responsibilities. A machine learning engineer is someone who is responsible for combining software engineering and machine modeling skills. This person determines which model to use and what data should be used for each model. Probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer's salary is $146,085 in the US, and the role is ranked No. 1 on Indeed's Best Jobs in 2019 list.

Skill sets required: Machine learning engineers are required to have expertise in computer science and programming languages like R, Python, Scala, and Java. They also need probability techniques and data modelling and evaluation techniques.

Data architects and data engineers

The data architect and data engineer work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer can use to build the digital framework. The data engineering role has recently evolved from the traditional software-engineering field. Recent enterprise data management experiments indicate that data-focused software engineers are needed to work along with data architects to build a strong data architecture. The average salary for a data architect in the US ranges from $122,000 to $129,000 annually, as per a recent LinkedIn survey.

Skill sets required: A data architect or engineer should have a keen interest and experience in programming languages and frameworks like HTML5, RESTful services, Spark, Python, Hive, Kafka, and CSS. They should have the knowledge and experience to handle database technologies such as PostgreSQL, MapReduce, and MongoDB, and visualization platforms such as Tableau and Spotfire.

Business analyst

A business analyst (BA) basically handles the chief analytics officer's role but at the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap. They are responsible for using data analytics to assess processes, determine requirements, and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes will be implemented to processes, products, services, software, and hardware. They further articulate these ideas and balance them against what is technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed.

Skill sets required: Excellent domain and industry expertise. In addition, good communication and data visualization skills and knowledge of business intelligence tools are good to have.

Data visualization engineer

This specific role is not present in every data science team, as some of the responsibilities are covered by either a data analyst or a data architect. Hence, this role is only necessary for a specialized data science model. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for stakeholders. Regardless of the technology, successful data visualization engineers have to understand the principles of design, both graphical and, more generally, user-centered design. As per PayScale, the average salary for a data visualization engineer is $98,264.

Skill sets required: A data visualization engineer needs rigorous knowledge of data visualization methods and must be able to produce various charts and graphs to represent data. Additionally, they must understand the fundamentals of design principles and the visual display of information.

To sum it up, the data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building the team from scratch and find it hard to figure out where to start. If you are facing a similar dilemma, check out the book Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing, and sustaining a successful data science team.

Read next:
How to learn data science: from data mining to machine learning
How to ace a data science interview
Data science vs. machine learning: understanding the difference and what it means today
30 common data science terms explained
9 Data Science Myths Debunked

How everyone at Netflix uses Jupyter notebooks from data scientists, machine learning engineers, to data analysts

Bhagyashree R
18 Aug 2018
4 min read
Netflix uses a variety of tools to do data analysis. One of the big ways that data scientists and engineers at Netflix interact with their data is through Jupyter notebooks. In addition to providing execution environments to users, Netflix invests in various parts of the Jupyter ecosystem and tooling. They are “reimagining what a notebook can be, who can use it, and what they can do with it.”

Netflix aims to provide personalized content to its 130 million viewers, and to do this, more than 1 trillion events are written into a streaming ingestion pipeline every day. To support this, they've built an industry-leading data platform which is flexible, powerful, and complex. There are many diverse users of this platform, such as analytics engineers, data engineers, and data scientists, requiring different sets of tools and languages. To help the platform scale, they wanted to minimize the number of tools, and the solution was the open-source tool: Jupyter notebooks.

Why are Jupyter notebooks so compelling for Netflix?

These are the notebook capabilities that benefit Netflix's data scientists and engineers:

Standard messaging API: The Jupyter protocol provides a standard messaging API to the kernels that act as computational engines. It separates where the content is written from where the content is executed, which makes it language agnostic.
Editable file format: It provides an editable file format that stores the code and results together.
Web-based UI: It is web-based, which helps with interactively writing and running code as well as visualizing outputs.

How does Netflix use Jupyter notebooks?

The following are some of the use cases Netflix uses Jupyter notebooks for:

Data access: Notebooks were first introduced for workflows, and their adoption grew among the data scientists. Seeing this, Netflix decided to leverage their versatility and architecture for general data access. Notebooks provide a user-friendly interface for interactively running code, exploring the outputs, and visualizing data, all from a single cloud-based development environment.

Notebook templates: They introduced parameterized notebooks, which allow the use of parameters in the code and take values as input at runtime; a short sketch of the pattern follows this list. These templates help:
Data scientists to run an experiment with different coefficients and summarize the results
Data engineers to execute data quality audits
Data analysts to share prepared queries and visualizations
Software engineers to email the results of a troubleshooting script
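
Netflix's internal template tooling is not spelled out in this excerpt, but the open-source papermill library implements the same parameterized-notebook pattern. The sketch below is illustrative only; the notebook names and parameters are made up.

```python
# Sketch of parameterized notebook execution with the open-source papermill
# library; the notebook names and parameter values here are hypothetical.
import papermill as pm

pm.execute_notebook(
    "region_report_template.ipynb",    # template notebook with a "parameters" cell
    "region_report_us_2018-08.ipynb",  # executed copy written out with the results
    parameters={"region": "US", "start_date": "2018-08-01", "days": 30},
)
```
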
Scheduling notebooks: Next, they are using notebooks to create a unifying layer for scheduling workflows. Notebooks are used for interactive work and allow a smooth move to scheduling that work to run recurrently. Many users create an entire workflow in a notebook and just copy/paste it into separate files for scheduling when they're ready to deploy it.

Notebook infrastructure: The three fundamental components of the infrastructure are storage, compute, and interface. (The original post includes an architecture diagram; source: Netflix Tech Blog.)

Storage: The Netflix Data Platform is made up of Amazon S3 and EFS for cloud storage, which notebooks treat as virtual filesystems. Each user has a home directory on EFS containing a personal workspace for notebooks. This workspace stores any notebook created or uploaded by a user. When a user launches a notebook interactively, all the reading and writing happens in the workspace.

Compute: All the jobs on the data platform run in containers, including queries, pipelines, and notebooks. A container with reasonable default resources is provisioned when a user launches a notebook, and users can request more resources if they find that the provided resources are not enough. A unified execution environment with a prepared container image is provided, which has common libraries and an array of default kernels preinstalled. The orchestration and environments are managed with Titus, Netflix's container management platform.

Interface: They are using nteract, a React-based frontend for Jupyter notebooks, which emphasizes simplicity and composability as core design principles. They're also introducing native support for parameterization, which makes it easier to schedule notebooks and create reusable templates.

Netflix is planning to make investments in both the frontend and backend to improve the overall notebook experience. This year they are also sponsoring JupyterCon. To read more about how Jupyter is offering value to Netflix, read Netflix's original post on Medium.

Read next:
10 reasons why data scientists love Jupyter notebooks
What's new in Jupyter Notebook 5.3.0
Netflix open sources Zuul 2 cloud gateway

8 ways to improve your data visualizations

Natasha Mathur
04 Jul 2018
7 min read
In Dr. W. Edwards Deming's words, “In God we trust, all others must bring data.” Organizations worldwide revolve around data like planets revolve around the sun. Since data is so central to organizations, there are data visualization tools that help them understand data to make better business decisions. A lot more data is getting churned out and collected by organizations than ever before. So, how do we make sense of all this data? Humans are visual creatures, and the human brain processes visual information far better than textual information. In fact, presentations that use visual aids such as colors, shapes, and images are found to be far more persuasive, according to research done by the University of Minnesota back in 1986.

Data visualization is a process that easily translates collected information into engaging visuals. It's easy, cheap, and doesn't require any design expertise to create data visuals. However, some professionals feel that data visualization is just about slapping on charts and graphs, when that's not actually the case. Data visualization is about conveying the right information in a way that enhances the audience's experience. So, if you want your graphs and charts to be more succinct and understandable, here are eight ways to improve your data visualization process.

1. Get rid of unneeded information

Less is more in some cases, and the same goes for data visualization. Excessive colors, jargon, pie charts, and metrics take the focus away from the important information. For instance, when using colors, don't make your charts and graphs a rainbow; instead, use a specific set of colors with a clear purpose and meaning. (The original article illustrates the difference with before-and-after charts; source: Podio.) Similarly, when it comes to expressing your data, note how people interact at your workplace and keep the tone of your visuals as natural as possible to make it easy for the audience to interpret your data. For metrics, only show the ones that truly bring value to your storytelling, and filter out the ones that are not so important to create less fuss. Tread cautiously while using pie charts, as they can sometimes be difficult to understand, and get rid of the elements on a chart that cause unnecessary confusion. (Source: Dashboard Zone.)

2. Use conditional formatting for tabular data

Data visualization doesn't need fancy tools or designs; take your standard Excel table, for example. Do you want to point out patterns or outliers in your data? Conditional formatting is a great tool for people working with data. It involves making simple rules on given data, and once that's done, it highlights only the data that matters the most to you. This helps you quickly track the main information. Conditional formatting can be used for different things; for example, it can help spot duplicate data in your table. You set bounds for the data using the built-in conditional formatting, and it then formats the cells based on those bounds, highlighting the data you want. For instance, if a sales quota of over 65% is good, between 55% and 65% is average, and below 55% is poor, then with conditional formatting you can quickly find out who is meeting the expected sales quota and who is not.
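
As a rough analogue of these Excel rules in Python, the sketch below applies the same good/average/poor thresholds to a small, made-up table using pandas' Styler; the colour codes and output file name are arbitrary choices.

```python
# Sketch of rule-based highlighting with pandas, mirroring the Excel-style
# conditional-formatting rules described above; the quota data is invented.
import pandas as pd

reps = pd.DataFrame({"rep": ["Ana", "Ben", "Chloe", "Dev"],
                     "quota_pct": [72, 61, 48, 66]})

def colour_quota(value):
    # rule: good > 65%, average 55-65%, poor < 55%
    if value > 65:
        return "background-color: #c6efce"   # green
    if value >= 55:
        return "background-color: #ffeb9c"   # amber
    return "background-color: #ffc7ce"       # red

styled = reps.style.applymap(colour_quota, subset=["quota_pct"])
with open("quota_report.html", "w") as f:    # render the highlighted table
    f.write(styled.to_html())
```
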
3. Add trendlines to unearth patterns for prediction

Another feature that can amp up your data visualization is trendlines. They show the relationship between two variables in your existing data and are also useful for predicting future values. Trendlines are simple to add and help discover trends in the given data set. (Source: Interworks.) They can also show data trends or moving averages in your charts. Depending on the kind of data you're working with, there are a number of trendline types you can use in your visualizations. Questions like whether a new strategy seems to be working in favor of the organization can be answered with the help of trendlines, and this insight, in turn, helps predict new outcomes for the future. Statistical models are used in trendlines to make predictions. Once you add trendlines to a view, it's up to you to decide how you want them to look and behave.
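
A trendline of this kind can be sketched in Python with NumPy and Matplotlib; the monthly sales figures below are invented purely for illustration.

```python
# Sketch of adding a simple linear trendline to a scatter plot (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(1, 13)
sales = np.array([210, 225, 218, 240, 255, 248, 262, 270, 268, 285, 290, 301])

slope, intercept = np.polyfit(months, sales, 1)   # fit a degree-1 trendline
trend = slope * months + intercept

plt.scatter(months, sales, label="monthly sales")
plt.plot(months, trend, label=f"trend: {slope:.1f} per month")
plt.legend()
plt.savefig("sales_trend.png")
```
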
4. Implement filter by rule to get more specific

Filters display just the information that you need. Using filter by rule, you can add a filter option to your dataset. Organizations produce huge amounts of data on a regular basis. Suppose you want to know which employees within your organization are consistent performers: instead of creating a visualization that includes all the employees and their performance, you can filter it down so that it shows only the employees who are always doing well. Similarly, if you want to find out on which days sales went up or down, you can filter the view to show results for only the past week or month, depending on your preference.

5. For complex or dense data representation, add hierarchy

Hierarchies eliminate the need to create extra visualizations. You can view data from a high level and dig deeper into the specifics of the data as you come up with questions based on it. Adding a hierarchy to the data helps combine multiple pieces of information in one visualization. (Source: DZone.) For instance, suppose you create a hierarchy that shows the total sales achieved by different sales representatives within an organization in the past month. You can break this down further by selecting a particular sales rep, and then go even further by selecting a specific product assigned to that sales rep. This cuts down on a lot of extra work.

6. Make visuals more appealing by formatting data

Data formatting takes only a few seconds, but it can make a huge difference when it comes to the audience interpreting your data. (Source: DZone.) It makes the numbers appear more visually appealing and easier to read, and it can be used for charts such as bar charts and column charts. Formatting data to show a certain number of decimals, comma separators, number fonts, currency, or percentages can make your visualization process more engaging.

7. Include comparison for more insight

Comparisons give readers a better perspective on the data, and they can both improve and add insight to your visualizations. For instance, if you want to inform your audience about the organization's growth in the current as well as the past year, you can include the comparison within the visualization. You can also use a comparison chart to compare two data points, such as budget versus actual spend.

8. Sort data to improve readability

Sorting data is another great way to make things easy for the audience when dealing with huge quantities of data. For instance, if you want to include information about the highest and lowest performing products, you can sort your data. Sorting can be done in the following ways:
Ascending - sorts the data from lowest to highest.
Descending - sorts the data from highest to lowest.
Data source order - sorts the data in the order it appears in the data source.
Alphabetic - sorts the data alphabetically.
Manual - lets you sort the data manually in the order you prefer.

Effective data visualization helps people interpret information in data that could not be seen before, to change their minds and prompt action. These were some of the tricks and features to take your data visualization game to the next level. There are different data visualization tools available in the market to choose from; Tableau and Microsoft Power BI are among the top ones that offer great features for data visualization. So, now that we've got you covered with some of the best practices for data visualization, it's your turn to put these tips into practice and create some strong visual data stories. Do you have any DataViz tips to share with our readers? Please add them in the comments below.

Read next:
Getting started with Data Visualization in Tableau
What is Seaborn and why should you use it for data visualization?
"Tableau is the most powerful and secure end-to-end analytics platform": An interview with Joshua Milligan

5 best practices to perform data wrangling with Python

Savia Lobo
18 Oct 2018
5 min read
Data wrangling is the process of cleaning and structuring complex data sets for easy analysis and speedy decision-making. Due to the internet explosion and the huge trove of IoT devices, there is a massive availability of data at present. However, this data is most often in its raw form and includes a lot of noise in the form of unnecessary data, broken data, and so on. Cleaning up this data is essential in order for organizations to use it for analysis. Data wrangling plays a very important role here by cleaning this data and making it fit for analysis. Python also has built-in features for applying wrangling methods to various data sets to achieve an analytical goal. Here are five best practices that will help you in your data wrangling journey with Python; by the end, you'll have clean, ready-to-use data for your business needs.

1. Learn the data structures in Python really well

Designed to be a very high-level language, Python offers an array of amazing data structures with great built-in methods. Having a solid grasp of all their capabilities will be a potent weapon in your repertoire for handling data wrangling tasks. For example, a dictionary in Python can act almost like a mini in-memory database with key-value pairs, supporting extremely fast retrieval and search by utilizing a hash table underneath. Explore other built-in libraries related to these data structures, e.g. ordered dict and the string library for advanced functions. Build your own versions of essential data structures like stacks, queues, heaps, and trees using classes and basic structures, and keep them handy for quick data retrieval and traversal.

2. Learn and practice file and OS handling in Python

How to open and manipulate files
How to manipulate and navigate directory structures

3. Have a solid understanding of the core data types and capabilities of NumPy and pandas

Learn how to create, access, sort, and search a NumPy array. Always think about whether you can replace a conventional list traversal (a for loop) with a vectorized operation; this will increase the speed of your data operations. Explore special file types like .npy (NumPy's native storage) to read large data sets with much higher speed than a usual list. Know in detail all the file types you can read using built-in pandas methods; this will greatly simplify your data scraping. Almost all of these methods have great data cleaning and other checks built in, so try to use such optimized routines instead of writing your own to speed up the process.
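
As a small illustration of the points above, the following sketch (with synthetic data) contrasts a Python loop with a vectorized NumPy expression and round-trips the result through the .npy format.

```python
# Sketch: loop vs. vectorized NumPy operation, plus .npy storage (synthetic data).
import numpy as np

values = np.random.default_rng(42).normal(size=1_000_000)

# loop version (slow): scale and shift each element one at a time
scaled_loop = np.empty_like(values)
for i in range(len(values)):
    scaled_loop[i] = values[i] * 2.5 + 10

# vectorized version (fast): one expression over the whole array
scaled_vec = values * 2.5 + 10

assert np.allclose(scaled_loop, scaled_vec)

np.save("scaled.npy", scaled_vec)      # NumPy's fast native storage format
reloaded = np.load("scaled.npy")
print(reloaded.mean())
```
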
4. Build a good understanding of basic statistical tests and a panache for visualization

Running some standard statistical tests can quickly give you an idea about the quality of the data you need to wrangle. Plot data often, even if it is multi-dimensional. Do not try to create fancy 3D plots; learn to explore a simple set of pairwise scatter plots. Use boxplots often to see the spread and range of the data and to detect outliers. For time-series data, learn the basic concepts of ARIMA modeling to check the sanity of the data.

5. Apart from Python, if you want to master one more language, go for SQL

As a data engineer, you will inevitably run across situations where you have to read from a large, conventional database store. Even if you use a Python interface to access such a database, it is always a good idea to know the basic concepts of database management and relational algebra. This knowledge will help you build on it later and move into the world of Big Data and massive data mining (technologies like Hadoop/Pig/Hive/Impala) more easily, and your basic data wrangling knowledge will surely help you deal with such scenarios.

Although data wrangling may be the most time-consuming process, it is the most important part of data management. Data collected by businesses on a daily basis helps them make decisions on the latest information available. It also allows businesses to find hidden insights, use them in decision-making processes, and gain new analytic initiatives, improved reporting efficiency, and much more.

About the authors

Dr. Tirthajyoti Sarkar works in the San Francisco Bay Area as a senior semiconductor technologist, where he designs state-of-the-art power management products and applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He has 15+ years of R&D experience and is a senior member of IEEE.

Shubhadeep Roychowdhury works as a Sr. Software Engineer at a Paris-based cyber security startup, where he applies state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products.

Read next:
Data cleaning is the worst part of data analysis, say data scientists
Python, Tensorflow, Excel and more – Data professionals reveal their top tools
Manipulating text data using Python Regular Expressions (regex)

30 common data science terms explained

Aarthi Kumaraswamy
16 May 2018
27 min read
Let's begin at the beginning. What do terms like statistical population, statistical comparison, and statistical inference mean? What good are munging, coding, booting, regularization, and so on? On a scale of 1 to 30 (1 being the lowest and 30 the highest), rate yourself as a data scientist. No matter what you have scored yourself, we hope to have improved that score at least a little by the end of this post. Let's start with a basic question: what is data science?

Note: The following is an excerpt from the book Statistics for Data Science, written by James D. Miller and published by Packt Publishing.

How data science is defined is a matter of opinion. I personally like the explanation that data science is a progression or, even better, an evolution of thought or steps (illustrated by a figure in the book, not reproduced here). Although a progression or evolution implies a sequential journey, in practice this is an extremely fluid process; each of the phases may inspire the data scientist to reverse and repeat one or more of the phases until they are satisfied. In other words, all or some phases of the process may be repeated until the data scientist determines that the desired outcome is reached. Depending on your sources and individual beliefs, you may even say the following: statistics is data science, and data science is statistics.

Based upon personal experience, research, and various industry experts' advice, someone delving into the art of data science should take every opportunity to understand and gain experience as well as proficiency with the following list of common data science terms:

Statistical population
Probability
False positives
Statistical inference
Regression
Fitting
Categorical data
Classification
Clustering
Statistical comparison
Coding
Distributions
Data mining
Decision trees
Machine learning
Munging and wrangling
Visualization
D3
Regularization
Assessment
Cross-validation
Neural networks
Boosting
Lift
Mode
Outlier
Predictive modeling
Big data
Confidence interval
Writing

Statistical population

You can perhaps think of a statistical population as a recordset (or a set of records). This set or group of records will be of similar items or events that are of interest to the data scientist for some experiment. For a data developer, a population of data may be a recordset of all sales transactions for a month, and the interest might be reporting to the senior management of an organization which products are the fastest sellers and at which time of the year. For a data scientist, a population may be a recordset of all emergency room admissions during a month, and the area of interest might be to determine the statistical demographics for emergency room use.

Note: Typically, the terms statistical population and statistical model are or can be used interchangeably. Once again, data scientists continue to evolve in their alignment on the use of common terms.

Another key point concerning statistical populations is that the recordset may be a group of (actually) existing objects or a hypothetical group of objects. Using the preceding example, you might draw a comparison of actual objects as those actual sales transactions recorded for the month, while the hypothetical objects are sales transactions that are expected, forecast, or presumed (based upon observations or experienced assumptions or other logic) to occur during a month.

Finally, through the use of statistical inference, the data scientist can select a portion or subset of the recordset (or population) with the intention that it will represent the total population for a particular area of interest. This subset is known as a statistical sample. If a sample of a population is chosen accurately, characteristics of the entire population (that the sample is drawn from) can be estimated from the corresponding characteristics of the sample.
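
As a minimal illustration (with synthetic transaction data), drawing a statistical sample from a population recordset might look like this in pandas:

```python
# Sketch of drawing a statistical sample from a "population" recordset;
# the transaction data here is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
population = pd.DataFrame({
    "transaction_id": np.arange(10_000),
    "amount": rng.gamma(shape=2.0, scale=50.0, size=10_000),
})

sample = population.sample(n=500, random_state=0)   # the statistical sample

# if the sample is representative, its characteristics estimate the population's
print(population["amount"].mean(), sample["amount"].mean())
```
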
Finally, through the use of statistical inference, the data scientist can select a portion or subset of the recordset (or population) with the intention that it will represent the total population for a particular area of interest. This subset is known as a statistical sample. If a sample of a population is chosen accurately, characteristics of the entire population (that the sample is drawn from) can be estimated from the corresponding characteristics of the sample. Probability Probability is concerned with the laws governing random events.                                           -www.britannica.com When thinking of probability, you think of possible upcoming events and the likelihood of them actually occurring. This compares to a statistical thought process that involves analyzing the frequency of past events in an attempt to explain or make sense of the observations. In addition, the data scientist will associate various individual events, studying the relationship of these events. How these different events relate to each other governs the methods and rules that will need to be followed when we're studying their probabilities. [box type="note" align="" class="" width=""]A probability distribution is a table that is used to show the probabilities of various outcomes in a sample population or recordset. [/box] False positives The idea of false positives is a very important statistical (data science) concept. A false positive is a mistake or an errored result. That is, it is a scenario where the results of a process or experiment indicate a fulfilled or true condition when, in fact, the condition is not true (not fulfilled). This situation is also referred to by some data scientists as a false alarm and is most easily understood by considering the idea of a recordset or statistical population (which we discussed earlier in this section) that is determined not only by the accuracy of the processing but by the characteristics of the sampled population. In other words, the data scientist has made errors during the statistical process, or the recordset is a population that does not have an appropriate sample (or characteristics) for what is being investigated. Statistical inference What developer at some point in his or her career, had to create a sample or test data? For example, I've often created a simple script to generate a random number (based upon the number of possible options or choices) and then used that number as the selected option (in my test recordset). This might work well for data development, but with statistics and data science, this is not sufficient. To create sample data (or a sample population), the data scientist will use a process called statistical inference, which is the process of deducing options of an underlying distribution through analysis of the data you have or are trying to generate for. The process is sometimes called inferential statistical analysis and includes testing various hypotheses and deriving estimates. When the data scientist determines that a recordset (or population) should be larger than it actually is, it is assumed that the recordset is a sample from a larger population, and the data scientist will then utilize statistical inference to make up the difference. [box type="note" align="" class="" width=""]The data or recordset in use is referred to by the data scientist as the observed data. 
Inferential statistics can be contrasted with descriptive statistics, which is only concerned with the properties of the observed data and does not assume that the recordset came from a larger population. [/box] Regression Regression is a process or method (selected by the data scientist as the best fit technique for the experiment at hand) used for determining the relationships among variables. If you're a programmer, you have a certain understanding of what a variable is, but in statistics, we use the term differently. Variables are determined to be either dependent or independent. An independent variable (also known as a predictor) is the one that is manipulated by the data scientist in an effort to determine its relationship with a dependent variable. A dependent variable is a variable that the data scientist is measuring. [box type="note" align="" class="" width=""]It is not uncommon to have more than one independent variable in a data science progression or experiment. [/box] More precisely, regression is the process that helps the data scientist comprehend how the typical value of the dependent variable (or criterion variable) changes when any one or more of the independent variables is varied while the other independent variables are held fixed. Fitting Fitting is the process of measuring how well a statistical model or process describes a data scientist's observations pertaining to a recordset or experiment. These measures will attempt to point out the discrepancy between observed values and probable values. The probable values of a model or process are known as a distribution or a probability distribution. Therefore, a probability distribution fitting (or distribution fitting) is when the data scientist fits a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. The object of a data scientist performing a distribution fitting is to predict the probability or to forecast the frequency of, the occurrence of the phenomenon at a certain interval. [box type="note" align="" class="" width=""]One of the most common uses of fitting is to test whether two samples are drawn from identical distributions.[/box] There are numerous probability distributions a data scientist can select from. Some will fit better to the observed frequency of the data than others will. The distribution giving a close fit is supposed to lead to good predictions; therefore, the data scientist needs to select a distribution that suits the data well. Categorical data Earlier, we explained how variables in your data can be either independent or dependent. Another type of variable definition is a categorical variable. This type of variable is one that can take on one of a limited, and typically fixed, number of possible values, thus assigning each individual to a particular category. Often, the collected data's meaning is unclear. Categorical data is a method that a data scientist can use to put meaning to the data. For example, if a numeric variable is collected (let's say the values found are 4, 10, and 12), the meaning of the variable becomes clear if the values are categorized. Let's suppose that based upon an analysis of how the data was collected, we can group (or categorize) the data by indicating that this data describes university students, and there is the following number of players: 4 tennis players 10 soccer players 12 football players Now, because we grouped the data into categories, the meaning becomes clear. 
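To make the grouping idea concrete, here is a minimal sketch using pandas (the library choice is our own for illustration; the article prescribes no particular tool) that encodes the hypothetical sports data above as a categorical variable and counts each category:

```python
import pandas as pd

# Hypothetical survey of 26 university students and the sport they play
players = ["tennis"] * 4 + ["soccer"] * 10 + ["football"] * 12

# Storing the values as a pandas Categorical makes the grouping explicit
sport = pd.Series(players, dtype="category")

print(sport.cat.categories)   # the categories found in the data
print(sport.value_counts())   # football 12, soccer 10, tennis 4
```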
Some other examples of categorized data might be individual pet preferences (grouped by the type of pet), or vehicle ownership (grouped by the style of a car owned), and so on. So, categorical data, as the name suggests, is data grouped into some sort of category or multiple categories. Some data scientists refer to categories as sub-populations of data. [box type="note" align="" class="" width=""]Categorical data can also be data that is collected as a yes or no answer. For example, hospital admittance data may indicate that patients either smoke or do not smoke. [/box] Classification Statistical classification of data is the process of identifying which category (discussed in the previous section) a data point, observation, or variable should be grouped into. The data science process that carries out a classification process is known as a classifier. Read this post: Classification using Convolutional Neural Networks [box type="note" align="" class="" width=""]Determining whether a book is fiction or non-fiction is a simple example classification. An analysis of data about restaurants might lead to the classification of them among several genres. [/box] Clustering Clustering is the process of dividing up the data occurrences into groups or homogeneous subsets of the dataset, not a predetermined set of groups as in classification (described in the preceding section) but groups identified by the execution of the data science process based upon similarities that it found among the occurrences. Objects in the same group (a group is also referred to as a cluster) are found to be more analogous (in some sense or another) to each other than to those objects found in other groups (or found in other clusters). The process of clustering is found to be very common in exploratory data mining and is also a common technique for statistical data analysis. Statistical comparison Simply put, when you hear the term statistical comparison, one is usually referring to the act of a data scientist performing a process of analysis to view the similarities or variances of two or more groups or populations (or recordsets). As a data developer, one might be familiar with various utilities such as FC Compare, UltraCompare, or WinDiff, which aim to provide the developer with a line-by-line comparison of the contents of two or more (even binary) files. In statistics (data science), this process of comparing is a statistical technique to compare populations or recordsets. In this method, a data scientist will conduct what is called an Analysis of Variance (ANOVA), compare categorical variables (within the recordsets), and so on. [box type="note" align="" class="" width=""]ANOVA is an assortment of statistical methods that are used to analyze the differences among group means and their associated procedures (such as variations among and between groups, populations, or recordsets). This method eventually evolved into the Six Sigma dataset comparisons. [/box] Coding Coding or statistical coding is again a process that a data scientist will use to prepare data for analysis. In this process, both quantitative data values (such as income or years of education) and qualitative data (such as race or gender) are categorized or coded in a consistent way. 
Coding is performed by a data scientist for various reasons such as follows: More effective for running statistical models Computers understand the variables Accountability--so the data scientist can run models blind, or without knowing what variables stand for, to reduce programming/author bias [box type="shadow" align="" class="" width=""]You can imagine the process of coding as the means to transform data into a form required for a system or application. [/box] Distributions The distribution of a statistical recordset (or of a population) is a visualization showing all the possible values (or sometimes referred to as intervals) of the data and how often they occur. When a distribution of categorical data (which we defined earlier in this chapter) is created by a data scientist, it attempts to show the number or percentage of individuals in each group or category. Linking an earlier defined term with this one, a probability distribution, stated in simple terms, can be thought of as a visualization showing the probability of occurrence of different possible outcomes in an experiment. Data mining With data mining, one is usually more absorbed in the data relationships (or the potential relationships between points of data, sometimes referred to as variables) and cognitive analysis. To further define this term, we can say that data mining is sometimes more simply referred to as knowledge discovery or even just discovery, based upon processing through or analyzing data from new or different viewpoints and summarizing it into valuable insights that can be used to increase revenue, cuts costs, or both. Using software dedicated to data mining is just one of several analytical approaches to data mining. Although there are tools dedicated to this purpose (such as IBM Cognos BI and Planning Analytics, Tableau, SAS, and so on.), data mining is all about the analysis process finding correlations or patterns among dozens of fields in the data and that can be effectively accomplished using tools such as MS Excel or any number of open source technologies. [box type="note" align="" class="" width=""]A common technique to data mining is through the creation of custom scripts using tools such as R or Python. In this way, the data scientist has the ability to customize the logic and processing to their exact project needs. [/box] Decision trees A statistical decision tree uses a diagram that looks like a tree. This structure attempts to represent optional decision paths and a predicted outcome for each path selected. A data scientist will use a decision tree to support, track, and model decision making and their possible consequences, including chance event outcomes, resource costs, and utility. It is a common way to display the logic of a data science process. Machine learning Machine learning is one of the most intriguing and exciting areas of data science. It conjures all forms of images around artificial intelligence which includes Neural Networks, Support Vector Machines (SVMs), and so on. Fundamentally, we can describe the term machine learning as a method of training a computer to make or improve predictions or behaviors based on data or, specifically, relationships within that data. Continuing, machine learning is a process by which predictions are made based upon recognized patterns identified within data, and additionally, it is the ability to continuously learn from the data's patterns, therefore continuingly making better predictions. 
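As a rough illustration of both ideas just described - a decision tree and the train-then-predict cycle of machine learning - here is a minimal, hedged sketch using scikit-learn; the bundled iris dataset simply stands in for whatever recordset you happen to be working with:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, well-known sample dataset stands in for a real recordset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The decision tree learns simple branching rules from the training records...
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# ...and those learned rules are then used to predict previously unseen records
print("held-out accuracy:", model.score(X_test, y_test))
```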
It is not uncommon for someone to mistake the process of machine learning for data mining, but data mining focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can be used to learn and establish baseline behavioral profiles for various entities and then to find meaningful anomalies. Here is the exciting part: the process of machine learning (using data relationships to make predictions) is known as predictive analytics. Predictive analytics allow the data scientists to produce reliable, repeatable decisions and results and uncover hidden insights through learning from historical relationships and trends in the data. Munging and wrangling The terms munging and wrangling are buzzwords or jargon meant to describe one's efforts to affect the format of data, recordset, or file in some way in an effort to prepare the data for continued or otherwise processing and/or evaluations. With data development, you are most likely familiar with the idea of Extract, Transform, and Load (ETL). In somewhat the same way, a data developer may mung or wrangle data during the transformation steps within an ETL process. Common munging and wrangling may include removing punctuation or HTML tags, data parsing, filtering, all sorts of transforming, mapping, and tying together systems and interfaces that were not specifically designed to interoperate. Munging can also describe the processing or filtering of raw data into another form, allowing for more convenient consumption of the data elsewhere. Munging and wrangling might be performed multiple times within a data science process and/or at different steps in the evolving process. Sometimes, data scientists use munging to include various data visualization, data aggregation, training a statistical model, as well as much other potential work. To this point, munging and wrangling may follow a flow beginning with extracting the data in a raw form, performing the munging using various logic, and lastly, placing the resulting content into a structure for use. Although there are many valid options for munging and wrangling data, preprocessing and manipulation, a tool that is popular with many data scientists today is a product named Trifecta, which claims that it is the number one (data) wrangling solution in many industries. [box type="note" align="" class="" width=""]Trifecta can be downloaded for your personal evaluation from https://www.trifacta.com/. Check it out! [/box] Visualization The main point (although there are other goals and objectives) when leveraging a data visualization technique is to make something complex appear simple. You can think of visualization as any technique for creating a graphic (or similar) to communicate a message. Other motives for using data visualization include the following: To explain the data or put the data in context (which is to highlight demographic statistics) To solve a specific problem (for example, identifying problem areas within a particular business model) To explore the data to reach a better understanding or add clarity (such as what periods of time do this data span?) 
To highlight or illustrate otherwise invisible data (such as isolating outliers residing in the data) To predict, such as potential sales volumes (perhaps based upon seasonality sales statistics) And others Statistical visualization is used in almost every step in the data science process, within the obvious steps such as exploring and visualizing, analyzing and learning, but can also be leveraged during collecting, processing, and the end game of using the identified insights. D3 D3 or D3.js, is essentially an open source JavaScript library designed with the intention of visualizing data using today's web standards. D3 helps put life into your data, utilizing Scalable Vector Graphics (SVG), Canvas, and standard HTML. D3 combines powerful visualization and interaction techniques with a data-driven approach to DOM manipulation, providing data scientists with the full capabilities of modern browsers and the freedom to design the right visual interface that best depicts the objective or assumption. In contrast to many other libraries, D3.js allows inordinate control over the visualization of data. D3 is embedded within an HTML webpage and uses pre-built JavaScript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects, and so on. Regularization Regularization is one possible approach that a data scientist may use for improving the results generated from a statistical model or data science process, such as when addressing a case of overfitting in statistics and data science. [box type="note" align="" class="" width=""]We defined fitting earlier (fitting describes how well a statistical model or process describes a data scientist's observations). Overfitting is a scenario where a statistical model or process seems to fit too well or appears to be too close to the actual data.[/box] Overfitting usually occurs with an overly simple model. This means that you may have only two variables and are drawing conclusions based on the two. For example, using our previously mentioned example of daffodil sales, one might generate a model with temperature as an independent variable and sales as a dependent one. You may see the model fail since it is not as simple as concluding that warmer temperatures will always generate more sales. In this example, there is a tendency to add more data to the process or model in hopes of achieving a better result. The idea sounds reasonable. For example, you have information such as average rainfall, pollen count, fertilizer sales, and so on; could these data points be added as explanatory variables? [box type="note" align="" class="" width=""]An explanatory variable is a type of independent variable with a subtle difference. When a variable is independent, it is not affected at all by any other variables. When a variable isn't independent for certain, it's an explanatory variable. [/box] Continuing to add more and more data to your model will have an effect but will probably cause overfitting, resulting in poor predictions since it will closely resemble the data, which is mostly just background noise. To overcome this situation, a data scientist can use regularization, introducing a tuning parameter (additional factors such as a data points mean value or a minimum or maximum limitation, which gives you the ability to change the complexity or smoothness of your model) into the data science process to solve an ill-posed problem or to prevent overfitting. 
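Here is a minimal sketch of that tuning-parameter idea, using ridge regression from scikit-learn on an invented daffodil-sales dataset (both the library choice and the data are our own illustration, not something prescribed in the original text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Invented example: temperature plus several noisy "explanatory" columns
temperature = rng.uniform(5, 30, size=60)
noise_columns = rng.normal(size=(60, 8))        # stand-ins for rainfall, pollen count, ...
X = np.column_stack([temperature, noise_columns])
sales = 3.0 * temperature + rng.normal(scale=5, size=60)

plain = LinearRegression().fit(X, sales)
regularized = Ridge(alpha=10.0).fit(X, sales)    # alpha is the tuning parameter

# The penalty shrinks the coefficients of the noise columns toward zero,
# which is the smoothing effect regularization is meant to provide
print(np.round(plain.coef_, 2))
print(np.round(regularized.coef_, 2))
```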
Assessment When a data scientist evaluates a model or data science process for performance, this is referred to as assessment. Performance can be defined in several ways, including the model's growth of learning or the model's ability to improve (with) learning (to obtain a better score) with additional experience (for example, more rounds of training with additional samples of data) or accuracy of its results. One popular method of assessing a model or processes performance is called bootstrap sampling. This method examines performance on certain subsets of data, repeatedly generating results that can be used to calculate an estimate of accuracy (performance). The bootstrap sampling method takes a random sample of data, splits it into three files--a training file, a testing file, and a validation file. The model or process logic is developed based on the data in the training file and then evaluated (or tested) using the testing file. This tune and then test process is repeated until the data scientist is comfortable with the results of the tests. At that point, the model or process is again tested, this time using the validation file, and the results should provide a true indication of how it will perform. [box type="note" align="" class="" width=""]You can imagine using the bootstrap sampling method to develop program logic by analyzing test data to determine logic flows and then running (or testing) your logic against the test data file. Once you are satisfied that your logic handles all of the conditions and exceptions found in your testing data, you can run a final test on a new, never-before-seen data file for a final validation test. [/box] Cross-validation Cross-validation is a method for assessing a data science process performance. Mainly used with predictive modeling to estimate how accurately a model might perform in practice, one might see cross-validation used to check how a model will potentially generalize, in other words, how the model can apply what it infers from samples to an entire population (or recordset). With cross-validation, you identify a (known) dataset as your validation dataset on which training is run along with a dataset of unknown data (or first seen data) against which the model will be tested (this is known as your testing dataset). The objective is to ensure that problems such as overfitting (allowing non-inclusive information to influence results) are controlled and also provide an insight into how the model will generalize a real problem or on a real data file. The cross-validation process will consist of separating data into samples of similar subsets, performing the analysis on one subset (called the training set) and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Typically, a data scientist will use a models stability to determine the actual number of rounds of cross-validation that should be performed. Neural networks Neural networks are also called artificial neural networks (ANNs), and the objective is to solve problems in the same way that the human brain would. Google will provide the following explanation of ANN as stated in Neural Network Primer: Part I, by Maureen Caudill, AI Expert, Feb. 
1989: [box type="note" align="" class="" width=""]A computing system made up of several simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. [/box] To oversimplify the idea of neural networks, recall the concept of software encapsulation, and consider a computer program with an input layer, a processing layer, and an output layer. With this thought in mind, understand that neural networks are also organized in a network of these layers, usually with more than a single processing layer. Patterns are presented to the network by way of the input layer, which then communicates to one (or more) of the processing layers (where the actual processing is done). The processing layers then link to an output layer where the result is presented. Most neural networks will also contain some form of learning rule that modifies the weights of the connections (in other words, the network learns which processing nodes perform better and gives them a heavier weight) per the input patterns that it is presented with. In this way (in a sense), neural networks learn by example as a child learns to recognize a cat from being exposed to examples of cats. Boosting In a manner of speaking, boosting is a process generally accepted in data science for improving the accuracy of a weak learning data science process. [box type="note" align="" class="" width=""]Data science processes defined as weak learners are those that produce results that are only slightly better than if you would randomly guess the outcome. Weak learners are basically thresholds or a 1-level decision tree. [/box] Specifically, boosting is aimed at reducing bias and variance in supervised learning. What do we mean by bias and variance? Before going on further about boosting, let's take note of what we mean by bias and variance. Data scientists describe bias as a level of favoritism that is present in the data collection process, resulting in uneven, disingenuous results and can occur in a variety of different ways. A sampling method is called biased if it systematically favors some outcomes over others. A variance may be defined (by a data scientist) simply as the distance from a variable mean (or how far from the average a result is). The boosting method can be described as a data scientist repeatedly running through a data science process (that has been identified as a weak learning process), with each iteration running on different and random examples of data sampled from the original population recordset. All the results (or classifiers or residue) produced by each run are then combined into a single merged result (that is a gradient). This concept of using a random subset of the original recordset for each iteration originates from bootstrap sampling in bagging and has a similar variance-reducing effect on the combined model. In addition, some data scientists consider boosting a means to convert weak learners into strong ones; in fact, to some, the process of boosting simply means turning a weak learner into a strong learner. Lift In data science, the term lift compares the frequency of an observed pattern within a recordset or population with how frequently you might expect to see that same pattern occur within the data by chance or randomly. If the lift is very low, then typically, a data scientist will expect that there is a very good probability that the pattern identified is occurring just by chance. 
The larger the lift, the more likely it is that the pattern is real. Mode In statistics and data science, when a data scientist uses the term mode, he or she refers to the value that occurs most often within a sample of data. The mode is not calculated; it is determined manually or through processing of the data. Outlier Outliers can be defined as follows: A data point that is way out of keeping with the others That piece of data that doesn't fit Either a very high value or a very low value Unusual observations within the data An observation point that is distant from all others Predictive modeling The development of statistical models and/or data science processes to predict future events is called predictive modeling. Big Data Again, we have some variation in the definition of big data. A large assemblage of data, datasets that are so large or complex that traditional data processing applications are inadequate, and data about every aspect of our lives have all been used to define or refer to big data. In 2001, then Gartner analyst Doug Laney introduced the 3Vs concept. The 3Vs, as per Laney, are volume, variety, and velocity. The Vs make up the dimensionality of big data: volume (the measurable amount of data), variety (the number of types of data), and velocity (the speed of processing or dealing with that data). Confidence interval The confidence interval is a range of values that a data scientist will specify around an estimate to indicate their margin of error, combined with a probability that a value will fall in that range. In other words, confidence intervals are good estimates of the unknown population parameter. Writing Although visualizations grab much more of the limelight when it comes to presenting the output or results of a data science process or predictive model, writing is not only an important part of how a data scientist communicates but is still considered an essential skill for all data scientists to be successful. Did we miss any of your favorite terms? Now that you are at the end of this post, we ask you again: On a scale of 1 to 30 (1 being the lowest and 30, the highest), how do you rate yourself as a data scientist? Why You Need to Know Statistics To Be a Good Data Scientist [interview] How data scientists test hypotheses and probability 6 Key Areas to focus on while transitioning to a Data Scientist role Soft skills every data scientist should teach their child
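To tie a few of the simpler terms above together - mode, outlier and confidence interval - here is a short sketch using NumPy, SciPy and the standard library; the sample values are invented purely for illustration:

```python
import statistics
import numpy as np
from scipy import stats

data = np.array([12, 15, 14, 15, 16, 15, 14, 13, 15, 48])   # invented sample

# Mode: the value that occurs most often
print("mode:", statistics.mode(data.tolist()))

# A simple outlier check: values more than 2 standard deviations from the mean
z_scores = (data - data.mean()) / data.std(ddof=1)
print("possible outliers:", data[np.abs(z_scores) > 2])

# A 95% confidence interval for the population mean, using the t distribution
mean, sem = data.mean(), stats.sem(data)
margin = stats.t.ppf(0.975, df=len(data) - 1) * sem
print("95% CI:", (mean - margin, mean + margin))
```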

9 Data Science Myths Debunked

Amey Varangaonkar
03 Jul 2018
9 min read
The benefits of data science are evident for all to see. Not only does it equip you with the tools and techniques to make better business decisions, the predictive power of analytics also allows you to determine future outcomes - something that can prove to be crucial to businesses. Despite all these advantages, data science is a touchy topic for many businesses. It's worth looking at some glaring stats that show why businesses are reluctant to adopt data science:

Poor data across businesses and organizations - both private and government - costs the U.S. economy close to $3 trillion per year.
Only 29% of enterprises are able to properly leverage the power of Big Data and derive useful business value from it.

These stats show a general lack of awareness or knowledge when it comes to data science. Preconceived notions, or simply a lack of knowledge about data science and its application, seem to be a huge hurdle for these companies. In this article, we attempt to take down some of these notions and give a much clearer picture of what data science really is. Here are 9 of the most common myths or misconceptions in data science, and why they are absolutely wrong:

Data science is just a fad, it won't last long

This is probably the most common misconception. Many tend to forget that although 'data science' is a recently coined term, this field of study is a culmination of decades of research and innovation in statistical methodologies and tools. It has been in use since the 1960s or even before - just that the scale at which it was being used then was small. Back in the day, there were no 'data scientists', just statisticians and economists who used now largely forgotten terms such as 'data fishing' or 'data dredging'. Even the terms 'data analysis' and 'data mining' only went mainstream in the 1990s, but they were in use well before that period.

Data science's rise to fame has coincided with the exponential rise in the amount of data being generated every minute. The need to understand this information and make positive use of it led to an increase in the demand for data science. Now, with Big Data and the Internet of Things growing rapidly, the rate of data generation and the subsequent need for its analysis will only increase. So if you think data science is a fad that will go away soon, think again.

Data science and Business Intelligence are the same

Those who are unfamiliar with what data science and Business Intelligence actually entail often get confused and think they're one and the same. They're not. Business Intelligence is an umbrella term for the tools and techniques that give answers to the operational and contextual aspects of your business or organization. Data science, on the other hand, has more to do with collecting information in order to build patterns and insights. Learning about your customers or your audience is Business Intelligence. Understanding why something happened, or whether it will happen again, is data science. If you want to gauge how changing a certain process will affect your business, data science - not Business Intelligence - is what will help you.

Data science is only meant for large organizations with large resources

Many businesses and entrepreneurs are wrongly of the opinion that data science is - or works best - only for large organizations, and that you need sophisticated infrastructure to process and get the most value out of your data. In reality, all you need is a bunch of smart people who know how to get the best value out of the available data. When it comes to taking a data-driven approach, there's no need to invest a fortune in setting up an analytics infrastructure for an organization of any scale. There are many open source tools out there which can be easily leveraged to process large-scale data with efficiency and accuracy. All you need is a good understanding of the tools.

It is difficult to integrate data science systems with the organizational workflow

With the advancement of tech, one critical challenge that has now become very easy to overcome is collaborating with different software systems at once. With the rise of general-purpose programming languages, it is now possible to build a variety of software systems using a single programming language. Take Python, for example. You can use it to analyze your data, perform machine learning, or develop neural networks to work on more complex data models. All the while, you can link a web API designed in Python to communicate with these data science systems. Provisions are also being made to integrate code written in different programming languages while ensuring smooth interoperability and no loss of latency. So if you're wondering how to incorporate your analytics workflow into your organizational workflow, don't worry too much.

Data scientists will be replaced by Artificial Intelligence soon

Although there has been increased adoption of automation in data science, the notion that the work of a data scientist will soon be taken over by an AI algorithm is rather interesting. Currently, there is an acute shortage of data scientists, as this McKinsey Global Report suggests. Could this change in the future? Will automation completely replace human efforts when it comes to data science? Surely machines are a lot better than humans at finding patterns; AI beat the best Go player, remember. This is what the common perception seems to be, but it is not true. However sophisticated the algorithms become in automating data science tasks, we will always need a capable data scientist to oversee them and fine-tune their performance. Not just that, businesses will always need professionals with strong analytical and problem-solving skills and relevant domain knowledge. They will always need someone to communicate the insights coming out of the analysis to non-technical stakeholders. Machines don't ask questions of data. Machines don't convince people. Machines don't understand the 'why'. Machines don't have intuition. At least, not yet. Data scientists are here to stay, and their demand is not expected to go down anytime soon.

You need a Ph.D. in statistics to be a data scientist

No, you don't. Data science involves crunching numbers to get interesting insights, and it often involves the use of statistics to better understand the results. When it comes to performing some advanced tasks such as machine learning and deep learning, sure, an advanced knowledge of statistics helps. But that does not imply that people who do not have a degree in maths or statistics cannot become expert data scientists. Today, organizations are facing a severe shortage of data professionals capable of leveraging the data to get useful business insights. This has led to the rise of citizen data scientists - professionals who are not experts in data science, but can use the data science tools and techniques to create efficient data models.
These data scientists are no experts in statistics and maths, they just know the tool inside out, ask the right questions, and have the necessary knowledge of turning data into insights. Having an expertise of the data science tools is enough Many people wrongly think that learning a statistical tool such as SAS, or mastering Python and its associated data science libraries is enough to get the data scientist tag. While learning a tool or skill is always helpful (and also essential), by no means is it the only requisite to do effective data science. One needs to go beyond the tools and also master skills such as non-intuitive thinking, problem-solving, and knowing the correct practical applications of a tool to tackle any given business problem. Not just that, it requires you to have excellent communication skills to present your insights and findings related to the most complex of analysis to other stakeholders, in a way they can easily understand and interpret. So if you think that a SAS certification is enough to get you a high-paying data science job and keep it, think again. You need to have access to a lot of data to get useful insights Many small to medium-sized businesses don’t adopt a data science framework because they think it takes lots and lots of data to be able to use the analytics tools and techniques. Data when present in bulk, always helps, true, but don’t need hundreds of thousands of records to identify some pattern, or to extract relevant insights. Per IBM, data science is defined by the 4 Vs of data, meaning Volume, Velocity, Veracity and Variety. If you are able to model your existing data into one of these formats, it automatically becomes useful and valuable. Volume is important to an extent, but it’s the other three parameters that add the required quality. More data = more accuracy Many businesses collect large hordes of information and use the modern tools and frameworks available at their disposal for analyzing this data. Unfortunately, this does not always guarantee accurate results. Neither does it guarantee useful actionable insights or more value. Once the data is collected, the preliminary analysis on what needs to be done with the data is required. Then, we use the tools and frameworks at our disposal to extract the relevant insights and built an appropriate data model. These models need to be fine-tuned as per the processes for which they will be used. Then, eventually, we get the desired degree of accuracy from the model. Data in itself is quite useless. It’s how we work on it - more precisely, how effectively we work on it - that makes all the difference. So there you have it! Data science is one of the most popular skills to have in your resume today, but it is important to first clear all the confusions and misconceptions that you may have about it. Lack of information or misinformation can do more harm than good, when it comes to leveraging the power of data science within a business - especially considering it could prove to be a differentiating factor for its success and failure. Do you agree with our list? Do you think there are any other commonly observed myths around data science that we may have missed? Let us know. Read more 30 common data science terms explained Why is data science important? 15 Useful Python Libraries to make your Data Science tasks Easier

Uber's kepler.gl, an open source toolbox for GeoSpatial Analysis

Pravin Dhandre
28 Jun 2018
4 min read
Geography Visualization, also called Geovisualization, plays a pivotal role in areas like cartography, geographic information systems, remote sensing and global positioning systems. Uber, a peer-to-peer transportation network company headquartered in California, believes in data-driven decision making and hence keeps developing smart frameworks like deck.gl for exploring and visualizing advanced geospatial data at scale. Uber strives to make the data web-based and shareable in real time across their teams and customers.

Early this month, Uber surprised the geospatial market with its newly open-sourced toolbox, kepler.gl, a geoanalytics tool to gain quick insights from geospatial data with intuitive visualizations.

What exactly is kepler.gl?

kepler.gl is a visualization-rich web platform, developed on top of deck.gl, a WebGL-powered data visualization library providing real-time visual analytics of millions of geolocation points. The platform provides visual exploration of geographical datasets along with spatial aggregation of all data points collected. The platform is said to be data-agnostic, with a single interface to convert your data into insightful visualizations.

https://www.youtube.com/watch?v=i2fRN4e2s0A

The platform is very user-friendly: you can simply drag CSV or GeoJSON files and drop them into the browser to visualize the dataset more intuitively. It supports different map layers, a filtering option, and an aggregation feature through which you can get the final visualization in an animated format, like a video. The usability of the features is so high that you can apply all the available metrics to your data points without much hassle. The web platform exhibits high performance, letting you get insights from your spatial data in less than 10 minutes, and that too in a single window. Another advantage of this framework is that it does not involve any sort of coding, so non-technical users can also reap the benefits by churning out valuable insights from the data points.

The platform is also equipped with some advanced, complex features such as a 2D cartographic plane, a separate dimension for altitude, and control over the visible height of hexagons and grids. Users seem happy with the new height feature, which helps them detect abnormalities and illicit traits in an aggregated map. With the filtering menu, analysts and engineers can compare their data and have a granular look at their data points. This option also helps in reading the histogram well, so one can easily detect outliers and make the dataset more reliable. It also has a feature to add playback to time series data points, which makes getting useful information out of real-time location systems easy.

The team at Uber looks at this toolbox with a long-term vision: they plan to keep adding new features and enhancements to make it a highly functional, single-click visualization dashboard. The team has already announced two major enhancements to the current functionality in the next couple of months:

More robust exploration: There will be interlinkage between charts and maps, and support for custom charts, maps and widgets like the renowned BI tool Tableau, which will help analytics teams unveil deeper insights.
Addition of newer geo-analytical capabilities: To support massive datasets, there will be added features for data operations such as polygon aggregation, union of data points, and operations like joining and buffering.

Companies across different verticals such as Airbnb, Atkins Global, Cityswifter and Mapbox have found great value in kepler.gl's offerings and are looking to engineer their products to leverage this framework. The visualization specialists at these companies have already praised Uber for building such a simple yet fast platform with remarkable capabilities.

To get started with kepler.gl, read the documentation available on GitHub and start creating visualizations to enhance your geospatial data analysis; a minimal notebook sketch is shown after the related links below.
Top 7 libraries for geospatial analysis
Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data
Data Visualization with ggplot2
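The sketch promised above: a minimal way to try kepler.gl from a Jupyter notebook, assuming the keplergl Python package (the official Jupyter widget) and pandas are installed; the trips.csv file name is a hypothetical placeholder for any dataset with latitude/longitude columns:

```python
# Run inside a Jupyter notebook, after `pip install keplergl pandas`
import pandas as pd
from keplergl import KeplerGl

# Hypothetical file: any CSV with latitude/longitude columns will do
trips = pd.read_csv("trips.csv")

kepler_map = KeplerGl(height=500)            # create the interactive map widget
kepler_map.add_data(data=trips, name="trips")
kepler_map                                    # evaluating the widget renders the map inline

# The configured map can also be exported as a standalone HTML page
kepler_map.save_to_html(file_name="trips_map.html")
```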

Python, Tensorflow, Excel and more - Data professionals reveal their top tools

Amey Varangaonkar
06 Jun 2018
4 min read
Data professionals are constantly on the lookout for the best tools to simplify their data science tasks - be it data acquisition, machine learning, or visualizing the results of the analysis. With so much on their plate already, having robust, efficient tools in the arsenal helps them a lot in reducing the procedural complexities. Not just that, the time taken to do these tasks is considerably reduced as well. But what tools do data professionals rely on to make their lives easier? Thanks to the Skill-up 2018 survey that we recently conducted, we have some interesting observations to share with you! Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free. Key Takeaways Python is the most widely used programming language by data professionals Python finds a wide adoption across all spectrums of data science - including data analysis, machine learning, deep learning and data visualization Excel continues to be favored by the data professionals because of its effectiveness and simplicity R is slowly falling behind Python in the race to Data Science supremacy Now, let’s look at these observations, in more depth. Python continues its ascension as the top dog Python’s rise in popularity as well as adoption over the last 3 years has been quite staggering, to say the least. Python’s ease of use, powerful analytical and machine learning capabilities as well as its applications outside of data science make it quite a popular language in the tech community. It thus comes as no surprise that it stood out from the others and was the undisputed choice of language for the data pros. R, on the other hand, seems to be finding it difficult to play catch-up to Python, with less than half the number of votes - despite being the tool of choice for many statisticians and researchers. Is the paradigm shift well and truly on? Is Python edging R out for good? Source: Packt Skill-Up Survey 2018 It is interesting to see SQL as the number 2, but considering the number of people working with databases these days it doesn’t come as a surprise. Also, JavaScript is preferred more than Java, indicating the rising need for web-based dashboards for effective Business Intelligence. Data professionals still love Excel, but Python libraries are taking over Microsoft Excel has traditionally been a highly popular tool for data analysis, especially when dealing with data with hundreds and thousands of records. Excel’s perfect setting for data manipulation and charting continues to be the reason why people still use it for basic-level data analysis, as indicated by our survey. Almost 53% of the respondents prefer having Excel in their analysis toolkit for their day to day tasks. Top libraries, tools and frameworks used by data professionals (Source: Packt Skill-Up Survey 2018) The survey also indicated Python’s rising dominance in the data science domain, with 8 out of the 10 most-used tools for data analysis being Python-based. Python’s offerings for data wrangling, scientific computing, machine learning and deep learning make its libraries the obvious choice for data professionals. Here’s a quick look at  15 useful Python libraries to make the above-mentioned data science tasks easier. Tensorflow and PyTorch are in demand AI’s popularity is soaring with every passing day as it finds applications across all types of industries and business domains. 
In our survey, we found machine learning and deep learning to be two of the most valuable skills to have for any data scientist, as can be seen from the word cloud below: Word cloud for the most valued skills by data professionals (Source: Packt Skill-Up Survey) Python’s two popular deep learning frameworks - Tensorflow and PyTorch have thus gained a lot of attention and adoption in the recent times. Along with Keras - another Python library - these two libraries are the most used frameworks used by data scientists and ML developers for building efficient machine learning and deep learning models. Which language/libraries do you use for your everyday Data Science tasks? Do you agree with your peers’ choice of tools? Feel free to let us know! Read more Data cleaning is the worst part of data analysis, say data scientists 30 common data science terms explained Top 10 deep learning frameworks
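As a small illustration of why these Python frameworks top the survey, here is a minimal tf.keras model definition - a generic sketch of the API, not a model taken from the survey itself; the layer sizes and input shape are arbitrary:

```python
# A minimal Keras model definition using the TensorFlow backend (tf.keras)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=5)   # X_train / y_train would be your own data
```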

Messaging app Telegram's updated Privacy Policy is an open challenge

Amarabha Banerjee
08 Sep 2018
7 min read
Social media companies are facing a lot of heat at present because of their privacy issues. One of them is Facebook; the Cambridge Analytica scandal even prompted a Senate hearing for Mark Zuckerberg. On the other end of this spectrum is another messaging app, Telegram, registered in London, United Kingdom, and founded by the Russian entrepreneur Pavel Durov. Telegram has been in the news for the absolutely opposite situation. It's often touted as one of the most secure and secretive messaging apps. Its end-to-end encryption ensures that security agencies across the world have a tough time getting access to any suspicious piece of information. For this reason, Russia banned the use of the Telegram app in April 2018.

Telegram recently updated its privacy policy. These updates have further ensured that Telegram will retain the title of the most secure messaging application on the planet. Any messaging app needs access to our data, but how it chooses to use that data makes you either vulnerable or secure. In its latest update, Telegram states that it processes personal data on the grounds that such processing caters to the following two goals:

Providing effective and innovative Services to our users
To detect, prevent or otherwise address fraud or security issues in respect of their provision of Services

The caveat to the second point is that such security interests shall not override the fundamental rights and freedoms that require the protection of personal data. This clause is an excellent example of how applications can prove to be a torchbearer for human rights and basic human privacy amidst glaring loopholes. Telegram has listed the kinds of user data accessed by the app. They are as follows:

Basic Account Data

Telegram stores basic account user data that includes the mobile number, profile name, profile picture and about information, which are needed to create a Telegram account. The most interesting part of this is that Telegram allows you to keep only your username public (if you choose to). The people who have you in their contact list will see you as you want them to - for example, you might be a John Doe in public, but your mom will still see you as 'Dear Son' in her contacts. Telegram doesn't require your real name, gender, age, or even your screen name to be your real name.

E-mail Address

When you enable 2-step verification for your account or store documents using the Telegram Passport feature, you can opt to set up a password recovery email. This address will only be used to send you a password recovery code if you forget your password. Telegram is particular about not sending any unsolicited marketing emails to you.

Personal Messages

Cloud Chats

Telegram stores messages, photos, videos and documents from your cloud chats on their servers so that you can access your data from any of your devices anytime without having to rely on third-party backups. All data is stored heavily encrypted, and the encryption keys in each case are stored in several other data centers in different jurisdictions. This way, local engineers or physical intruders cannot get access to user data.

Secret Chats

Telegram has a feature called secret chats that uses end-to-end encryption. This means that all data is encrypted with a key that only the sender and the recipients know. There is no way for Telegram or anybody else without direct access to your device to learn what content is being sent in those messages. Telegram does not store secret chats on their servers.
They also do not keep any logs for messages in secret chats, so after a short period of time there is no way of determining who or when you messaged via secret chats. Secret chats are not available in the cloud — you can only access those messages from the device they were sent to or from. Media in Secret Chats When you send photos, videos or files via secret chats, before being uploaded, each item is encrypted with a separate key, not known to the server. This key and the file’s location are then encrypted again, this time with the secret chat’s key — and sent to your recipient. They can then download and decipher the file. This means that the file is technically on one of Telegram’s servers, but it looks like a piece of random indecipherable garbage to everyone except for you and the recipient. This complete process is random and there random data packets are periodically purged from the storage disks too. Public Chats In addition to private messages, Telegram also supports public channels and public groups. All public chats are cloud chats. Like everything else on Telegram, the data you post in public communities is encrypted, both in storage and in transit — but everything you post in public will be accessible to everyone. Phone Number and Contacts Telegram uses phone numbers as unique identifiers so that it is easy for you to switch from SMS and other messaging apps and retain your social graph. But the most important thing is that permissions from the users are a must before the cookies are allowed into your browser. Cookies Telegram promises that the only cookies they use are those to operate and provide their Services on the web. They clearly state that they don’t use cookies for profiling or advertising. Their cookies are small text files that allow them to provide and customize their Services, and provide an enhanced user experience. Also, whether or not to use these cookies is a choice made by the users. So, how does Telegram remain in business? The Telegram business model doesn’t match that of a revenue generating service. The founder Pavel Durov is also the founder of the popular Russian social networking site VK. Telegram doesn’t charge for any messaging services, it doesn’t show ads yet. Some new in app purchase features might be included in the new version. As of now, the main source of revenue for Telegram are donations and mainly the earnings of Pavel Durov himself (from the social networking site VK). What can social networks learn from Telegram? Telegram’s policies elevate privacy standards that many are asking from other social messaging apps. The clamour for stopping the exploitation of user data, using their location details for targeted marketing and advertising campaigns is increasing now. Telegram shows that privacy can be achieved, if intended, in today’s overexposed social media world. But there is are also costs to this level of user privacy and secrecy, that are sometimes not discussed enough. The ISIS members behind the 2015 Paris attacks used Telegram to spread propaganda. ISIS also used the app to recruit the perpetrators of the Christmas market attack in Berlin last year and claimed credit for the massacre. More recently, a Turkish prosecutor found that the shooter behind the New Year’s Eve attack at the Reina nightclub in Istanbul used Telegram to receive directions for it from an ISIS leader in Raqqa. 
While these incidents can never negate the need for a secure and less intrusive social media platform like Telegram, there should be workarounds and escape routes designed to stop extremist and terrorist activities. Telegram has assured that all ISIS messaging channels have been deleted from its network, which is a great way to start. Content moderation, proactive sentiment and pattern recognition, and content/account isolation are the next challenges for Telegram. One thing is for sure: Telegram's continual pursuit of user secrecy and user data privacy is throwing an open challenge to others to follow suit. Whether others will oblige or not, only time will tell. To read about Telegram's updated privacy policy in detail, you can check out the official Telegram Privacy Settings. A simplified code sketch of the 'separate key per file' idea from the Media in Secret Chats section follows the related links below.
How to stay safe while using Social Media
Time for Facebook, Twitter and other social media to take responsibility or face regulation
What RESTful APIs can do for Cloud, IoT, social media and other emerging technologies
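The sketch referenced above is a deliberately simplified illustration of envelope encryption, using the Python cryptography package's Fernet recipe. It is not Telegram's actual MTProto scheme - just the general pattern the article describes: encrypt each file with its own key, then encrypt that key with the chat key.

```python
# Simplified illustration only -- not Telegram's actual MTProto scheme
from cryptography.fernet import Fernet

chat_key = Fernet.generate_key()             # key shared only by the two chat partners
file_bytes = b"...photo or video contents..."

# 1. Encrypt the file with a fresh, file-specific key
file_key = Fernet.generate_key()
encrypted_file = Fernet(file_key).encrypt(file_bytes)   # stored server-side as opaque bytes

# 2. Encrypt the file key itself with the chat key and send it to the recipient
wrapped_key = Fernet(chat_key).encrypt(file_key)

# Only someone holding the chat key can unwrap the file key and then the file
recovered_key = Fernet(chat_key).decrypt(wrapped_key)
print(Fernet(recovered_key).decrypt(encrypted_file) == file_bytes)   # True
```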

article-image-analyzing-the-chief-data-officer-role
Aaron Lazar
13 Apr 2018
9 min read
Save for later

GDPR is pushing the Chief Data Officer role center stage

Aaron Lazar
13 Apr 2018
9 min read
Gartner predicted that by 2020, 90% of large organizations in regulated industries will have a Chief Data Officer role. With the recent heat around Facebook, Mark Zuckerberg, and the fast-approaching GDPR compliance deadline, it’s quite likely that 2018 will be the year of the Chief Data Officer. This article was first published in October 2017 and has been updated to keep up with the latest trends in GDPR.

In 2014, around 400 CEOs and top business execs were asked how they recognize data as a corporate asset. They responded with a mixed set of reactions and viewed the worth of data in their organization in varied ways. Now, in 2018, these reactions have drastically changed: more and more organizations have realized the importance of Data-as-an-Asset. More importantly, the European Union (EU) has made it mandatory that General Data Protection Regulation (GDPR) compliance be achieved by the 25th of May, 2018.

The primary reason for creating the Chief Data Officer role was to connect the ends between functional management and IT teams in an organization. But now, it looks like the CDO will also be primarily focusing on setting up and driving GDPR compliance, to avoid a fine of up to €20 million or 4% of the annual global turnover of the previous year, whichever is higher. We’re going to spend a few minutes breaking down the Chief Data Officer role for you, revealing several interesting insights along the way. Let’s start with the obvious question.

What might the Chief Data Officer’s responsibilities be?

Like other C-suite execs, a Chief Data Officer is expected to have a well-blended mix of technical know-how and business acumen. Their role is very diverse, and its scope can be hard to pin down. Here are some of the key responsibilities of a Chief Data Officer:

Data Policies and GDPR Compliance

Data security is one of the most important elements that any business must consider. It needs to comply with the regulatory standards and requirements of the country where it operates. A Chief Data Officer is responsible for ensuring the compliance of policies across all branches of the business and the associated compliance requirement taxonomies, on a global level.

What is GDPR?

If you’re thinking there were no data protection laws before the General Data Protection Regulation 2016/679, that’s not true. The major difference, however, is that GDPR focuses more on customer data privacy and protection. GDPR requirements will change the way organizations store, process and protect their customers’ personal data. They will need solutions for assessing, implementing and maintaining GDPR compliance, and that’s where Chief Data Officers fit in.

Using data to gain a competitive edge

Chief Data Officers need to have sound knowledge of the business’ customers and the markets where they operate, and strong analytical skills to leverage the right data, at the right time, in the right place. This would eventually give the business an edge over its competitors in the market. For example, a ferry service could use data to identify the rates that customers would be willing to pay at a certain time of the day.

Setting in motion the best practices of data governance

Organizations span the globe these days, and employees from different parts of the world often work on the same data. This can result in data moving through unconnected systems and ending up as inefficient or disjointed pieces of business information.
A Chief Data Officer needs to ensure that this information is aggregated and maintained in such a way that clear information ownership across the organization is established.

Architecting future-proof data solutions

A Chief Data Officer often acts like a Data Architect. They will sometimes take on responsibility for planning, designing, and building Big Data systems and ensuring successful integration with other systems in an organization. Designing systems that can provide answers to the user’s problems now and in the future is vital. Chief Data Officers are often found asking themselves key questions like how to generate data with maximum reusability, while also making sure it’s as accurate and relevant as possible.

Defining Information Management tools

Different business units across the globe tend to use different tools, technologies, and systems to work on, store, and share information on an enterprise level. This greatly affects a company’s ability to access and leverage data for effective decision making and various other duties. A Chief Data Officer is responsible for establishing data-oriented standards across the business and for driving all arms of the business to comply with the standards and embrace change, to ensure the integrity of data.

Spotting new opportunities

A Chief Data Officer is responsible for spotting new opportunities that the business can venture into, through careful analysis of data and past records. For example, a motor company could leverage sales information to make an informed decision on which age group to target with its new SUV in the making, to maximize sales. These are just the tip of the iceberg when it comes to a Chief Data Officer’s responsibilities. Responsibilities go hand in hand with the skills and traits required to execute them. Below are some key capabilities sought after in a CDO.

Key Chief Data Officer Skills

The right person for the job is expected to possess impeccable leadership and C-suite level communication skills, as well as strong business acumen. They are expected to have strong knowledge of GDPR software tools and solutions to enable the organizations hiring them to swiftly transform and adopt the new regulations. They are also expected to possess knowledge of IT architecture, including a familiarity with leading architectural standards such as TOGAF or the Zachman framework. They need to be experienced in driving data governance as well as data quality and integrity, while also possessing a strong knowledge of data analytics, visualization, and storytelling. Familiarity with Big Data solutions like Hadoop, MapReduce, and HBase is a plus.

How much do Chief Data Officers earn?

Now, let’s take a look at what kind of compensation a Chief Data Officer is likely to be offered. To tell you the truth, the answer to this question is still a bit hazy, but it’s sure to pick up speed with the recent developments in the regulatory and legal areas related to data. About a year ago, a blog post from careeraddict revealed that the salary for a CDO in the US was around $112,000 annually, while a job listing seen on Indeed quoted $200,000 as the annual salary. Indeed shows 7 jobs posted for a CDO in the last 15 days. We took the estimated salaries of CDOs and compared them with those of CIOs and CTOs in the same company. It turns out that most were on par, with a few CDO compensations falling slightly short of the CIO and CTO salary. These are just basic salary figures; bonuses come on top, amounting to as much as 50% in some cases.
However, please note that these salary figures vary heavily based on the type of organization and the industry.

Do businesses even need a Chief Data Officer?

One might argue that some of the skills expected of a Chief Data Officer would also be held by the CIO, the Chief Digital Officer, or the Data Protection Officer (if the organization has one). Then why have a Chief Data Officer at all and incur a significant extra cost to the company? With the rapid change in tech and the rate at which data is generated, used, and discarded, most indicators point in the direction of having a separate Chief Data Officer working alongside the CIO. It’s critical to have a clearly defined need for both roles to co-exist. Blurring the boundaries of the two roles can be detrimental, and organizations must therefore be painstakingly mindful of the defined KRAs; the organization should clearly define the two roles to keep the business structure running smoothly.

A Chief Data Officer’s main focus will be on the latest data-centric technological innovations and their compliance with the new standards, while also boosting customer engagement, privacy and, in turn, loyalty and the business’s competitive advantage. The CIO, on the other hand, focuses on improving the bottom line by owning business productivity metrics, cost-cutting initiatives, making IT investments and so on – an inward-facing data-management and architecture role. The CIO is therefore the person responsible for leading digital initiatives at a board level. In addition to managing data and governing information, if the CIO’s responsibilities were to also include implementing analytics in fresh ways to generate value for the business, it would be a tall order. To put it simply, it is more practical for the CIO to own the systems and the CDO to oversee all the bits and bytes that flow through these systems. Moreover, in several cases, the Chief Data Officer will act as a liaison between the business and IT. Thus, Chief Data Officers and CIOs need to work together and support each other for a better functioning business.

The bottom line: a Chief Data Officer is essential

For an organization dealing with a lot of data, a Chief Data Officer is a must. Failure to have one on board can result in being fined €10 million or 2% of the organization’s worldwide turnover (whichever is higher). Here are the criteria that require an organization to have dedicated personnel managing data protection. The organization’s core activities should:

Have data processing operations which require regular and systematic monitoring of data subjects on a large scale, or monitoring of individuals
Involve processing special categories of data on a large scale (i.e. sensitive data such as health, religion, race, sexual orientation etc.)
Have data processing carried out by a public authority or a body processing personal data, except for courts operating in their judicial capacity

Apart from this mandate, a Chief Data Officer can add immense value by aligning data-driven insights with the organization’s vision and goals. A CDO can bridge the gap between the CMO and the CIO by focusing on meeting customer requirements through data-driven products. For those in data and insights centric roles such as data scientists, data engineers, data analysts and others, the CDO role is a natural destination in their career progression journey. The Chief Data Officer role is highly attractive in terms of the scope of responsibilities, the capabilities required and, of course, the pay.
Certification courses like this one are popping up to help individuals shape themselves for the role. All in all, this new C-suite position is, in most organizations, the perfect pivot between old and new, bridging silos and shaping a future where data privacy stays intact.

article-image-data-science-for-non-techies-how-i-got-started
Amey Varangaonkar
20 Jul 2018
7 min read
Save for later

Data science for non-techies: How I got started (Part 1)

Amey Varangaonkar
20 Jul 2018
7 min read
As a category manager, I manage the data science portfolio of product ideas for Packt Publishing, a leading tech publisher. In simple terms, I place informed bets on where to invest, what topics to publish on, and so on. While I have a decent idea of where the industry is heading and of what data professionals want to learn and why, it is high time I walked in their shoes, for a couple of reasons. Basically, I want to understand the reason behind Data Science being the ‘Sexiest job of the 21st century’, and whether the role is really worth all the fame and fortune. In the process, I also wanted to explore the underlying difficulties, challenges and obstacles that every data scientist has had to endure at some point in their journey, and perhaps still does. The cherry on top is that I get to use the skills I develop to supercharge my success in my current role, which is primarily insight-driven. This is the first of a series of posts on how I got started with Data Science. Today, I’m sharing my experience with devising a learning path and then gathering appropriate learning resources.

Devising a learning path

To understand the concepts of data science, I had to research a lot. There are tons and tons of resources out there, many of which are very good. Once you separate the good from the rest, it can be quite intimidating to pick the options that suit you best. Some of the primary questions that clouded my mind were:

What should be my programming language of choice? R or Python? Or something else?
What tools and frameworks do I need to learn?
What about the statistics and mathematical aspects of machine learning? How essential are they?

Two videos really helped me find the answers to the questions above:

If you don’t want to spend a lot of your time mastering the art of data science, there’s a beautiful video on how to become a data scientist in six months.
What are the questions asked in a data science interview? What are the in-demand skills that you need to master in order to get a data science job? This video on 5 Tips For Getting a Data Science Job is really helpful.

After a lot of research, which included reading countless articles and blogs and discussions with experts, here is my learning plan:

Learn Python

Per the recently conducted Stack Overflow Developer Survey 2018, Python stood out as the most-wanted programming language, meaning that the developers who do not use it yet want to learn it the most. As one of the most widely used general-purpose programming languages, Python finds large application when it comes to data science. Naturally, you get attracted to the best option available, and Python was the one for me. The major reasons why I chose to learn Python over the other programming languages:

Very easy to learn: Python is one of the easiest programming languages to learn. Not only is the syntax clean and easy to understand, even the most complex of data science tasks can be done in a few lines of Python code (the short sketch below gives a flavour of this).
Efficient libraries for Data Science: Python has a vast array of libraries suited to various data science tasks, from scraping data to visualizing and manipulating it. NumPy, SciPy, pandas, matplotlib and Seaborn are some of the libraries worth mentioning here.
Python has terrific libraries for machine learning: Learning a framework or a library which makes machine learning easier to perform is very important. Python has libraries such as scikit-learn and Tensorflow that make machine learning easier and a fun activity.
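To give a flavour of that conciseness, here is the short sketch mentioned above. It is my own illustration rather than something taken from these resources: it trains and evaluates a simple classifier on scikit-learn's bundled iris dataset in roughly a dozen lines.

```python
# A minimal "train and evaluate" flow using scikit-learn's bundled iris data,
# so there is nothing to download. My own illustration, not course material.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                      # features and class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)             # hold out a quarter for testing

model = LogisticRegression(max_iter=1000)              # a simple baseline classifier
model.fit(X_train, y_train)                            # train on the training split
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")  # evaluate on held-out data
```

That is a complete load–train–evaluate loop in about a dozen lines, which is exactly the kind of brevity that drew me to the language.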
To make the most of these libraries, it is important to understand the fundamentals of Python. My colleague and good friend Aaron has put out a list of top 7 Python programming books, which served as a brilliant starting point for understanding the different resources out there to learn Python. The one book that stood out for me was Learn Python Programming - Second Edition, a very good book for starting Python programming from scratch. There is also a neat skill-map on Mapt, where you can progressively build up your knowledge of Python, right from the absolute basics to the most complex concepts. Another handy resource to learn the A-Z of Python is Complete Python Masterclass. This is a slightly long course, but it will take you from the absolute fundamentals to the most advanced aspects of Python programming.

Task Status: Ongoing

Learn the fundamentals of data manipulation

After learning the fundamentals of Python programming, the plan is to head straight to the Python-based libraries for data manipulation, analysis and visualization. Some of the major ones are what we already discussed above, and the plan is to learn them in the following order:

NumPy - Used primarily for numerical computing
pandas - One of the most popular Python packages for data manipulation and analysis
matplotlib - The go-to Python library for data visualization, rivaling the likes of R’s ggplot2
Seaborn - A data visualization library that runs on top of matplotlib, used for creating visually appealing charts, plots and histograms

Some very good resources to learn about all these libraries:

Python Data Analysis
Python for Data Science and Machine Learning - This is a very good course with detailed coverage of the machine learning concepts. Something to learn later.

The aim is to learn these libraries up to a fairly intermediate level, and to be able to manipulate, analyze and visualize any kind of data, including missing, unstructured and time-series data (see the short sketch further below).

Understand the fundamentals of statistics, linear algebra and probability

In order to take a step further and enter the fray of machine learning, the general consensus is to first understand the maths and statistics behind the concepts of machine learning. Implementing them in Python is relatively easier once you get the math right, and that is what I plan to do. I shortlisted some very good resources for this as well:

Statistics for Machine Learning
Stanford University - Machine Learning Course at Coursera

Task Status: Ongoing

Learn Machine Learning (sounds odd, I know)

After understanding the math behind machine learning, the next step is to learn how to perform predictive modeling using popular machine learning algorithms such as linear regression, logistic regression, clustering, and more. Using real-world datasets, the plan is to learn the art of building state-of-the-art machine learning models using Python’s very own scikit-learn library, as well as the popular Tensorflow package. To learn how to do this, the courses I mentioned above should come in handy:

Stanford University - Machine Learning Course at Coursera
Python for Data Science and Machine Learning
Python Machine Learning, Second Edition

Task Status: To be started

Note: During the course of this journey, websites like Stack Overflow and Stack Exchange will be my best friends, along with popular resources such as YouTube.

As I start this journey, I plan to share my experiences and knowledge with you all.
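As a first taste, here is the small sketch promised above. It is my own example (the column name and values are made up for illustration): it builds a short time series with pandas, fills in the missing readings and plots the result with matplotlib.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A tiny, made-up daily time series with a couple of missing readings.
dates = pd.date_range("2018-07-01", periods=7, freq="D")
sales = pd.Series([120, 135, None, 150, None, 160, 170],
                  index=dates, name="units_sold")

print("Missing values:", sales.isna().sum())   # count the gaps
clean = sales.interpolate()                    # fill gaps by linear interpolation
print(clean.describe())                        # quick summary statistics

clean.plot(title="Daily units sold (interpolated)")  # matplotlib plot via pandas
plt.show()
```

Even an example this small touches pandas for the manipulation and matplotlib (via pandas' plot method) for the visualization, with NumPy doing the numerical work underneath.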
Do you think the learning path looks good? Is there anything else that I should include in my learning path? I would really love to hear your comments, suggestions and experiences. Stay tuned for the next post, where I seek answers to questions such as ‘How much of Python should I learn in order to be comfortable with Data Science?’, ‘How much time should I devote per day or week to learn the concepts in Data Science?’ and much more.
Read more
Why is data science important?
9 Data Science Myths Debunked
30 common data science terms explained
article-image-top-five-questions-to-ask-when-evaluating-a-data-monitoring-solution
Guest Contributor
27 Oct 2018
6 min read
Save for later

Top five questions to ask when evaluating a Data Monitoring solution

Guest Contributor
27 Oct 2018
6 min read
Massive changes are happening in the way IT services are consumed and delivered. Cloud-based infrastructure is being tied together and instrumented by DevOps processes, while microservices-driven apps are replacing monolithic architectures. This evolution is driving the need for greater monitoring and better analysis of data than we have ever seen before. The need is compounded by the fact that an application today may be instrumented with the help of sensors and devices that provide users with critical input for making decisions.

Why is there a need for monitoring and analysis?

The placement of sensors on practically every available surface in the material world – from machines to humans – is a reality today. Almost anything that is capable of giving off a measurable metric or recorded event can be instrumented, in the virtual world as well as the physical world, and has a need for monitoring. Metrics involve the consistent measurement of characteristics, such as CPU usage, while events are something that is triggered, such as a temperature reading crossing a threshold. The right instrumentation, observation and analytics are required to create business insight from the myriad of data points coming from these instruments.

In the virtual world, monitoring and controlling the software components that drive business processes is critical. Data monitoring in software is an important aspect of visualizing what systems are doing – what activities are happening, and precisely when – and how well the applications and services are performing.

There is, of course, a business justification for all this monitoring of constant streams of metrics and events data. Companies want to become more data-driven; they want to apply data insights to be better situationally aware of business opportunities and threats. A data-driven organization is able to predict outcomes more effectively than one relying on historical information or on gut instinct. When vast amounts of data points are monitored and analyzed, the organization can find interesting “business moments” in the data. These insights help identify emerging opportunities and competitive advantages.

How to develop a Data monitoring strategy

Establishing an overall IT monitoring strategy that works for everyone across the board is nearly impossible. But it is possible to develop a monitoring strategy which is uniquely tailored to specific IT and business needs. At a high level, organizations can start developing their Data monitoring strategy by asking these five fundamental questions:

#1 Have we considered all stakeholder needs?

One of the more common mistakes DevOps teams make is focusing the monitoring strategy on the needs of just a few stakeholders and not addressing the requirements of stakeholders outside of IT operations, such as line of business (LOB) owners, application developers and owners, and other subgroups within operations, such as network operations (NOC) or communications teams. For example, an app developer may need usage statistics around application performance, while the network operator might be interested in network bandwidth usage by that app’s users.

#2 Will the data capture strategy meet future needs?

Organizations, of course, must focus on the data capture needs of today at the enterprise level, but at the same time must consider the future. Developing a long-term plan helps in future-proofing the overall strategy, since data formats and data exchange protocols always evolve.
The strategy should also consider future needs around ingestion and query volumes. Planning for how much data will be generated, stored and archived will help establish a better long-term plan.

#3 Will the data analytics satisfy my organization’s evolving needs?

Data analysis needs always change over time. Stakeholders will ask for different types of analysis; planning ahead for those needs and opting for a flexible data analysis strategy will help ensure that the solution is able to support them.

#4 Is the presentation layer modular and embeddable?

A flexible user interface that addresses the needs of all stakeholders is important for meeting the organization’s overarching goals. Solutions which deliver configurable dashboards that enable users to specify queries for custom dashboards meet this need for flexibility. Organizations should consider a plug-and-play model which allows users to choose different presentation layers as needed.

#5 Does the architecture enable smart actions?

The ability to detect anomalies and trigger specific actions is a critical part of a monitoring strategy. A flexible and extensible model should be used to meet the notification preferences of diverse user groups. Organizations should consider self-learning models which can be trained to detect undefined anomalies from the collected data. Monitoring solutions which address the broader monitoring needs of the entire enterprise are preferred.

What are purpose-built monitoring platforms?

Devising an overall IT monitoring strategy that meets these needs and fundamental technology requirements is a tall order. But new purpose-built monitoring platforms have been created to deal with today’s new requirements for monitoring and analyzing these specific metrics and events workloads – often called time-series data – and for providing situational awareness to the business. These platforms support ingesting millions of data points per second, can scale both horizontally and vertically, are designed from the ground up to support real-time monitoring and decision making, and have strong machine learning and anomaly detection functions to aid in discovering interesting business moments. In addition, they are resource-aware, applying compression and down-sampling functions to aid in optimal resource utilization, and are built to support faster time to market with minimal dependencies.

With the right strategy in mind, and the right tools in place, organizations can address the evolving monitoring needs of the entire organization.

About the Author

Mark Herring is the CMO of InfluxData. He is a passionate marketeer with a proven track record of generating leads, building pipeline, and building vibrant developer and open source communities. He is a data-driven marketeer with a proven ability to see the forest for the trees, improve performance, and deliver on strategic imperatives. Prior to InfluxData, Herring was vice president of corporate marketing and developer marketing at Hortonworks, where he grew the developer community by over 40x. Herring brings over 20 years of relevant marketing experience from his roles at Software AG, Sun, Oracle, and Forte Software.

TensorFlow announces TensorFlow Data Validation (TFDV) to automate and scale data analysis, validation, and monitoring.
How AI is going to transform the Data Center.
Introducing TimescaleDB 1.0, the first OS time-series database with full SQL support.

article-image-what-is-statistical-analysis-and-why-does-it-matter
Sugandha Lahoti
02 Oct 2018
6 min read
Save for later

What is Statistical Analysis and why does it matter?

Sugandha Lahoti
02 Oct 2018
6 min read
As a data developer, the concept or process of data analysis may be clear in your mind. However, although there are similarities between the art of data analysis and that of statistical analysis, there are important differences to be understood as well. This article is taken from the book Statistics for Data Science by James D. Miller. The book takes you through an entire journey of statistics, from knowing very little to becoming comfortable using various statistical methods for data science tasks. In this article, we've broken things into the following topics:

What is statistical analysis and what are its best practices?
How to establish the nature of data?

What is Statistical analysis?

Some in the study of statistics describe statistical analysis as the part of a statistical project that involves the collection and scrutiny of a data source in an effort to identify trends within the data. With data analysis, the goal is to validate that the data is appropriate for a need, and with statistical analysis, the goal is to make sense of, and draw some inferences from, the data. There is a wide range of possible statistical analysis techniques or approaches that can be considered.

How to perform a successful statistical analysis

It is worthwhile to mention some key points for ensuring a successful (or at least productive) statistical analysis effort.

As soon as you can, decide on your goal or objective. You need to know what the win is, that is, what the problem or idea is that is driving the analysis effort. In addition, you need to make sure that, whatever is driving the analysis, the result obtained must be measurable in some way. This metric or performance indicator must be identified early.
Identify key levers. This means that once you have established your goals and a way to measure performance towards obtaining those goals, you also need to find out what has an effect on the performance towards obtaining each goal.
Conduct a thorough data collection. Typically, the more data the better, but in the absence of quantity, always go with quality.
Clean your data. Make sure your data has been cleaned in a consistent way so that data issues do not impact your conclusions (a short example follows after this list).
Model, model, and model your data. Modeling drives modeling. The more you model your data, the more questions you'll have asked and answered, and the better results you'll have.
Take time to grow your statistical analysis skills. It's always a good idea to continue to evolve your experience and style of statistical analysis. The way to improve is to do it. Another approach is to remodel the data you may have on hand for other projects to hone your skills.
Optimize and repeat. As always, you need to take the time for standardizing, following proven practices, using templates, and testing and documenting your scripts and models, so that you can re-use your best efforts over and over again. You will find that this time is well spent and even your better efforts will improve with use.
Finally, share your work with others! The more eyes, the better the product.
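Here is the short example promised in the list above. It is a minimal sketch of our own rather than one from the book, and the column names and cleaning rules are invented for illustration; the point is simply that one reusable function applied to every extract keeps the cleaning consistent.

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same cleaning rules to every extract so results stay comparable."""
    out = df.copy()
    out["region"] = out["region"].str.strip().str.title()          # normalize text labels
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")   # bad values become NaN
    out = out.dropna(subset=["amount"])                             # drop rows we cannot analyze
    return out.drop_duplicates()

# Example usage with a small, made-up extract.
raw = pd.DataFrame({"region": [" north", "South ", "north"],
                    "amount": ["100", "oops", "100"]})
print(clean_sales(raw))
```

Because every extract passes through the same function, conclusions drawn later are not skewed by one file having been cleaned differently from another.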
Some interesting advice on ensuring success with statistical projects includes the following quote:

"It's a good idea to build a team that allows those with an advanced degree in statistics to focus on data modeling and predictions, while others in the team – qualified infrastructure engineers, software developers and ETL experts – build the necessary data collection infrastructure, data pipeline and data products that enable streaming the data through the models and displaying the results to the business in the form of reports and dashboards."
– G Shapira, 2017

Establishing the nature of data

When asked about the objectives of statistical analysis, one often refers to the process of describing or establishing the nature of a data source. Establishing the nature of something implies gaining an understanding of it. This understanding can be both simple and complex. For example, can we determine the type of each of the variables or components found within our data source; are they quantitative, comparative, or qualitative?

A more advanced statistical analysis aims to identify patterns in data; for example, whether there is a relationship between the variables or whether certain groups are more likely to show certain attributes than others. Exploring the relationships presented in data may appear similar to the idea of identifying a foreign key in a relational database, but in statistics, relationships between the components or variables are based upon correlation and causation. Further, establishing the nature of a data source is also, really, a process of modeling that data source. During modeling, the process always involves asking questions such as the following (in an effort to establish the nature of the data):

What? Some common examples of this (what) are revenue, expenses, shipments, hospital visits, website clicks, and so on. In the example, we are measuring quantities, that is, the amount of product that is being moved (sales).
Why? This (why) will typically depend upon your project's specific objectives, which can vary immensely. For example, we may want to track the growth of a business, the activity on a website, or the evolution of a selected product or market interest. Again, in our current transactional data example, we may want to identify over- and under-performing sales types, and determine whether new or repeat customers provide more or fewer sales.
How? The how will most likely be over a period of time (perhaps a year, month, week, and so on) and then by some other related measure, such as a product, state, region, reseller, and so on. Within our transactional data example, we've focused on the observation of quantities by sale type.

Another way to describe establishing the nature of your data is adding context to it or profiling it. In any case, the objective is to allow the data consumer to better understand the data through visualization. Another motive for adding context or establishing the nature of your data can be to gain a new perspective on the data.

In this article, we explored the purpose and process of statistical analysis and listed the steps involved in a successful statistical analysis. Next, to learn about statistical regression and why it is important to data science, read our book Statistics for Data Science.
Estimating population statistics with Point Estimation.
Why You Need to Know Statistics To Be a Good Data Scientist.
Why choose IBM SPSS Statistics over R for your data analysis project.