Tech Guides - Data Analysis

Top 7 libraries for geospatial analysis

Aarthi Kumaraswamy
22 May 2018
12 min read
The term geospatial refers to information that is located on the earth's surface. This can include, for example, the position of a cellphone tower, the shape of a road, or the outline of a country. Geospatial data often associates some piece of information with a particular location. Geospatial development is the process of writing computer programs that can access, manipulate, and display this type of information.

Internally, geospatial data is represented as a series of coordinates, often in the form of latitude and longitude values. Additional attributes, such as temperature, soil type, height, or the name of a landmark, are also often present. There can be many thousands (or even millions) of data points for a single set of geospatial data. In addition to the prosaic tasks of importing geospatial data from various external file formats and translating data from one projection to another, geospatial data can also be manipulated to solve various interesting problems. Obvious examples include calculating the distance between two points, calculating the length of a road, or finding all data points within a given radius of a selected point. We use libraries to solve all of these problems and more. Today we will look at the major libraries used to process and analyze geospatial data:

GDAL/OGR
GEOS
Shapely
Fiona
Python Shapefile Library (pyshp)
pyproj
Rasterio
GeoPandas

This is an excerpt from the book Mastering Geospatial Analysis with Python by Paul Crickard, Eric van Rees, and Silas Toms.

Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library

The Geospatial Data Abstraction Library (GDAL) and the OGR Simple Features Library are two separate libraries that are generally downloaded together as GDAL. This means that installing the GDAL package also gives access to OGR functionality. GDAL is covered first because the other packages were written after it, so chronologically it comes first. As you will notice, some of the packages covered in this post extend GDAL's functionality or use it under the hood.

GDAL was created in the 1990s by Frank Warmerdam and saw its first release in June 2000. Later, development of GDAL was transferred to the Open Source Geospatial Foundation (OSGeo). Technically, GDAL is a little different from your average Python package, as the GDAL package itself was written in C and C++, meaning that in order to use it in Python you need to compile GDAL and its associated Python bindings. However, using conda and Anaconda makes it relatively easy to get started quickly. Because it was written in C and C++, the online GDAL documentation is written for the C++ version of the libraries. For Python developers this can be challenging, but many functions are documented and can be consulted with the built-in pydoc utility or by using the help function within Python.

Because of its history, working with GDAL in Python also feels a lot like working in C++ rather than pure Python. For example, OGR's naming convention differs from Python's: functions use uppercase names rather than lowercase. These differences explain the choice behind some of the other Python libraries, such as Rasterio and Shapely (also covered in this post), which have been written from a Python developer's perspective but offer the same GDAL functionality.

GDAL is a massive and widely used library for raster data. It supports the reading and writing of many raster file formats, with the latest version supporting up to 200 different file formats. Because of this, it is indispensable for geospatial data management and analysis. Used together with other Python libraries, GDAL enables some powerful remote sensing functionality. It is also an industry standard and is present in commercial and open source GIS software.

The OGR library is used to read and write vector-format geospatial data, supporting many different formats. OGR uses a consistent model to be able to manage many different vector data formats. You can use OGR to do vector reprojection, vector data format conversion, vector attribute data filtering, and more. The GDAL/OGR libraries are not only useful for Python programmers but are also used by many GIS vendors and open source projects. The latest GDAL version at the time of writing is 2.2.4, which was released in March 2018.
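
To make GDAL/OGR's C++-flavoured interface concrete, here is a minimal, hedged sketch of opening a shapefile with OGR and a GeoTIFF with GDAL. The file names are placeholders, and the snippet assumes the osgeo Python bindings are installed; note the uppercase, C++-style method names mentioned above.

```python
# Minimal sketch of GDAL/OGR usage; "countries.shp" and "elevation.tif"
# are placeholder file names, not files shipped with the library.
from osgeo import gdal, ogr

print(gdal.VersionInfo())               # report the installed GDAL version

# OGR for vector data: open a shapefile and inspect its first layer
vector_ds = ogr.Open("countries.shp")
layer = vector_ds.GetLayer(0)
print(layer.GetFeatureCount(), layer.GetGeomType())

# GDAL for raster data: open a GeoTIFF and read one band into a NumPy array
raster_ds = gdal.Open("elevation.tif")
band = raster_ds.GetRasterBand(1)
values = band.ReadAsArray()             # returns a NumPy array of pixel values
print(raster_ds.RasterXSize, raster_ds.RasterYSize, values.mean())
```
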
GEOS

The Geometry Engine Open Source (GEOS) is the C/C++ port of a subset of the Java Topology Suite (JTS) and selected functions. GEOS aims to contain the complete functionality of JTS in C++. It can be compiled on many platforms, including for Python. As you will see later on, the Shapely library uses functions from the GEOS library. In fact, there are many applications using GEOS, including PostGIS and QGIS. GeoDjango also uses GEOS, as well as GDAL, among other geospatial libraries. GEOS can also be compiled with GDAL, giving OGR all of its capabilities.

The JTS is an open source geospatial computational geometry library written in Java. It provides various functionalities, including a geometry model, geometric functions, spatial structures and algorithms, and I/O capabilities. Using GEOS, you have access to the following capabilities: geospatial functions (such as within and contains), geospatial operations (union, intersection, and many more), spatial indexing, Open Geospatial Consortium (OGC) well-known text (WKT) and well-known binary (WKB) input/output, the C and C++ APIs, and thread safety.

Shapely

Shapely is a Python package for the manipulation and analysis of planar features, using functions from the GEOS library (the engine of PostGIS) and a port of the JTS. Shapely is not concerned with data formats or coordinate systems, but it can be readily integrated with packages that are. Shapely only deals with analyzing geometries and offers no capabilities for reading and writing geospatial files. It was developed by Sean Gillies, who was also the person behind Fiona and Rasterio.

Shapely supports eight fundamental geometry types that are implemented as classes in the shapely.geometry module: points, multipoints, linestrings, multilinestrings, linearrings, multipolygons, polygons, and geometrycollections. Apart from representing these geometries, Shapely can be used to manipulate and analyze geometries through a number of methods and attributes.

Shapely has largely the same classes and functions as OGR for dealing with geometries. The difference between Shapely and OGR is that Shapely has a more Pythonic and very intuitive interface, is better optimized, and has well-developed documentation. With Shapely, you're writing pure Python, whereas with GEOS, you're effectively writing C++ in Python. For data munging, a term used for data management and analysis, you're better off writing in pure Python rather than C++, which explains why these libraries were created.

For more information on Shapely, consult the documentation. That page also has detailed information on installing Shapely on different platforms and on building Shapely from source for compatibility with other modules that depend on GEOS; installing Shapely may require you to upgrade NumPy and GEOS if they are already installed.
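
Below is a small illustrative sketch of Shapely's geometry classes and GEOS-backed operations (predicates such as contains, operations such as intersection and buffer). The coordinates are invented for demonstration.

```python
# Sketch of Shapely geometry types and operations with made-up coordinates.
from shapely.geometry import Point, Polygon, LineString

road = LineString([(0, 0), (2, 2), (5, 2)])
park = Polygon([(1, 0), (4, 0), (4, 3), (1, 3)])
tower = Point(3, 1)

print(road.length)               # length of the line
print(park.contains(tower))      # geospatial predicate: True
print(road.intersection(park))   # geometric operation returning a new geometry
print(tower.buffer(1.5).area)    # area of a 1.5-unit buffer around the point
```
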
Fiona

Fiona is the API of OGR. It can be used for reading and writing data formats. The main reason for using it instead of OGR is that it is closer to Python than OGR, as well as more dependable and less error-prone. It makes use of two markup languages, WKT and WKB, for representing spatial information with regard to vector data. As such, it combines well with other Python libraries such as Shapely: you would use Fiona for input and output, and Shapely for creating and manipulating geospatial data.

While Fiona is Python compatible and our recommendation, users should also be aware of some of its disadvantages. It is more dependable than OGR because it uses Python objects for copying vector data instead of C pointers, but this also means that it uses more memory, which affects performance.

Python Shapefile Library (pyshp)

The Python Shapefile Library (pyshp) is a pure Python library used to read and write shapefiles. The pyshp library's sole purpose is to work with shapefiles, and it only uses the Python standard library. You cannot use it for geometric operations. If you're only working with shapefiles, this one-file-only library is simpler than using GDAL.

pyproj

pyproj is a Python package that performs cartographic transformations and geodetic computations. It is a Cython wrapper that provides Python interfaces to PROJ.4 functions, meaning you can access an existing library of C code from Python. PROJ.4 is a projection library that transforms data among many coordinate systems and is also available through GDAL and OGR. The reason that PROJ.4 is still popular and widely used is two-fold: firstly, because it supports so many different coordinate systems, and secondly, because of the routes it provides to them: Rasterio and GeoPandas, two Python libraries covered next, both use pyproj and thus PROJ.4 functionality under the hood. The difference between using PROJ.4 separately rather than through a package such as GDAL is that it enables you to re-project individual points, whereas packages using PROJ.4 do not offer this functionality.

The pyproj package offers two classes: the Proj class and the Geod class. The Proj class performs cartographic computations, while the Geod class performs geodetic computations.
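
The following sketch shows the two pyproj classes described above. The coordinates are illustrative (roughly Berlin and Paris), and the UTM zone is an arbitrary choice for the example.

```python
# Sketch of pyproj's Proj (cartographic) and Geod (geodetic) classes.
from pyproj import Proj, Geod

# Proj: project a lon/lat pair into UTM coordinates (zone chosen for Berlin)
utm = Proj(proj="utm", zone=33, ellps="WGS84")
x, y = utm(13.4050, 52.5200)         # longitude first, then latitude
print(x, y)

# Geod: geodesic distance between two points on the WGS84 ellipsoid
geod = Geod(ellps="WGS84")
_, _, distance_m = geod.inv(13.4050, 52.5200, 2.3522, 48.8566)  # Berlin -> Paris
print(distance_m / 1000)             # roughly 880 km
```
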
Rasterio

Rasterio is a GDAL- and NumPy-based Python library for raster data, written with the Python developer in mind rather than C, using Python language types, protocols, and idioms. Rasterio aims to make GIS data more accessible to Python programmers and helps GIS analysts learn important Python standards. Rasterio relies on concepts of Python rather than GIS.

Rasterio is an open source project from the satellite team of Mapbox, a provider of custom online maps for websites and applications. The name of this library should be pronounced raster-i-o rather than ras-te-rio. Rasterio came into being as a result of a project called the Mapbox Cloudless Atlas, which aimed to create a pretty-looking basemap from satellite imagery. One of the software requirements was to use open source software and a high-level language with handy multi-dimensional array syntax. Although GDAL offers proven algorithms and drivers, developing with GDAL's Python bindings feels a lot like C++. Therefore, Rasterio was designed to be a Python package at the top, with extension modules (using Cython) in the middle, and a GDAL shared library at the bottom. Other requirements for the raster library were the ability to read and write NumPy ndarrays to and from data files, and the use of Python types, protocols, and idioms instead of C or C++, to free programmers from having to code in two languages.

For georeferencing, Rasterio follows the lead of pyproj. There are a couple of capabilities added on top of reading and writing, one of them being a features module. Reprojection of geospatial data can be done with the rasterio.warp module. Rasterio's project homepage can be found on GitHub.

GeoPandas

GeoPandas is a Python library for working with vector data. It is based on the pandas library, which is part of the SciPy stack. SciPy is a popular library for data inspection and analysis, but unfortunately, it cannot read spatial data. GeoPandas was created to fill this gap, taking pandas data objects as a starting point. The library also adds functionality from geographical Python packages.

GeoPandas offers two data objects: a GeoSeries object, based on a pandas Series object, and a GeoDataFrame, based on a pandas DataFrame object but adding a geometry column for each row. Both GeoSeries and GeoDataFrame objects can be used for spatial data processing, similar to spatial databases. Read and write functionality is provided for almost every vector data format. Also, because both objects are subclasses of pandas data objects, you can use the same properties to select or subset data, for example .loc or .iloc.

GeoPandas is a library that employs the capabilities of newer tools, such as Jupyter Notebooks, particularly well. GDAL enables you to interact with data records inside vector and raster datasets through Python code, whereas GeoPandas takes a more visual approach by loading all records into a GeoDataFrame so that you can see them all together on your screen. The same goes for plotting data. These functionalities were lacking in Python 2, when developers were dependent on IDEs without the extensive data visualization capabilities that are now available with Jupyter Notebooks.

We've provided an overview of the most important open source packages for processing and analyzing geospatial data. The question then becomes when to use a certain package and why. GDAL, OGR, and GEOS are indispensable for geospatial processing and analysis, but they were not written in Python, so they require Python bindings for Python developers. Fiona, Shapely, and pyproj were written to solve these problems, as was the newer Rasterio library. For a more Pythonic approach, these newer packages are preferable to the older C++ packages with Python bindings (although they're used under the hood). Now that you have an idea of what options are available for a certain use case and why one package is preferable over another, here's something you should always remember: as is often the way in programming, there might be multiple solutions for one particular problem. For example, when dealing with shapefiles, you could use pyshp, GDAL, Shapely, or GeoPandas, depending on your preference and the problem at hand.

Read next:
Introduction to Data Analysis and Libraries
15 Useful Python Libraries to make your Data Science tasks Easier
"Pandas is an effective tool to explore and analyze data": An interview with Theodore Petrou
Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data

Alteryx vs. Tableau: Choosing the right data analytics tool for your business

Guest Contributor
04 Mar 2019
6 min read
Data visualization is commonly used in the modern world, where most business decisions are made by analyzing data. One of its most significant benefits is that it enables us to visually access huge amounts of data through easily understandable visuals. Data visualization is used in many areas, and popular tools include Tableau, Alteryx, Infogram, ChartBlocks, Datawrapper, Plotly, Visual.ly, and others. Tableau and Alteryx are industry-standard tools that have dominated the data analytics market for a few years now and are still going strong without any serious competition. In this article, we will look at the core differences between Alteryx and Tableau, which will help in deciding which tool to use for which purpose.

Tableau is one of the top-rated tools that helps analysts carry out business intelligence and data visualization activities. Using Tableau, users can generate compelling dashboards and stunning data visualizations. Tableau's interactive user interface helps users quickly generate reports where they can drill down into the information to a granular level.

Alteryx is a powerful tool widely used in data analytics that also provides meaningful insights to executive-level personnel. With its user-friendly interface, users can extract, transform, and load data within the Alteryx tool.

Why use Alteryx with Tableau?

Using Alteryx with Tableau is a powerful combination when it comes to making value-added, data-driven decisions. With Alteryx, businesses can manipulate their data and provide input to the Tableau platform, which in turn can showcase strong data visualizations. This helps businesses take appropriate actions that are backed up by data analysis. Alteryx and Tableau are widely used within organizations where decisions are made based on the insights obtained from data analysis. When it comes to data handling, Alteryx is a powerful ETL platform where data can be analyzed in different formats. When it comes to data representation, Tableau is a perfect match, and reports built in Tableau can be shared across team members.

Nowadays, most businesses want to see real-time data and understand business trends. The combination of Alteryx and Tableau allows data analysts to analyze data and generate meaningful insights for users on the fly. Data analysis is executed within Alteryx, where the raw data is handled, and the data representation or visualization is done in Tableau, so the two tools go hand in hand.

Tableau vs Alteryx

The list below contrasts the two tools point by point.

1. Alteryx is known as a smart data analytics platform; Tableau is known for its data visualization capabilities.
2. Alteryx can connect with different data sources and synthesize the raw data, supporting a standard ETL process; Tableau can connect with different data sources and provide data visualizations within minutes of gathering the data.
3. Alteryx helps with data analysis; Tableau helps with building appealing graphs.
4. Alteryx's GUI is adequate and widely accepted; Tableau's GUI is one of its best features, where graphs can be easily built using drag-and-drop options.
5. Alteryx requires technical knowledge, because it involves data source integration and data blending; Tableau does not, because the data arrives already polished and the user only has to build graphs and visualizations.
6. Once the data blending activity is completed in Alteryx, users can share a file that can be consumed by Tableau; once graphs are prepared in Tableau, reports can be easily shared among team members without any hassle.
7. Alteryx offers a lot of flexibility for data blending; Tableau offers flexibility for data visualization.
8. Alteryx lets users do spatial and predictive analysis; in Tableau this is possible only by representing the data in an appropriate format.
9. Alteryx is one of the best tools for data preparation; preparing data in Tableau is not feasible by comparison.
10. Data representation cannot be done accurately in Alteryx; Tableau is a wonderful tool for data representation.
11. Alteryx is licensed with annual fees; Tableau also has an option to pay monthly.
12. Alteryx has a drag-and-drop interface where the user can develop a workflow easily; Tableau has a drag-and-drop interface where the user can build a visualization in no time.

Alteryx and Tableau Integration

As discussed earlier, these two tools have their own advantages and disadvantages, but when integrated together they can do wonders with the data. The integration between Tableau and Alteryx makes the task of visualizing Alteryx's output quite simple. The data is first loaded into Alteryx and then extracted in the form of .tde files (Tableau Data Extract files). These .tde files are consumed by Tableau for the data visualization part. On a regular basis, new .tde files are generated from Alteryx and replace the old ones. Thus, by integrating Alteryx and Tableau, we can:

Cleanse, combine, and collect all the relevant data sources and enrich them with the help of third-party data, everything in one workflow.
Give analytical context to the data by providing predictive, location-based, and deep spatial analytics.
Publish the analytic workflows' results to Tableau for intuitive, rich visualizations that help in making decisions more quickly.

Tableau and Alteryx do not require any advanced skill set, as both tools have simple drag-and-drop interfaces. You can create a workflow in Alteryx that processes data sequentially; in a similar way, Tableau enables you to build charts by dragging the fields to be used into specified areas. Companies that have a lot of data to analyze, and that can spend significant amounts of money on analytics, can use these two tools together, and there are no significant challenges in integrating Tableau and Alteryx.

Conclusion

When Tableau and Alteryx are used together, businesses benefit because senior management can make decisions based on the data insights these tools provide. The two tools complement each other and provide a high-quality service to businesses.

Author Bio

Savaram Ravindra is a Senior Content Contributor at Mindmajix.com. His passion lies in writing articles on different niches, which include some of the most innovative and emerging software technologies, digital marketing, businesses, and so on. By being a guest blogger, he helps his company acquire quality traffic to its website and build its domain name and search engine authority. Before devoting his work full time to the writing profession, he was a programmer analyst at Cognizant Technology Solutions. Follow him on LinkedIn and Twitter.

Read next:
How to share insights using Alteryx Server
How to do data storytelling well with Tableau [Video]
A tale of two tools: Tableau and Power BI

What does a data science team look like?

Fatema Patrawala
21 Nov 2019
11 min read
Until a couple of years ago, people barely knew the term 'data science', which has now evolved into an extremely popular career field. The Harvard Business Review dubbed the data scientist role the sexiest job of the 21st century, and expert professionals jumped on the 'data is the new oil' bandwagon. As per the Figure Eight Report 2018, which takes the pulse of the data science community in the US, a lot has changed rapidly in the data science field over the years. For the 2018 report, they surveyed approximately 240 data scientists and found that machine learning projects have multiplied and that more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs, and the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, it is evident that data science teams are evolving across organizations.

The data science team is responsible for delivering complex projects where systems analysis, software engineering, data engineering, and data science are used to deliver the final solution. To achieve all of this, the team does not only have a data scientist or a data analyst but also includes other roles like business analyst, data engineer or architect, and chief data officer. In this post, we will differentiate and discuss the various job roles within a data science team, the skill sets required, and the compensation for each one of them. For an in-depth understanding of data science teams, read the book Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps.

Now let's get into understanding individual data science roles and functions, but before that, let's take a look at the structure of the team. There are three basic team structures to match different stages of AI/ML adoption.

IT-centric team structure

At times, hiring a data science team is not an option for companies, and they have to leverage in-house talent. In such situations, they take advantage of the fully functional in-house IT department. The IT team manages functions like data preparation, training models, creating user interfaces, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy them. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold. The documentation helps in fast training and early deployment of models even without expert data scientists on board.

Integrated team structure

Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resources is the most viable option for constant and scalable machine learning operations. Unlike the IT-centric approach, the integrated method requires having an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques. Additionally, the team leverages a deeper understanding of machine learning tools and libraries, like TensorFlow or Theano, which are aimed specifically at researchers and data science experts.

Specialized data science team

Companies can also have an independent data science department to build all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. It doesn't necessarily mean that all team members should have a data science background, but they should have a technology background with certain service management skills. A specialized structure model aids in addressing complex data science tasks that include research, the use of multiple ML models tailored to various aspects of decision-making, or multiple ML-backed services. Today's most successful Silicon Valley tech companies operate with specialized data science teams, custom-built and wired for specific tasks to achieve different business goals. For example, the team structure at Airbnb is one of the most interesting use cases. Martin Daniel, a data scientist at Airbnb, explains in this talk how the team emphasizes an experimentation-centric culture and applies machine learning rigorously to address unique product challenges.

Job roles and responsibilities within a data science team

As discussed earlier, there are many roles within a data science team. As per Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis: individuals involved in Type A work are statisticians who make sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, and so on. Type B stands for building: these individuals use data in production. They're good software engineers with strong programming knowledge and a statistics background, and they build recommendation systems, personalization use cases, and the like. It is rare that one expert fits neatly into a single category, but understanding these data science functions can help make sense of the roles described below.

Chief data officer / Chief analytics officer

The chief data officer (CDO) role has been taking organizations by storm. The NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization had appointed a chief data officer. The role of the chief data officer involves overseeing a range of data-related functions that may include data management, ensuring data quality, and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, they are often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and how a company should treat its data, which makes sense, as analytics provides insights and value to the data. Hence, with a CDO+CAO combination, companies can take advantage of a good data strategy and proper data management without losing quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually.

Skill sets required: Data science and analytics, programming skills, domain expertise, and leadership and visionary abilities.

Data analyst

The data analyst role implies proper data collection and interpretation activities. The person in this job role will ensure that collected data is relevant and exhaustive while also interpreting the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States.

Skill sets required: Programming languages like R, Python, JavaScript, C/C++, and SQL. In addition, critical thinking, data visualization, and presentation skills are good to have.

Data scientist

Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore what problems need to be solved. A data scientist is an individual who develops machine learning models to make predictions and is well versed in algorithm development and computer science. This person will also know the complete lifecycle of model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data and using various types of analytics and reporting tools to detect patterns, trends, and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709.

Skill sets required: A data scientist will require knowledge of big data platforms and tools like Seahorse (powered by Apache Spark), JupyterLab, TensorFlow, and MapReduce; programming languages that include SQL, Python, Scala, and Perl; and statistical computing languages such as R. They should also have cloud computing capabilities and knowledge of various cloud platforms like AWS and Microsoft Azure. You can also read this post on how to ace a data science interview to know more.

Machine learning engineer

At times a data scientist is confused with a machine learning engineer, but the machine learning engineer is a distinct role that involves different responsibilities. A machine learning engineer is someone who is responsible for combining software engineering and machine modeling skills. This person determines which model to use and what data should be used for each model. Probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer's salary is $146,085 in the US, and the role is ranked No. 1 on Indeed's Best Jobs in 2019 list.

Skill sets required: Machine learning engineers are required to have expertise in computer science and programming languages like R, Python, Scala, and Java. They also need probability techniques and data modelling and evaluation techniques.

Data architects and data engineers

The data architect and data engineer work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer can use to build the digital framework. The data engineering role has recently evolved from the traditional software-engineering field. Recent enterprise data management experiments indicate that data-focused software engineers are needed to work along with data architects to build a strong data architecture. The average salary for a data architect in the US ranges from $122,000 to $129,000 annually, as per a recent LinkedIn survey.

Skill sets required: A data architect or engineer should have a keen interest and experience in programming languages and frameworks like HTML5, RESTful services, Spark, Python, Hive, Kafka, and CSS. They should have the knowledge and experience to handle database technologies such as PostgreSQL, MapReduce, and MongoDB, and visualization platforms such as Tableau and Spotfire.

Business analyst

A business analyst (BA) basically handles the chief analytics officer's role but at the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap. They are responsible for using data analytics to assess processes, determine requirements, and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes will be implemented to processes, products, services, software, and hardware. They further articulate these ideas and balance them against what is technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed.

Skill sets required: Excellent domain and industry expertise. In addition, good communication and data visualization skills and knowledge of business intelligence tools are good to have.

Data visualization engineer

This specific role is not present in every data science team, as some of the responsibilities are covered by either a data analyst or a data architect. Hence, this role is only necessary for a specialized data science model. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for stakeholders. Regardless of the technology, successful data visualization engineers have to understand the principles of design, both graphical and, more generally, user-centered design. As per PayScale, the average salary for a data visualization engineer is $98,264.

Skill sets required: A data visualization engineer needs rigorous knowledge of data visualization methods and must be able to produce various charts and graphs to represent data. Additionally, they must understand the fundamentals of design principles and the visual display of information.

To sum it up, the data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building the team from scratch and find it hard to figure out where to start. If you are facing a similar dilemma, check out the book Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing, and sustaining a successful data science team.

Read next:
How to learn data science: from data mining to machine learning
How to ace a data science interview
Data science vs. machine learning: understanding the difference and what it means today
30 common data science terms explained
9 Data Science Myths Debunked

How everyone at Netflix uses Jupyter notebooks from data scientists, machine learning engineers, to data analysts

Bhagyashree R
18 Aug 2018
4 min read
Netflix uses a variety of tools to do data analysis. One of the big ways that data scientists and engineers at Netflix interact with their data is through Jupyter notebooks. In addition to providing execution environments to users, Netflix invests in various parts of the Jupyter ecosystem and tooling. They are “reimagining what a notebook can be, who can use it, and what they can do with it.”

Netflix aims to provide personalized content to its 130 million viewers, and to do this, more than 1 trillion events are written into a streaming ingestion pipeline every day. To support this, they've built an industry-leading data platform which is flexible, powerful, and complex. There are many diverse users of this platform, such as analytics engineers, data engineers, and data scientists, requiring different sets of tools and languages. To help the platform scale, they wanted to minimize the number of tools, and the solution was the open-source tool: Jupyter notebooks.

Why are Jupyter notebooks so compelling for Netflix?

These are the notebook capabilities that benefit Netflix's data scientists and engineers:

Standard messaging API: The Jupyter protocol provides a standard messaging API to the kernels that act as computational engines. It separates where the content is written from where the content is executed, which makes it language agnostic.
Editable file format: It provides an editable file format that stores the code and results together.
Web-based UI: It is web-based, which helps with interactively writing and running code as well as visualizing outputs.

How does Netflix use Jupyter notebooks?

The following are some of the use cases Netflix uses Jupyter notebooks for:

Data access: Notebooks were first introduced for workflows, and their adoption grew among the data scientists. Seeing this, Netflix decided to leverage their versatility and architecture for general data access. Notebooks provide a user-friendly interface for interactively running code, exploring the outputs, and visualizing data, all from a single cloud-based development environment.

Notebook templates: They introduced parameterized notebooks, which allow the use of parameters in the code and take values as input at runtime; a short sketch of the pattern follows this list. These templates help:
Data scientists to run an experiment with different coefficients and summarize the results
Data engineers to execute data quality audits
Data analysts to share prepared queries and visualizations
Software engineers to email the results of a troubleshooting script
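
Netflix's internal template tooling is not spelled out in this excerpt, but the open-source papermill library implements the same parameterized-notebook pattern. The sketch below is illustrative only; the notebook names and parameters are made up.

```python
# Sketch of parameterized notebook execution with the open-source papermill
# library; the notebook names and parameter values here are hypothetical.
import papermill as pm

pm.execute_notebook(
    "region_report_template.ipynb",    # template notebook with a "parameters" cell
    "region_report_us_2018-08.ipynb",  # executed copy written out with the results
    parameters={"region": "US", "start_date": "2018-08-01", "days": 30},
)
```
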
Scheduling notebooks: Next, they are using notebooks to create a unifying layer for scheduling workflows. Notebooks are used for interactive work and allow a smooth move to scheduling that work to run recurrently. Many users create an entire workflow in a notebook and just copy/paste it into separate files for scheduling when they're ready to deploy it.

Notebook infrastructure: The three fundamental components of the infrastructure are storage, compute, and interface. (The original post includes an architecture diagram; source: Netflix Tech Blog.)

Storage: The Netflix Data Platform is made up of Amazon S3 and EFS for cloud storage, which notebooks treat as virtual filesystems. Each user has a home directory on EFS containing a personal workspace for notebooks. This workspace stores any notebook created or uploaded by a user. When a user launches a notebook interactively, all the reading and writing happens in the workspace.

Compute: All the jobs on the data platform run in containers, including queries, pipelines, and notebooks. A container with reasonable default resources is provisioned when a user launches a notebook, and users can request more resources if they find that the provided resources are not enough. A unified execution environment with a prepared container image is provided, which has common libraries and an array of default kernels preinstalled. The orchestration and environments are managed with Titus, Netflix's container management platform.

Interface: They are using nteract, a React-based frontend for Jupyter notebooks, which emphasizes simplicity and composability as core design principles. They're also introducing native support for parameterization, which makes it easier to schedule notebooks and create reusable templates.

Netflix is planning to make investments in both the frontend and backend to improve the overall notebook experience. This year they are also sponsoring JupyterCon. To read more about how Jupyter is offering value to Netflix, read Netflix's original post on Medium.

Read next:
10 reasons why data scientists love Jupyter notebooks
What's new in Jupyter Notebook 5.3.0
Netflix open sources Zuul 2 cloud gateway

8 ways to improve your data visualizations

Natasha Mathur
04 Jul 2018
7 min read
In Dr. W. Edwards Deming's words, “In God we trust, all others must bring data.” Organizations worldwide revolve around data like planets revolve around the sun. Since data is so central to organizations, there are data visualization tools that help them understand data to make better business decisions. A lot more data is getting churned out and collected by organizations than ever before. So, how do we make sense of all this data? Humans are visual creatures, and the human brain processes visual information far better than textual information. In fact, presentations that use visual aids such as colors, shapes, and images are found to be far more persuasive, according to research done by the University of Minnesota back in 1986.

Data visualization is a process that easily translates collected information into engaging visuals. It's easy, cheap, and doesn't require any design expertise to create data visuals. However, some professionals feel that data visualization is just about slapping on charts and graphs, when that's not actually the case. Data visualization is about conveying the right information in a way that enhances the audience's experience. So, if you want your graphs and charts to be more succinct and understandable, here are eight ways to improve your data visualization process.

1. Get rid of unneeded information

Less is more in some cases, and the same goes for data visualization. Excessive colors, jargon, pie charts, and metrics take the focus away from the important information. For instance, when using colors, don't make your charts and graphs a rainbow; instead, use a specific set of colors with a clear purpose and meaning. (The original article illustrates the difference with before-and-after charts; source: Podio.) Similarly, when it comes to expressing your data, note how people interact at your workplace and keep the tone of your visuals as natural as possible to make it easy for the audience to interpret your data. For metrics, only show the ones that truly bring value to your storytelling, and filter out the ones that are not so important to create less fuss. Tread cautiously while using pie charts, as they can sometimes be difficult to understand, and get rid of the elements on a chart that cause unnecessary confusion. (Source: Dashboard Zone.)

2. Use conditional formatting for tabular data

Data visualization doesn't need fancy tools or designs; take your standard Excel table, for example. Do you want to point out patterns or outliers in your data? Conditional formatting is a great tool for people working with data. It involves making simple rules on given data, and once that's done, it highlights only the data that matters the most to you. This helps you quickly track the main information. Conditional formatting can be used for different things; for example, it can help spot duplicate data in your table. You set bounds for the data using the built-in conditional formatting, and it then formats the cells based on those bounds, highlighting the data you want. For instance, if a sales quota of over 65% is good, between 55% and 65% is average, and below 55% is poor, then with conditional formatting you can quickly find out who is meeting the expected sales quota and who is not.
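
As a rough analogue of these Excel rules in Python, the sketch below applies the same good/average/poor thresholds to a small, made-up table using pandas' Styler; the colour codes and output file name are arbitrary choices.

```python
# Sketch of rule-based highlighting with pandas, mirroring the Excel-style
# conditional-formatting rules described above; the quota data is invented.
import pandas as pd

reps = pd.DataFrame({"rep": ["Ana", "Ben", "Chloe", "Dev"],
                     "quota_pct": [72, 61, 48, 66]})

def colour_quota(value):
    # rule: good > 65%, average 55-65%, poor < 55%
    if value > 65:
        return "background-color: #c6efce"   # green
    if value >= 55:
        return "background-color: #ffeb9c"   # amber
    return "background-color: #ffc7ce"       # red

styled = reps.style.applymap(colour_quota, subset=["quota_pct"])
with open("quota_report.html", "w") as f:    # render the highlighted table
    f.write(styled.to_html())
```
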
3. Add trendlines to unearth patterns for prediction

Another feature that can amp up your data visualization is trendlines. They show the relationship between two variables in your existing data and are also useful for predicting future values. Trendlines are simple to add and help discover trends in the given data set. (Source: Interworks.) They can also show data trends or moving averages in your charts. Depending on the kind of data you're working with, there are a number of trendline types you can use in your visualizations. Questions like whether a new strategy seems to be working in favor of the organization can be answered with the help of trendlines, and this insight, in turn, helps predict new outcomes for the future. Statistical models are used in trendlines to make predictions. Once you add trendlines to a view, it's up to you to decide how you want them to look and behave.
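
A trendline of this kind can be sketched in Python with NumPy and Matplotlib; the monthly sales figures below are invented purely for illustration.

```python
# Sketch of adding a simple linear trendline to a scatter plot (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(1, 13)
sales = np.array([210, 225, 218, 240, 255, 248, 262, 270, 268, 285, 290, 301])

slope, intercept = np.polyfit(months, sales, 1)   # fit a degree-1 trendline
trend = slope * months + intercept

plt.scatter(months, sales, label="monthly sales")
plt.plot(months, trend, label=f"trend: {slope:.1f} per month")
plt.legend()
plt.savefig("sales_trend.png")
```
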
4. Implement filter by rule to get more specific

Filters display just the information that you need. Using filter by rule, you can add a filter option to your dataset. Organizations produce huge amounts of data on a regular basis. Suppose you want to know which employees within your organization are consistent performers: instead of creating a visualization that includes all the employees and their performance, you can filter it down so that it shows only the employees who are always doing well. Similarly, if you want to find out on which days sales went up or down, you can filter the view to show results for only the past week or month, depending on your preference.

5. For complex or dense data representation, add hierarchy

Hierarchies eliminate the need to create extra visualizations. You can view data from a high level and dig deeper into the specifics of the data as you come up with questions based on it. Adding a hierarchy to the data helps combine multiple pieces of information in one visualization. (Source: DZone.) For instance, suppose you create a hierarchy that shows the total sales achieved by different sales representatives within an organization in the past month. You can break this down further by selecting a particular sales rep, and then go even further by selecting a specific product assigned to that sales rep. This cuts down on a lot of extra work.

6. Make visuals more appealing by formatting data

Data formatting takes only a few seconds, but it can make a huge difference when it comes to the audience interpreting your data. (Source: DZone.) It makes the numbers appear more visually appealing and easier to read, and it can be used for charts such as bar charts and column charts. Formatting data to show a certain number of decimals, comma separators, number fonts, currency, or percentages can make your visualization process more engaging.

7. Include comparison for more insight

Comparisons give readers a better perspective on the data, and they can both improve and add insight to your visualizations. For instance, if you want to inform your audience about the organization's growth in the current as well as the past year, you can include the comparison within the visualization. You can also use a comparison chart to compare two data points, such as budget versus actual spend.

8. Sort data to improve readability

Sorting data is another great way to make things easy for the audience when dealing with huge quantities of data. For instance, if you want to include information about the highest and lowest performing products, you can sort your data. Sorting can be done in the following ways:
Ascending - sorts the data from lowest to highest.
Descending - sorts the data from highest to lowest.
Data source order - sorts the data in the order it appears in the data source.
Alphabetic - sorts the data alphabetically.
Manual - lets you sort the data manually in the order you prefer.

Effective data visualization helps people interpret information in data that could not be seen before, to change their minds and prompt action. These were some of the tricks and features to take your data visualization game to the next level. There are different data visualization tools available in the market to choose from; Tableau and Microsoft Power BI are among the top ones that offer great features for data visualization. So, now that we've got you covered with some of the best practices for data visualization, it's your turn to put these tips into practice and create some strong visual data stories. Do you have any DataViz tips to share with our readers? Please add them in the comments below.

Read next:
Getting started with Data Visualization in Tableau
What is Seaborn and why should you use it for data visualization?
"Tableau is the most powerful and secure end-to-end analytics platform": An interview with Joshua Milligan

5 best practices to perform data wrangling with Python

Savia Lobo
18 Oct 2018
5 min read
Data wrangling is the process of cleaning and structuring complex data sets for easy analysis and speedy decision-making. Due to the internet explosion and the huge trove of IoT devices, there is a massive availability of data at present. However, this data is most often in its raw form and includes a lot of noise in the form of unnecessary data, broken data, and so on. Cleaning up this data is essential in order for organizations to use it for analysis. Data wrangling plays a very important role here by cleaning this data and making it fit for analysis. Python also has built-in features for applying wrangling methods to various data sets to achieve an analytical goal. Here are five best practices that will help you in your data wrangling journey with Python; by the end, you'll have clean, ready-to-use data for your business needs.

1. Learn the data structures in Python really well

Designed to be a very high-level language, Python offers an array of amazing data structures with great built-in methods. Having a solid grasp of all their capabilities will be a potent weapon in your repertoire for handling data wrangling tasks. For example, a dictionary in Python can act almost like a mini in-memory database with key-value pairs, supporting extremely fast retrieval and search by utilizing a hash table underneath. Explore other built-in libraries related to these data structures, e.g. ordered dict and the string library for advanced functions. Build your own versions of essential data structures like stacks, queues, heaps, and trees using classes and basic structures, and keep them handy for quick data retrieval and traversal.

2. Learn and practice file and OS handling in Python

How to open and manipulate files
How to manipulate and navigate directory structures

3. Have a solid understanding of the core data types and capabilities of NumPy and pandas

Learn how to create, access, sort, and search a NumPy array. Always think about whether you can replace a conventional list traversal (a for loop) with a vectorized operation; this will increase the speed of your data operations. Explore special file types like .npy (NumPy's native storage) to read large data sets with much higher speed than a usual list. Know in detail all the file types you can read using built-in pandas methods; this will greatly simplify your data scraping. Almost all of these methods have great data cleaning and other checks built in, so try to use such optimized routines instead of writing your own to speed up the process.
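
As a small illustration of the points above, the following sketch (with synthetic data) contrasts a Python loop with a vectorized NumPy expression and round-trips the result through the .npy format.

```python
# Sketch: loop vs. vectorized NumPy operation, plus .npy storage (synthetic data).
import numpy as np

values = np.random.default_rng(42).normal(size=1_000_000)

# loop version (slow): scale and shift each element one at a time
scaled_loop = np.empty_like(values)
for i in range(len(values)):
    scaled_loop[i] = values[i] * 2.5 + 10

# vectorized version (fast): one expression over the whole array
scaled_vec = values * 2.5 + 10

assert np.allclose(scaled_loop, scaled_vec)

np.save("scaled.npy", scaled_vec)      # NumPy's fast native storage format
reloaded = np.load("scaled.npy")
print(reloaded.mean())
```
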
4. Build a good understanding of basic statistical tests and a panache for visualization

Running some standard statistical tests can quickly give you an idea about the quality of the data you need to wrangle. Plot data often, even if it is multi-dimensional. Do not try to create fancy 3D plots; learn to explore a simple set of pairwise scatter plots. Use boxplots often to see the spread and range of the data and to detect outliers. For time-series data, learn the basic concepts of ARIMA modeling to check the sanity of the data.

5. Apart from Python, if you want to master one more language, go for SQL

As a data engineer, you will inevitably run across situations where you have to read from a large, conventional database store. Even if you use a Python interface to access such a database, it is always a good idea to know the basic concepts of database management and relational algebra. This knowledge will help you build on it later and move into the world of Big Data and massive data mining (technologies like Hadoop/Pig/Hive/Impala) more easily, and your basic data wrangling knowledge will surely help you deal with such scenarios.

Although data wrangling may be the most time-consuming process, it is the most important part of data management. Data collected by businesses on a daily basis helps them make decisions on the latest information available. It also allows businesses to find hidden insights, use them in decision-making processes, and gain new analytic initiatives, improved reporting efficiency, and much more.

About the authors

Dr. Tirthajyoti Sarkar works in the San Francisco Bay Area as a senior semiconductor technologist, where he designs state-of-the-art power management products and applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He has 15+ years of R&D experience and is a senior member of IEEE.

Shubhadeep Roychowdhury works as a Sr. Software Engineer at a Paris-based cyber security startup, where he applies state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products.

Read next:
Data cleaning is the worst part of data analysis, say data scientists
Python, Tensorflow, Excel and more – Data professionals reveal their top tools
Manipulating text data using Python Regular Expressions (regex)

30 common data science terms explained

Aarthi Kumaraswamy
16 May 2018
27 min read
Let's begin at the beginning. What do terms like statistical population, statistical comparison, and statistical inference mean? What good are munging, coding, booting, regularization, and so on? On a scale of 1 to 30 (1 being the lowest and 30 the highest), rate yourself as a data scientist. No matter what you have scored yourself, we hope to have improved that score at least a little by the end of this post. Let's start with a basic question: what is data science?

Note: The following is an excerpt from the book Statistics for Data Science, written by James D. Miller and published by Packt Publishing.

How data science is defined is a matter of opinion. I personally like the explanation that data science is a progression or, even better, an evolution of thought or steps (illustrated by a figure in the book, not reproduced here). Although a progression or evolution implies a sequential journey, in practice this is an extremely fluid process; each of the phases may inspire the data scientist to reverse and repeat one or more of the phases until they are satisfied. In other words, all or some phases of the process may be repeated until the data scientist determines that the desired outcome is reached. Depending on your sources and individual beliefs, you may even say the following: statistics is data science, and data science is statistics.

Based upon personal experience, research, and various industry experts' advice, someone delving into the art of data science should take every opportunity to understand and gain experience as well as proficiency with the following list of common data science terms:

Statistical population
Probability
False positives
Statistical inference
Regression
Fitting
Categorical data
Classification
Clustering
Statistical comparison
Coding
Distributions
Data mining
Decision trees
Machine learning
Munging and wrangling
Visualization
D3
Regularization
Assessment
Cross-validation
Neural networks
Boosting
Lift
Mode
Outlier
Predictive modeling
Big data
Confidence interval
Writing

Statistical population

You can perhaps think of a statistical population as a recordset (or a set of records). This set or group of records will be of similar items or events that are of interest to the data scientist for some experiment. For a data developer, a population of data may be a recordset of all sales transactions for a month, and the interest might be reporting to the senior management of an organization which products are the fastest sellers and at which time of the year. For a data scientist, a population may be a recordset of all emergency room admissions during a month, and the area of interest might be to determine the statistical demographics for emergency room use.

Note: Typically, the terms statistical population and statistical model are or can be used interchangeably. Once again, data scientists continue to evolve in their alignment on the use of common terms.

Another key point concerning statistical populations is that the recordset may be a group of (actually) existing objects or a hypothetical group of objects. Using the preceding example, you might draw a comparison of actual objects as those actual sales transactions recorded for the month, while the hypothetical objects are sales transactions that are expected, forecast, or presumed (based upon observations or experienced assumptions or other logic) to occur during a month.

Finally, through the use of statistical inference, the data scientist can select a portion or subset of the recordset (or population) with the intention that it will represent the total population for a particular area of interest. This subset is known as a statistical sample. If a sample of a population is chosen accurately, characteristics of the entire population (that the sample is drawn from) can be estimated from the corresponding characteristics of the sample.
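
As a minimal illustration (with synthetic transaction data), drawing a statistical sample from a population recordset might look like this in pandas:

```python
# Sketch of drawing a statistical sample from a "population" recordset;
# the transaction data here is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
population = pd.DataFrame({
    "transaction_id": np.arange(10_000),
    "amount": rng.gamma(shape=2.0, scale=50.0, size=10_000),
})

sample = population.sample(n=500, random_state=0)   # the statistical sample

# if the sample is representative, its characteristics estimate the population's
print(population["amount"].mean(), sample["amount"].mean())
```
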
Finally, through the use of statistical inference, the data scientist can select a portion or subset of the recordset (or population) with the intention that it will represent the total population for a particular area of interest. This subset is known as a statistical sample. If a sample of a population is chosen accurately, characteristics of the entire population (that the sample is drawn from) can be estimated from the corresponding characteristics of the sample. Probability Probability is concerned with the laws governing random events.                                           -www.britannica.com When thinking of probability, you think of possible upcoming events and the likelihood of them actually occurring. This compares to a statistical thought process that involves analyzing the frequency of past events in an attempt to explain or make sense of the observations. In addition, the data scientist will associate various individual events, studying the relationship of these events. How these different events relate to each other governs the methods and rules that will need to be followed when we're studying their probabilities. [box type="note" align="" class="" width=""]A probability distribution is a table that is used to show the probabilities of various outcomes in a sample population or recordset. [/box] False positives The idea of false positives is a very important statistical (data science) concept. A false positive is a mistake or an errored result. That is, it is a scenario where the results of a process or experiment indicate a fulfilled or true condition when, in fact, the condition is not true (not fulfilled). This situation is also referred to by some data scientists as a false alarm and is most easily understood by considering the idea of a recordset or statistical population (which we discussed earlier in this section) that is determined not only by the accuracy of the processing but by the characteristics of the sampled population. In other words, the data scientist has made errors during the statistical process, or the recordset is a population that does not have an appropriate sample (or characteristics) for what is being investigated. Statistical inference What developer at some point in his or her career, had to create a sample or test data? For example, I've often created a simple script to generate a random number (based upon the number of possible options or choices) and then used that number as the selected option (in my test recordset). This might work well for data development, but with statistics and data science, this is not sufficient. To create sample data (or a sample population), the data scientist will use a process called statistical inference, which is the process of deducing options of an underlying distribution through analysis of the data you have or are trying to generate for. The process is sometimes called inferential statistical analysis and includes testing various hypotheses and deriving estimates. When the data scientist determines that a recordset (or population) should be larger than it actually is, it is assumed that the recordset is a sample from a larger population, and the data scientist will then utilize statistical inference to make up the difference. [box type="note" align="" class="" width=""]The data or recordset in use is referred to by the data scientist as the observed data. 
Inferential statistics can be contrasted with descriptive statistics, which is only concerned with the properties of the observed data and does not assume that the recordset came from a larger population. [/box] Regression Regression is a process or method (selected by the data scientist as the best fit technique for the experiment at hand) used for determining the relationships among variables. If you're a programmer, you have a certain understanding of what a variable is, but in statistics, we use the term differently. Variables are determined to be either dependent or independent. An independent variable (also known as a predictor) is the one that is manipulated by the data scientist in an effort to determine its relationship with a dependent variable. A dependent variable is a variable that the data scientist is measuring. [box type="note" align="" class="" width=""]It is not uncommon to have more than one independent variable in a data science progression or experiment. [/box] More precisely, regression is the process that helps the data scientist comprehend how the typical value of the dependent variable (or criterion variable) changes when any one or more of the independent variables is varied while the other independent variables are held fixed. Fitting Fitting is the process of measuring how well a statistical model or process describes a data scientist's observations pertaining to a recordset or experiment. These measures will attempt to point out the discrepancy between observed values and probable values. The probable values of a model or process are known as a distribution or a probability distribution. Therefore, a probability distribution fitting (or distribution fitting) is when the data scientist fits a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. The object of a data scientist performing a distribution fitting is to predict the probability or to forecast the frequency of, the occurrence of the phenomenon at a certain interval. [box type="note" align="" class="" width=""]One of the most common uses of fitting is to test whether two samples are drawn from identical distributions.[/box] There are numerous probability distributions a data scientist can select from. Some will fit better to the observed frequency of the data than others will. The distribution giving a close fit is supposed to lead to good predictions; therefore, the data scientist needs to select a distribution that suits the data well. Categorical data Earlier, we explained how variables in your data can be either independent or dependent. Another type of variable definition is a categorical variable. This type of variable is one that can take on one of a limited, and typically fixed, number of possible values, thus assigning each individual to a particular category. Often, the collected data's meaning is unclear. Categorical data is a method that a data scientist can use to put meaning to the data. For example, if a numeric variable is collected (let's say the values found are 4, 10, and 12), the meaning of the variable becomes clear if the values are categorized. Let's suppose that based upon an analysis of how the data was collected, we can group (or categorize) the data by indicating that this data describes university students, and there is the following number of players: 4 tennis players 10 soccer players 12 football players Now, because we grouped the data into categories, the meaning becomes clear. 
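To make the grouping idea concrete, here is a minimal sketch using pandas (the library choice is our own for illustration; the article prescribes no particular tool) that encodes the hypothetical sports data above as a categorical variable and counts each category:

```python
import pandas as pd

# Hypothetical survey of 26 university students and the sport they play
players = ["tennis"] * 4 + ["soccer"] * 10 + ["football"] * 12

# Storing the values as a pandas Categorical makes the grouping explicit
sport = pd.Series(players, dtype="category")

print(sport.cat.categories)   # the categories found in the data
print(sport.value_counts())   # football 12, soccer 10, tennis 4
```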
Some other examples of categorized data might be individual pet preferences (grouped by the type of pet), or vehicle ownership (grouped by the style of a car owned), and so on. So, categorical data, as the name suggests, is data grouped into some sort of category or multiple categories. Some data scientists refer to categories as sub-populations of data. [box type="note" align="" class="" width=""]Categorical data can also be data that is collected as a yes or no answer. For example, hospital admittance data may indicate that patients either smoke or do not smoke. [/box] Classification Statistical classification of data is the process of identifying which category (discussed in the previous section) a data point, observation, or variable should be grouped into. The data science process that carries out a classification process is known as a classifier. Read this post: Classification using Convolutional Neural Networks [box type="note" align="" class="" width=""]Determining whether a book is fiction or non-fiction is a simple example classification. An analysis of data about restaurants might lead to the classification of them among several genres. [/box] Clustering Clustering is the process of dividing up the data occurrences into groups or homogeneous subsets of the dataset, not a predetermined set of groups as in classification (described in the preceding section) but groups identified by the execution of the data science process based upon similarities that it found among the occurrences. Objects in the same group (a group is also referred to as a cluster) are found to be more analogous (in some sense or another) to each other than to those objects found in other groups (or found in other clusters). The process of clustering is found to be very common in exploratory data mining and is also a common technique for statistical data analysis. Statistical comparison Simply put, when you hear the term statistical comparison, one is usually referring to the act of a data scientist performing a process of analysis to view the similarities or variances of two or more groups or populations (or recordsets). As a data developer, one might be familiar with various utilities such as FC Compare, UltraCompare, or WinDiff, which aim to provide the developer with a line-by-line comparison of the contents of two or more (even binary) files. In statistics (data science), this process of comparing is a statistical technique to compare populations or recordsets. In this method, a data scientist will conduct what is called an Analysis of Variance (ANOVA), compare categorical variables (within the recordsets), and so on. [box type="note" align="" class="" width=""]ANOVA is an assortment of statistical methods that are used to analyze the differences among group means and their associated procedures (such as variations among and between groups, populations, or recordsets). This method eventually evolved into the Six Sigma dataset comparisons. [/box] Coding Coding or statistical coding is again a process that a data scientist will use to prepare data for analysis. In this process, both quantitative data values (such as income or years of education) and qualitative data (such as race or gender) are categorized or coded in a consistent way. 
Coding is performed by a data scientist for various reasons such as follows: More effective for running statistical models Computers understand the variables Accountability--so the data scientist can run models blind, or without knowing what variables stand for, to reduce programming/author bias [box type="shadow" align="" class="" width=""]You can imagine the process of coding as the means to transform data into a form required for a system or application. [/box] Distributions The distribution of a statistical recordset (or of a population) is a visualization showing all the possible values (or sometimes referred to as intervals) of the data and how often they occur. When a distribution of categorical data (which we defined earlier in this chapter) is created by a data scientist, it attempts to show the number or percentage of individuals in each group or category. Linking an earlier defined term with this one, a probability distribution, stated in simple terms, can be thought of as a visualization showing the probability of occurrence of different possible outcomes in an experiment. Data mining With data mining, one is usually more absorbed in the data relationships (or the potential relationships between points of data, sometimes referred to as variables) and cognitive analysis. To further define this term, we can say that data mining is sometimes more simply referred to as knowledge discovery or even just discovery, based upon processing through or analyzing data from new or different viewpoints and summarizing it into valuable insights that can be used to increase revenue, cuts costs, or both. Using software dedicated to data mining is just one of several analytical approaches to data mining. Although there are tools dedicated to this purpose (such as IBM Cognos BI and Planning Analytics, Tableau, SAS, and so on.), data mining is all about the analysis process finding correlations or patterns among dozens of fields in the data and that can be effectively accomplished using tools such as MS Excel or any number of open source technologies. [box type="note" align="" class="" width=""]A common technique to data mining is through the creation of custom scripts using tools such as R or Python. In this way, the data scientist has the ability to customize the logic and processing to their exact project needs. [/box] Decision trees A statistical decision tree uses a diagram that looks like a tree. This structure attempts to represent optional decision paths and a predicted outcome for each path selected. A data scientist will use a decision tree to support, track, and model decision making and their possible consequences, including chance event outcomes, resource costs, and utility. It is a common way to display the logic of a data science process. Machine learning Machine learning is one of the most intriguing and exciting areas of data science. It conjures all forms of images around artificial intelligence which includes Neural Networks, Support Vector Machines (SVMs), and so on. Fundamentally, we can describe the term machine learning as a method of training a computer to make or improve predictions or behaviors based on data or, specifically, relationships within that data. Continuing, machine learning is a process by which predictions are made based upon recognized patterns identified within data, and additionally, it is the ability to continuously learn from the data's patterns, therefore continuingly making better predictions. 
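As a rough illustration of both ideas just described - a decision tree and the train-then-predict cycle of machine learning - here is a minimal, hedged sketch using scikit-learn; the bundled iris dataset simply stands in for whatever recordset you happen to be working with:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, well-known sample dataset stands in for a real recordset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The decision tree learns simple branching rules from the training records...
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# ...and those learned rules are then used to predict previously unseen records
print("held-out accuracy:", model.score(X_test, y_test))
```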
It is not uncommon for someone to mistake the process of machine learning for data mining, but data mining focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can be used to learn and establish baseline behavioral profiles for various entities and then to find meaningful anomalies. Here is the exciting part: the process of machine learning (using data relationships to make predictions) is known as predictive analytics. Predictive analytics allow the data scientists to produce reliable, repeatable decisions and results and uncover hidden insights through learning from historical relationships and trends in the data. Munging and wrangling The terms munging and wrangling are buzzwords or jargon meant to describe one's efforts to affect the format of data, recordset, or file in some way in an effort to prepare the data for continued or otherwise processing and/or evaluations. With data development, you are most likely familiar with the idea of Extract, Transform, and Load (ETL). In somewhat the same way, a data developer may mung or wrangle data during the transformation steps within an ETL process. Common munging and wrangling may include removing punctuation or HTML tags, data parsing, filtering, all sorts of transforming, mapping, and tying together systems and interfaces that were not specifically designed to interoperate. Munging can also describe the processing or filtering of raw data into another form, allowing for more convenient consumption of the data elsewhere. Munging and wrangling might be performed multiple times within a data science process and/or at different steps in the evolving process. Sometimes, data scientists use munging to include various data visualization, data aggregation, training a statistical model, as well as much other potential work. To this point, munging and wrangling may follow a flow beginning with extracting the data in a raw form, performing the munging using various logic, and lastly, placing the resulting content into a structure for use. Although there are many valid options for munging and wrangling data, preprocessing and manipulation, a tool that is popular with many data scientists today is a product named Trifecta, which claims that it is the number one (data) wrangling solution in many industries. [box type="note" align="" class="" width=""]Trifecta can be downloaded for your personal evaluation from https://www.trifacta.com/. Check it out! [/box] Visualization The main point (although there are other goals and objectives) when leveraging a data visualization technique is to make something complex appear simple. You can think of visualization as any technique for creating a graphic (or similar) to communicate a message. Other motives for using data visualization include the following: To explain the data or put the data in context (which is to highlight demographic statistics) To solve a specific problem (for example, identifying problem areas within a particular business model) To explore the data to reach a better understanding or add clarity (such as what periods of time do this data span?) 
To highlight or illustrate otherwise invisible data (such as isolating outliers residing in the data) To predict, such as potential sales volumes (perhaps based upon seasonality sales statistics) And others Statistical visualization is used in almost every step in the data science process, within the obvious steps such as exploring and visualizing, analyzing and learning, but can also be leveraged during collecting, processing, and the end game of using the identified insights. D3 D3 or D3.js, is essentially an open source JavaScript library designed with the intention of visualizing data using today's web standards. D3 helps put life into your data, utilizing Scalable Vector Graphics (SVG), Canvas, and standard HTML. D3 combines powerful visualization and interaction techniques with a data-driven approach to DOM manipulation, providing data scientists with the full capabilities of modern browsers and the freedom to design the right visual interface that best depicts the objective or assumption. In contrast to many other libraries, D3.js allows inordinate control over the visualization of data. D3 is embedded within an HTML webpage and uses pre-built JavaScript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects, and so on. Regularization Regularization is one possible approach that a data scientist may use for improving the results generated from a statistical model or data science process, such as when addressing a case of overfitting in statistics and data science. [box type="note" align="" class="" width=""]We defined fitting earlier (fitting describes how well a statistical model or process describes a data scientist's observations). Overfitting is a scenario where a statistical model or process seems to fit too well or appears to be too close to the actual data.[/box] Overfitting usually occurs with an overly simple model. This means that you may have only two variables and are drawing conclusions based on the two. For example, using our previously mentioned example of daffodil sales, one might generate a model with temperature as an independent variable and sales as a dependent one. You may see the model fail since it is not as simple as concluding that warmer temperatures will always generate more sales. In this example, there is a tendency to add more data to the process or model in hopes of achieving a better result. The idea sounds reasonable. For example, you have information such as average rainfall, pollen count, fertilizer sales, and so on; could these data points be added as explanatory variables? [box type="note" align="" class="" width=""]An explanatory variable is a type of independent variable with a subtle difference. When a variable is independent, it is not affected at all by any other variables. When a variable isn't independent for certain, it's an explanatory variable. [/box] Continuing to add more and more data to your model will have an effect but will probably cause overfitting, resulting in poor predictions since it will closely resemble the data, which is mostly just background noise. To overcome this situation, a data scientist can use regularization, introducing a tuning parameter (additional factors such as a data points mean value or a minimum or maximum limitation, which gives you the ability to change the complexity or smoothness of your model) into the data science process to solve an ill-posed problem or to prevent overfitting. 
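Here is a minimal sketch of that tuning-parameter idea, using ridge regression from scikit-learn on an invented daffodil-sales dataset (both the library choice and the data are our own illustration, not something prescribed in the original text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Invented example: temperature plus several noisy "explanatory" columns
temperature = rng.uniform(5, 30, size=60)
noise_columns = rng.normal(size=(60, 8))        # stand-ins for rainfall, pollen count, ...
X = np.column_stack([temperature, noise_columns])
sales = 3.0 * temperature + rng.normal(scale=5, size=60)

plain = LinearRegression().fit(X, sales)
regularized = Ridge(alpha=10.0).fit(X, sales)    # alpha is the tuning parameter

# The penalty shrinks the coefficients of the noise columns toward zero,
# which is the smoothing effect regularization is meant to provide
print(np.round(plain.coef_, 2))
print(np.round(regularized.coef_, 2))
```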
Assessment When a data scientist evaluates a model or data science process for performance, this is referred to as assessment. Performance can be defined in several ways, including the model's growth of learning or the model's ability to improve (with) learning (to obtain a better score) with additional experience (for example, more rounds of training with additional samples of data) or accuracy of its results. One popular method of assessing a model or processes performance is called bootstrap sampling. This method examines performance on certain subsets of data, repeatedly generating results that can be used to calculate an estimate of accuracy (performance). The bootstrap sampling method takes a random sample of data, splits it into three files--a training file, a testing file, and a validation file. The model or process logic is developed based on the data in the training file and then evaluated (or tested) using the testing file. This tune and then test process is repeated until the data scientist is comfortable with the results of the tests. At that point, the model or process is again tested, this time using the validation file, and the results should provide a true indication of how it will perform. [box type="note" align="" class="" width=""]You can imagine using the bootstrap sampling method to develop program logic by analyzing test data to determine logic flows and then running (or testing) your logic against the test data file. Once you are satisfied that your logic handles all of the conditions and exceptions found in your testing data, you can run a final test on a new, never-before-seen data file for a final validation test. [/box] Cross-validation Cross-validation is a method for assessing a data science process performance. Mainly used with predictive modeling to estimate how accurately a model might perform in practice, one might see cross-validation used to check how a model will potentially generalize, in other words, how the model can apply what it infers from samples to an entire population (or recordset). With cross-validation, you identify a (known) dataset as your validation dataset on which training is run along with a dataset of unknown data (or first seen data) against which the model will be tested (this is known as your testing dataset). The objective is to ensure that problems such as overfitting (allowing non-inclusive information to influence results) are controlled and also provide an insight into how the model will generalize a real problem or on a real data file. The cross-validation process will consist of separating data into samples of similar subsets, performing the analysis on one subset (called the training set) and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Typically, a data scientist will use a models stability to determine the actual number of rounds of cross-validation that should be performed. Neural networks Neural networks are also called artificial neural networks (ANNs), and the objective is to solve problems in the same way that the human brain would. Google will provide the following explanation of ANN as stated in Neural Network Primer: Part I, by Maureen Caudill, AI Expert, Feb. 
1989: [box type="note" align="" class="" width=""]A computing system made up of several simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. [/box] To oversimplify the idea of neural networks, recall the concept of software encapsulation, and consider a computer program with an input layer, a processing layer, and an output layer. With this thought in mind, understand that neural networks are also organized in a network of these layers, usually with more than a single processing layer. Patterns are presented to the network by way of the input layer, which then communicates to one (or more) of the processing layers (where the actual processing is done). The processing layers then link to an output layer where the result is presented. Most neural networks will also contain some form of learning rule that modifies the weights of the connections (in other words, the network learns which processing nodes perform better and gives them a heavier weight) per the input patterns that it is presented with. In this way (in a sense), neural networks learn by example as a child learns to recognize a cat from being exposed to examples of cats. Boosting In a manner of speaking, boosting is a process generally accepted in data science for improving the accuracy of a weak learning data science process. [box type="note" align="" class="" width=""]Data science processes defined as weak learners are those that produce results that are only slightly better than if you would randomly guess the outcome. Weak learners are basically thresholds or a 1-level decision tree. [/box] Specifically, boosting is aimed at reducing bias and variance in supervised learning. What do we mean by bias and variance? Before going on further about boosting, let's take note of what we mean by bias and variance. Data scientists describe bias as a level of favoritism that is present in the data collection process, resulting in uneven, disingenuous results and can occur in a variety of different ways. A sampling method is called biased if it systematically favors some outcomes over others. A variance may be defined (by a data scientist) simply as the distance from a variable mean (or how far from the average a result is). The boosting method can be described as a data scientist repeatedly running through a data science process (that has been identified as a weak learning process), with each iteration running on different and random examples of data sampled from the original population recordset. All the results (or classifiers or residue) produced by each run are then combined into a single merged result (that is a gradient). This concept of using a random subset of the original recordset for each iteration originates from bootstrap sampling in bagging and has a similar variance-reducing effect on the combined model. In addition, some data scientists consider boosting a means to convert weak learners into strong ones; in fact, to some, the process of boosting simply means turning a weak learner into a strong learner. Lift In data science, the term lift compares the frequency of an observed pattern within a recordset or population with how frequently you might expect to see that same pattern occur within the data by chance or randomly. If the lift is very low, then typically, a data scientist will expect that there is a very good probability that the pattern identified is occurring just by chance. 
The larger the lift, the more likely it is that the pattern is real. Mode In statistics and data science, when a data scientist uses the term mode, he or she refers to the value that occurs most often within a sample of data. The mode is not calculated; it is determined manually or through processing of the data. Outlier Outliers can be defined as follows: A data point that is way out of keeping with the others That piece of data that doesn't fit Either a very high value or a very low value Unusual observations within the data An observation point that is distant from all others Predictive modeling The development of statistical models and/or data science processes to predict future events is called predictive modeling. Big Data Again, we have some variation in the definition of big data. A large assemblage of data, datasets that are so large or complex that traditional data processing applications are inadequate, and data about every aspect of our lives have all been used to define or refer to big data. In 2001, then Gartner analyst Doug Laney introduced the 3Vs concept. The 3Vs, as per Laney, are volume, variety, and velocity. The Vs make up the dimensionality of big data: volume (the measurable amount of data), variety (the number of types of data), and velocity (the speed of processing or dealing with that data). Confidence interval The confidence interval is a range of values that a data scientist will specify around an estimate to indicate their margin of error, combined with a probability that a value will fall in that range. In other words, confidence intervals are good estimates of the unknown population parameter. Writing Although visualizations grab much more of the limelight when it comes to presenting the output or results of a data science process or predictive model, writing is not only an important part of how a data scientist communicates but is still considered an essential skill for all data scientists to be successful. Did we miss any of your favorite terms? Now that you are at the end of this post, we ask you again: On a scale of 1 to 30 (1 being the lowest and 30, the highest), how do you rate yourself as a data scientist? Why You Need to Know Statistics To Be a Good Data Scientist [interview] How data scientists test hypotheses and probability 6 Key Areas to focus on while transitioning to a Data Scientist role Soft skills every data scientist should teach their child
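To tie a few of the simpler terms above together - mode, outlier and confidence interval - here is a short sketch using NumPy, SciPy and the standard library; the sample values are invented purely for illustration:

```python
import statistics
import numpy as np
from scipy import stats

data = np.array([12, 15, 14, 15, 16, 15, 14, 13, 15, 48])   # invented sample

# Mode: the value that occurs most often
print("mode:", statistics.mode(data.tolist()))

# A simple outlier check: values more than 2 standard deviations from the mean
z_scores = (data - data.mean()) / data.std(ddof=1)
print("possible outliers:", data[np.abs(z_scores) > 2])

# A 95% confidence interval for the population mean, using the t distribution
mean, sem = data.mean(), stats.sem(data)
margin = stats.t.ppf(0.975, df=len(data) - 1) * sem
print("95% CI:", (mean - margin, mean + margin))
```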

9 Data Science Myths Debunked

Amey Varangaonkar
03 Jul 2018
9 min read
The benefits of data science are evident for all to see. Not only does it equip you with the tools and techniques to make better business decisions, the predictive power of analytics also allows you to determine future outcomes - something that can prove to be crucial to businesses. Despite all these advantages, data science is a touchy topic for many businesses. It's worth looking at some glaring stats that show why businesses are reluctant to adopt data science:

Poor data across businesses and organizations - both private and government - costs the U.S. economy close to $3 trillion per year.
Only 29% of enterprises are able to properly leverage the power of Big Data and derive useful business value from it.

These stats show a general lack of awareness or knowledge when it comes to data science. Preconceived notions, or simply a lack of knowledge about data science and its application, seem to be a huge hurdle for these companies. In this article, we attempt to take down some of these notions and give a much clearer picture of what data science really is. Here are 9 of the most common myths or misconceptions in data science, and why they are absolutely wrong:

Data science is just a fad, it won't last long

This is probably the most common misconception. Many tend to forget that although 'data science' is a recently coined term, this field of study is a culmination of decades of research and innovation in statistical methodologies and tools. It has been in use since the 1960s or even before - just that the scale at which it was being used then was small. Back in the day, there were no 'data scientists', just statisticians and economists who used now largely forgotten terms such as 'data fishing' or 'data dredging'. Even the terms 'data analysis' and 'data mining' only went mainstream in the 1990s, but they were in use well before that period.

Data science's rise to fame has coincided with the exponential rise in the amount of data being generated every minute. The need to understand this information and make positive use of it led to an increase in the demand for data science. Now, with Big Data and the Internet of Things growing rapidly, the rate of data generation and the subsequent need for its analysis will only increase. So if you think data science is a fad that will go away soon, think again.

Data science and Business Intelligence are the same

Those who are unfamiliar with what data science and Business Intelligence actually entail often get confused and think they're one and the same. They're not. Business Intelligence is an umbrella term for the tools and techniques that give answers to the operational and contextual aspects of your business or organization. Data science, on the other hand, has more to do with collecting information in order to build patterns and insights. Learning about your customers or your audience is Business Intelligence. Understanding why something happened, or whether it will happen again, is data science. If you want to gauge how changing a certain process will affect your business, data science - not Business Intelligence - is what will help you.

Data science is only meant for large organizations with large resources

Many businesses and entrepreneurs are wrongly of the opinion that data science is - or works best - only for large organizations, and that you need sophisticated infrastructure to process and get the most value out of your data. In reality, all you need is a bunch of smart people who know how to get the best value out of the available data. When it comes to taking a data-driven approach, there's no need to invest a fortune in setting up an analytics infrastructure for an organization of any scale. There are many open source tools out there which can be easily leveraged to process large-scale data with efficiency and accuracy. All you need is a good understanding of the tools.

It is difficult to integrate data science systems with the organizational workflow

With the advancement of tech, one critical challenge that has now become very easy to overcome is collaborating with different software systems at once. With the rise of general-purpose programming languages, it is now possible to build a variety of software systems using a single programming language. Take Python, for example. You can use it to analyze your data, perform machine learning, or develop neural networks to work on more complex data models. All the while, you can link a web API designed in Python to communicate with these data science systems. Provisions are also being made to integrate code written in different programming languages while ensuring smooth interoperability and no loss of latency. So if you're wondering how to incorporate your analytics workflow into your organizational workflow, don't worry too much.

Data scientists will be replaced by Artificial Intelligence soon

Although there has been increased adoption of automation in data science, the notion that the work of a data scientist will soon be taken over by an AI algorithm is rather interesting. Currently, there is an acute shortage of data scientists, as this McKinsey Global Report suggests. Could this change in the future? Will automation completely replace human efforts when it comes to data science? Surely machines are a lot better than humans at finding patterns; AI beat the best Go player, remember. This is what the common perception seems to be, but it is not true. However sophisticated the algorithms become in automating data science tasks, we will always need a capable data scientist to oversee them and fine-tune their performance. Not just that, businesses will always need professionals with strong analytical and problem-solving skills and relevant domain knowledge. They will always need someone to communicate the insights coming out of the analysis to non-technical stakeholders. Machines don't ask questions of data. Machines don't convince people. Machines don't understand the 'why'. Machines don't have intuition. At least, not yet. Data scientists are here to stay, and their demand is not expected to go down anytime soon.

You need a Ph.D. in statistics to be a data scientist

No, you don't. Data science involves crunching numbers to get interesting insights, and it often involves the use of statistics to better understand the results. When it comes to performing some advanced tasks such as machine learning and deep learning, sure, an advanced knowledge of statistics helps. But that does not imply that people who do not have a degree in maths or statistics cannot become expert data scientists. Today, organizations are facing a severe shortage of data professionals capable of leveraging the data to get useful business insights. This has led to the rise of citizen data scientists - professionals who are not experts in data science, but can use the data science tools and techniques to create efficient data models.
These data scientists are no experts in statistics and maths, they just know the tool inside out, ask the right questions, and have the necessary knowledge of turning data into insights. Having an expertise of the data science tools is enough Many people wrongly think that learning a statistical tool such as SAS, or mastering Python and its associated data science libraries is enough to get the data scientist tag. While learning a tool or skill is always helpful (and also essential), by no means is it the only requisite to do effective data science. One needs to go beyond the tools and also master skills such as non-intuitive thinking, problem-solving, and knowing the correct practical applications of a tool to tackle any given business problem. Not just that, it requires you to have excellent communication skills to present your insights and findings related to the most complex of analysis to other stakeholders, in a way they can easily understand and interpret. So if you think that a SAS certification is enough to get you a high-paying data science job and keep it, think again. You need to have access to a lot of data to get useful insights Many small to medium-sized businesses don’t adopt a data science framework because they think it takes lots and lots of data to be able to use the analytics tools and techniques. Data when present in bulk, always helps, true, but don’t need hundreds of thousands of records to identify some pattern, or to extract relevant insights. Per IBM, data science is defined by the 4 Vs of data, meaning Volume, Velocity, Veracity and Variety. If you are able to model your existing data into one of these formats, it automatically becomes useful and valuable. Volume is important to an extent, but it’s the other three parameters that add the required quality. More data = more accuracy Many businesses collect large hordes of information and use the modern tools and frameworks available at their disposal for analyzing this data. Unfortunately, this does not always guarantee accurate results. Neither does it guarantee useful actionable insights or more value. Once the data is collected, the preliminary analysis on what needs to be done with the data is required. Then, we use the tools and frameworks at our disposal to extract the relevant insights and built an appropriate data model. These models need to be fine-tuned as per the processes for which they will be used. Then, eventually, we get the desired degree of accuracy from the model. Data in itself is quite useless. It’s how we work on it - more precisely, how effectively we work on it - that makes all the difference. So there you have it! Data science is one of the most popular skills to have in your resume today, but it is important to first clear all the confusions and misconceptions that you may have about it. Lack of information or misinformation can do more harm than good, when it comes to leveraging the power of data science within a business - especially considering it could prove to be a differentiating factor for its success and failure. Do you agree with our list? Do you think there are any other commonly observed myths around data science that we may have missed? Let us know. Read more 30 common data science terms explained Why is data science important? 15 Useful Python Libraries to make your Data Science tasks Easier

Uber's kepler.gl, an open source toolbox for GeoSpatial Analysis

Pravin Dhandre
28 Jun 2018
4 min read
Geography Visualization, also called Geovisualization, plays a pivotal role in areas like cartography, geographic information systems, remote sensing and global positioning systems. Uber, a peer-to-peer transportation network company headquartered in California, believes in data-driven decision making and hence keeps developing smart frameworks like deck.gl for exploring and visualizing advanced geospatial data at scale. Uber strives to make the data web-based and shareable in real time across their teams and customers.

Early this month, Uber surprised the geospatial market with its newly open-sourced toolbox, kepler.gl, a geoanalytics tool to gain quick insights from geospatial data with intuitive visualizations.

What exactly is kepler.gl?

kepler.gl is a visualization-rich web platform, developed on top of deck.gl, a WebGL-powered data visualization library providing real-time visual analytics of millions of geolocation points. The platform provides visual exploration of geographical datasets along with spatial aggregation of all data points collected. The platform is said to be data-agnostic, with a single interface to convert your data into insightful visualizations.

https://www.youtube.com/watch?v=i2fRN4e2s0A

The platform is very user-friendly: you can simply drag CSV or GeoJSON files and drop them into the browser to visualize the dataset more intuitively. It supports different map layers, a filtering option, and an aggregation feature through which you can get the final visualization in an animated format, like a video. The usability of the features is so high that you can apply all the available metrics to your data points without much hassle. The web platform exhibits high performance, letting you get insights from your spatial data in less than 10 minutes, and that too in a single window. Another advantage of this framework is that it does not involve any sort of coding, so non-technical users can also reap the benefits by churning out valuable insights from the data points.

The platform is also equipped with some advanced, complex features such as a 2D cartographic plane, a separate dimension for altitude, and control over the visible height of hexagons and grids. Users seem happy with the new height feature, which helps them detect abnormalities and illicit traits in an aggregated map. With the filtering menu, analysts and engineers can compare their data and have a granular look at their data points. This option also helps in reading the histogram well, so one can easily detect outliers and make the dataset more reliable. It also has a feature to add playback to time series data points, which makes getting useful information out of real-time location systems easy.

The team at Uber looks at this toolbox with a long-term vision: they plan to keep adding new features and enhancements to make it a highly functional, single-click visualization dashboard. The team has already announced two major enhancements to the current functionality in the next couple of months:

More robust exploration: There will be interlinkage between charts and maps, and support for custom charts, maps and widgets like the renowned BI tool Tableau, which will help analytics teams unveil deeper insights.
Addition of newer geo-analytical capabilities: To support massive datasets, there will be added features for data operations such as polygon aggregation, union of data points, and operations like joining and buffering.

Companies across different verticals such as Airbnb, Atkins Global, Cityswifter and Mapbox have found great value in kepler.gl's offerings and are looking to engineer their products to leverage this framework. The visualization specialists at these companies have already praised Uber for building such a simple yet fast platform with remarkable capabilities.

To get started with kepler.gl, read the documentation available on GitHub and start creating visualizations to enhance your geospatial data analysis; a minimal notebook sketch is shown after the related links below.
Top 7 libraries for geospatial analysis
Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data
Data Visualization with ggplot2
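The sketch promised above: a minimal way to try kepler.gl from a Jupyter notebook, assuming the keplergl Python package (the official Jupyter widget) and pandas are installed; the trips.csv file name is a hypothetical placeholder for any dataset with latitude/longitude columns:

```python
# Run inside a Jupyter notebook, after `pip install keplergl pandas`
import pandas as pd
from keplergl import KeplerGl

# Hypothetical file: any CSV with latitude/longitude columns will do
trips = pd.read_csv("trips.csv")

kepler_map = KeplerGl(height=500)            # create the interactive map widget
kepler_map.add_data(data=trips, name="trips")
kepler_map                                    # evaluating the widget renders the map inline

# The configured map can also be exported as a standalone HTML page
kepler_map.save_to_html(file_name="trips_map.html")
```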

Python, Tensorflow, Excel and more - Data professionals reveal their top tools

Amey Varangaonkar
06 Jun 2018
4 min read
Data professionals are constantly on the lookout for the best tools to simplify their data science tasks - be it data acquisition, machine learning, or visualizing the results of the analysis. With so much on their plate already, having robust, efficient tools in the arsenal helps them a lot in reducing the procedural complexities. Not just that, the time taken to do these tasks is considerably reduced as well. But what tools do data professionals rely on to make their lives easier? Thanks to the Skill-up 2018 survey that we recently conducted, we have some interesting observations to share with you! Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free. Key Takeaways Python is the most widely used programming language by data professionals Python finds a wide adoption across all spectrums of data science - including data analysis, machine learning, deep learning and data visualization Excel continues to be favored by the data professionals because of its effectiveness and simplicity R is slowly falling behind Python in the race to Data Science supremacy Now, let’s look at these observations, in more depth. Python continues its ascension as the top dog Python’s rise in popularity as well as adoption over the last 3 years has been quite staggering, to say the least. Python’s ease of use, powerful analytical and machine learning capabilities as well as its applications outside of data science make it quite a popular language in the tech community. It thus comes as no surprise that it stood out from the others and was the undisputed choice of language for the data pros. R, on the other hand, seems to be finding it difficult to play catch-up to Python, with less than half the number of votes - despite being the tool of choice for many statisticians and researchers. Is the paradigm shift well and truly on? Is Python edging R out for good? Source: Packt Skill-Up Survey 2018 It is interesting to see SQL as the number 2, but considering the number of people working with databases these days it doesn’t come as a surprise. Also, JavaScript is preferred more than Java, indicating the rising need for web-based dashboards for effective Business Intelligence. Data professionals still love Excel, but Python libraries are taking over Microsoft Excel has traditionally been a highly popular tool for data analysis, especially when dealing with data with hundreds and thousands of records. Excel’s perfect setting for data manipulation and charting continues to be the reason why people still use it for basic-level data analysis, as indicated by our survey. Almost 53% of the respondents prefer having Excel in their analysis toolkit for their day to day tasks. Top libraries, tools and frameworks used by data professionals (Source: Packt Skill-Up Survey 2018) The survey also indicated Python’s rising dominance in the data science domain, with 8 out of the 10 most-used tools for data analysis being Python-based. Python’s offerings for data wrangling, scientific computing, machine learning and deep learning make its libraries the obvious choice for data professionals. Here’s a quick look at  15 useful Python libraries to make the above-mentioned data science tasks easier. Tensorflow and PyTorch are in demand AI’s popularity is soaring with every passing day as it finds applications across all types of industries and business domains. 
In our survey, we found machine learning and deep learning to be two of the most valuable skills to have for any data scientist, as can be seen from the word cloud below: Word cloud for the most valued skills by data professionals (Source: Packt Skill-Up Survey) Python’s two popular deep learning frameworks - Tensorflow and PyTorch have thus gained a lot of attention and adoption in the recent times. Along with Keras - another Python library - these two libraries are the most used frameworks used by data scientists and ML developers for building efficient machine learning and deep learning models. Which language/libraries do you use for your everyday Data Science tasks? Do you agree with your peers’ choice of tools? Feel free to let us know! Read more Data cleaning is the worst part of data analysis, say data scientists 30 common data science terms explained Top 10 deep learning frameworks
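As a small illustration of why these Python frameworks top the survey, here is a minimal tf.keras model definition - a generic sketch of the API, not a model taken from the survey itself; the layer sizes and input shape are arbitrary:

```python
# A minimal Keras model definition using the TensorFlow backend (tf.keras)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=5)   # X_train / y_train would be your own data
```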

Messaging app Telegram's updated Privacy Policy is an open challenge

Amarabha Banerjee
08 Sep 2018
7 min read
Social media companies are facing a lot of heat at present because of their privacy issues. One of them is Facebook; the Cambridge Analytica scandal even prompted a Senate hearing for Mark Zuckerberg. On the other end of this spectrum is another messaging app, Telegram, registered in London, United Kingdom, and founded by the Russian entrepreneur Pavel Durov. Telegram has been in the news for the absolutely opposite situation. It's often touted as one of the most secure and secretive messaging apps. Its end-to-end encryption ensures that security agencies across the world have a tough time getting access to any suspicious piece of information. For this reason, Russia banned the use of the Telegram app in April 2018.

Telegram recently updated its privacy policy. These updates have further ensured that Telegram will retain the title of the most secure messaging application on the planet. Any messaging app needs access to our data, but how it chooses to use that data makes you either vulnerable or secure. In its latest update, Telegram states that it processes personal data on the grounds that such processing caters to the following two goals:

Providing effective and innovative Services to our users
To detect, prevent or otherwise address fraud or security issues in respect of their provision of Services

The caveat to the second point is that such security interests shall not override the fundamental rights and freedoms that require the protection of personal data. This clause is an excellent example of how applications can prove to be a torchbearer for human rights and basic human privacy amidst glaring loopholes. Telegram has listed the kinds of user data accessed by the app. They are as follows:

Basic Account Data

Telegram stores basic account user data that includes the mobile number, profile name, profile picture and about information, which are needed to create a Telegram account. The most interesting part of this is that Telegram allows you to keep only your username public (if you choose to). The people who have you in their contact list will see you as you want them to - for example, you might be a John Doe in public, but your mom will still see you as 'Dear Son' in her contacts. Telegram doesn't require your real name, gender, age, or even your screen name to be your real name.

E-mail Address

When you enable 2-step verification for your account or store documents using the Telegram Passport feature, you can opt to set up a password recovery email. This address will only be used to send you a password recovery code if you forget your password. Telegram is particular about not sending any unsolicited marketing emails to you.

Personal Messages

Cloud Chats

Telegram stores messages, photos, videos and documents from your cloud chats on their servers so that you can access your data from any of your devices anytime without having to rely on third-party backups. All data is stored heavily encrypted, and the encryption keys in each case are stored in several other data centers in different jurisdictions. This way, local engineers or physical intruders cannot get access to user data.

Secret Chats

Telegram has a feature called secret chats that uses end-to-end encryption. This means that all data is encrypted with a key that only the sender and the recipients know. There is no way for Telegram or anybody else without direct access to your device to learn what content is being sent in those messages. Telegram does not store secret chats on their servers.
They also do not keep any logs for messages in secret chats, so after a short period of time there is no way of determining who or when you messaged via secret chats. Secret chats are not available in the cloud — you can only access those messages from the device they were sent to or from. Media in Secret Chats When you send photos, videos or files via secret chats, before being uploaded, each item is encrypted with a separate key, not known to the server. This key and the file’s location are then encrypted again, this time with the secret chat’s key — and sent to your recipient. They can then download and decipher the file. This means that the file is technically on one of Telegram’s servers, but it looks like a piece of random indecipherable garbage to everyone except for you and the recipient. This complete process is random and there random data packets are periodically purged from the storage disks too. Public Chats In addition to private messages, Telegram also supports public channels and public groups. All public chats are cloud chats. Like everything else on Telegram, the data you post in public communities is encrypted, both in storage and in transit — but everything you post in public will be accessible to everyone. Phone Number and Contacts Telegram uses phone numbers as unique identifiers so that it is easy for you to switch from SMS and other messaging apps and retain your social graph. But the most important thing is that permissions from the users are a must before the cookies are allowed into your browser. Cookies Telegram promises that the only cookies they use are those to operate and provide their Services on the web. They clearly state that they don’t use cookies for profiling or advertising. Their cookies are small text files that allow them to provide and customize their Services, and provide an enhanced user experience. Also, whether or not to use these cookies is a choice made by the users. So, how does Telegram remain in business? The Telegram business model doesn’t match that of a revenue generating service. The founder Pavel Durov is also the founder of the popular Russian social networking site VK. Telegram doesn’t charge for any messaging services, it doesn’t show ads yet. Some new in app purchase features might be included in the new version. As of now, the main source of revenue for Telegram are donations and mainly the earnings of Pavel Durov himself (from the social networking site VK). What can social networks learn from Telegram? Telegram’s policies elevate privacy standards that many are asking from other social messaging apps. The clamour for stopping the exploitation of user data, using their location details for targeted marketing and advertising campaigns is increasing now. Telegram shows that privacy can be achieved, if intended, in today’s overexposed social media world. But there is are also costs to this level of user privacy and secrecy, that are sometimes not discussed enough. The ISIS members behind the 2015 Paris attacks used Telegram to spread propaganda. ISIS also used the app to recruit the perpetrators of the Christmas market attack in Berlin last year and claimed credit for the massacre. More recently, a Turkish prosecutor found that the shooter behind the New Year’s Eve attack at the Reina nightclub in Istanbul used Telegram to receive directions for it from an ISIS leader in Raqqa. 
While these incidents can never negate the need for a secure and less intrusive social media platform like Telegram, there should be workarounds and escape routes designed to stop extremist and terrorist activities. Telegram has assured that all ISIS messaging channels have been deleted from its network, which is a great way to start. Content moderation, proactive sentiment and pattern recognition, and content/account isolation are the next challenges for Telegram. One thing is for sure: Telegram's continual pursuit of user secrecy and user data privacy is throwing an open challenge to others to follow suit. Whether others will oblige or not, only time will tell. To read about Telegram's updated privacy policy in detail, you can check out the official Telegram Privacy Settings. A simplified code sketch of the 'separate key per file' idea from the Media in Secret Chats section follows the related links below.
How to stay safe while using Social Media
Time for Facebook, Twitter and other social media to take responsibility or face regulation
What RESTful APIs can do for Cloud, IoT, social media and other emerging technologies
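The sketch referenced above is a deliberately simplified illustration of envelope encryption, using the Python cryptography package's Fernet recipe. It is not Telegram's actual MTProto scheme - just the general pattern the article describes: encrypt each file with its own key, then encrypt that key with the chat key.

```python
# Simplified illustration only -- not Telegram's actual MTProto scheme
from cryptography.fernet import Fernet

chat_key = Fernet.generate_key()             # key shared only by the two chat partners
file_bytes = b"...photo or video contents..."

# 1. Encrypt the file with a fresh, file-specific key
file_key = Fernet.generate_key()
encrypted_file = Fernet(file_key).encrypt(file_bytes)   # stored server-side as opaque bytes

# 2. Encrypt the file key itself with the chat key and send it to the recipient
wrapped_key = Fernet(chat_key).encrypt(file_key)

# Only someone holding the chat key can unwrap the file key and then the file
recovered_key = Fernet(chat_key).decrypt(wrapped_key)
print(Fernet(recovered_key).decrypt(encrypted_file) == file_bytes)   # True
```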

article-image-analyzing-the-chief-data-officer-role
Aaron Lazar
13 Apr 2018
9 min read
Save for later

GDPR is pushing the Chief Data Officer role center stage

Aaron Lazar
13 Apr 2018
9 min read
Gartner predicted that by 2020, 90% of large organizations in regulated industries will have a Chief Data Officer role. With the recent heat around Facebook, Mark Zuckerberg, and the fast-approaching GDPR compliance deadline, it’s quite likely that 2018 will be the year of the Chief Data Officer. This article was first published in October 2017 and has been updated to keep up with the latest trends in GDPR.

In 2014, around 400 CEOs and top business execs were asked how they recognize data as a corporate asset. They responded with a mixed set of reactions and viewed the worth of data in their organization in varied ways. Now, in 2018, these reactions have drastically changed: more and more organizations have realized the importance of Data-as-an-Asset. More importantly, the European Union (EU) has made it mandatory that General Data Protection Regulation (GDPR) compliance be achieved by the 25th of May, 2018.

The primary reason for creating the Chief Data Officer role was to connect the ends between functional management and IT teams in an organization. But now, it looks like the CDO will also be primarily focusing on setting up and driving GDPR compliance, to avoid a fine of up to €20 million or 4% of the annual global turnover of the previous year, whichever is higher. We’re going to spend a few minutes breaking down the Chief Data Officer role for you, revealing several interesting insights along the way. Let’s start with the obvious question.

What might the Chief Data Officer’s responsibilities be?

Like other C-suite execs, a Chief Data Officer is expected to have a well-blended mix of technical know-how and business acumen. Their role is very diverse, and its scope can be hard to pin down. Here are some of the key responsibilities of a Chief Data Officer:

Data Policies and GDPR Compliance

Data security is one of the most important elements that any business must consider. It needs to comply with the regulatory standards and requirements of the country where it operates. A Chief Data Officer is responsible for ensuring the compliance of policies across all branches of the business and the associated compliance requirement taxonomies, on a global level.

What is GDPR?

If you’re thinking there were no data protection laws before the General Data Protection Regulation 2016/679, that’s not true. The major difference, however, is that GDPR focuses more on customer data privacy and protection. GDPR requirements will change the way organizations store, process and protect their customers’ personal data. They will need solutions for assessing, implementing and maintaining GDPR compliance, and that’s where Chief Data Officers fit in.

Using data to gain a competitive edge

Chief Data Officers need to have sound knowledge of the business’ customers and the markets where they operate, and strong analytical skills to leverage the right data, at the right time, in the right place. This would eventually give the business an edge over its competitors in the market. For example, a ferry service could use data to identify the rates that customers would be willing to pay at a certain time of the day.

Setting in motion the best practices of data governance

Organizations span the globe these days, and employees from different parts of the world often work on the same data. This can result in data moving through unconnected systems and ending up as inefficient or disjointed pieces of business information.
A Chief Data Officer needs to ensure that this information is aggregated and maintained in such a way that clear information ownership across the organization is established.

Architecting future-proof data solutions

A Chief Data Officer often acts like a Data Architect. They will sometimes take on responsibility for planning, designing, and building Big Data systems and ensuring successful integration with other systems in an organization. Designing systems that can provide answers to the user’s problems now and in the future is vital. Chief Data Officers are often found asking themselves key questions like how to generate data with maximum reusability, while also making sure it’s as accurate and relevant as possible.

Defining Information Management tools

Different business units across the globe tend to use different tools, technologies, and systems to work on, store, and share information on an enterprise level. This greatly affects a company’s ability to access and leverage data for effective decision making and various other duties. A Chief Data Officer is responsible for establishing data-oriented standards across the business and for driving all arms of the business to comply with the standards and embrace change, to ensure the integrity of data.

Spotting new opportunities

A Chief Data Officer is responsible for spotting new opportunities that the business can venture into, through careful analysis of data and past records. For example, a motor company could leverage sales information to make an informed decision on which age group to target with its new SUV in the making, to maximize sales. These are just the tip of the iceberg when it comes to a Chief Data Officer’s responsibilities. Responsibilities go hand in hand with the skills and traits required to execute them. Below are some key capabilities sought after in a CDO.

Key Chief Data Officer Skills

The right person for the job is expected to possess impeccable leadership and C-suite level communication skills, as well as strong business acumen. They are expected to have strong knowledge of GDPR software tools and solutions to enable the organizations hiring them to swiftly transform and adopt the new regulations. They are also expected to possess knowledge of IT architecture, including a familiarity with leading architectural standards such as TOGAF or the Zachman framework. They need to be experienced in driving data governance as well as data quality and integrity, while also possessing a strong knowledge of data analytics, visualization, and storytelling. Familiarity with Big Data solutions like Hadoop, MapReduce, and HBase is a plus.

How much do Chief Data Officers earn?

Now, let’s take a look at what kind of compensation a Chief Data Officer is likely to be offered. To tell you the truth, the answer to this question is still a bit hazy, but it’s sure to pick up speed with the recent developments in the regulatory and legal areas related to data. About a year ago, a blog post from careeraddict revealed that the salary for a CDO in the US was around $112,000 annually, while a job listing seen on Indeed quoted $200,000 as the annual salary. Indeed shows 7 jobs posted for a CDO in the last 15 days. We took the estimated salaries of CDOs and compared them with those of CIOs and CTOs in the same company. It turns out that most were on par, with a few CDO compensations falling slightly short of the CIO and CTO salary. These are just basic salary figures; bonuses come on top, amounting to as much as 50% in some cases.
However, please note that these salary figures vary heavily based on the type of organization and the industry.

Do businesses even need a Chief Data Officer?

One might argue that some of the skills expected of a Chief Data Officer would also be held by the CIO, the Chief Digital Officer, or the Data Protection Officer (if the organization has one). Then why have a Chief Data Officer at all and incur a significant extra cost to the company? With the rapid change in tech and the rate at which data is generated, used, and discarded, most indicators point in the direction of having a separate Chief Data Officer working alongside the CIO. It’s critical to have a clearly defined need for both roles to co-exist. Blurring the boundaries of the two roles can be detrimental, and organizations must therefore be painstakingly mindful of the defined KRAs; the organization should clearly define the two roles to keep the business structure running smoothly.

A Chief Data Officer’s main focus will be on the latest data-centric technological innovations and their compliance with the new standards, while also boosting customer engagement, privacy and, in turn, loyalty and the business’s competitive advantage. The CIO, on the other hand, focuses on improving the bottom line by owning business productivity metrics, cost-cutting initiatives, making IT investments and so on – an inward-facing data-management and architecture role. The CIO is therefore the person responsible for leading digital initiatives at a board level. In addition to managing data and governing information, if the CIO’s responsibilities were to also include implementing analytics in fresh ways to generate value for the business, it would be a tall order. To put it simply, it is more practical for the CIO to own the systems and the CDO to oversee all the bits and bytes that flow through these systems. Moreover, in several cases, the Chief Data Officer will act as a liaison between the business and IT. Thus, Chief Data Officers and CIOs need to work together and support each other for a better functioning business.

The bottom line: a Chief Data Officer is essential

For an organization dealing with a lot of data, a Chief Data Officer is a must. Failure to have one on board can result in being fined €10 million or 2% of the organization’s worldwide turnover (whichever is higher). Here are the criteria that require an organization to have dedicated personnel managing data protection. The organization’s core activities should:

Have data processing operations which require regular and systematic monitoring of data subjects on a large scale, or monitoring of individuals
Involve processing special categories of data on a large scale (i.e. sensitive data such as health, religion, race, sexual orientation etc.)
Have data processing carried out by a public authority or a body processing personal data, except for courts operating in their judicial capacity

Apart from this mandate, a Chief Data Officer can add immense value by aligning data-driven insights with the organization’s vision and goals. A CDO can bridge the gap between the CMO and the CIO by focusing on meeting customer requirements through data-driven products. For those in data and insights centric roles such as data scientists, data engineers, data analysts and others, the CDO role is a natural destination in their career progression journey. The Chief Data Officer role is highly attractive in terms of the scope of responsibilities, the capabilities required and, of course, the pay.
Certification courses like this one are popping up to help individuals shape themselves for the role. All in all, this new C-suite position is, in most organizations, the perfect pivot between old and new, bridging silos and shaping a future where data privacy stays intact.

article-image-data-science-for-non-techies-how-i-got-started
Amey Varangaonkar
20 Jul 2018
7 min read
Save for later

Data science for non-techies: How I got started (Part 1)

Amey Varangaonkar
20 Jul 2018
7 min read
As a category manager, I manage the data science portfolio of product ideas for Packt Publishing, a leading tech publisher. In simple terms, I place informed bets on where to invest, what topics to publish on, and so on. While I have a decent idea of where the industry is heading and of what data professionals want to learn and why, it is high time I walked in their shoes, for a couple of reasons. Basically, I want to understand the reason behind Data Science being the ‘Sexiest job of the 21st century’, and whether the role is really worth all the fame and fortune. In the process, I also wanted to explore the underlying difficulties, challenges and obstacles that every data scientist has had to endure at some point in their journey, and perhaps still does. The cherry on top is that I get to use the skills I develop to supercharge my success in my current role, which is primarily insight-driven. This is the first of a series of posts on how I got started with Data Science. Today, I’m sharing my experience with devising a learning path and then gathering appropriate learning resources.

Devising a learning path

To understand the concepts of data science, I had to research a lot. There are tons and tons of resources out there, many of which are very good. Once you separate the good from the rest, it can be quite intimidating to pick the options that suit you best. Some of the primary questions that clouded my mind were:

What should be my programming language of choice? R or Python? Or something else?
What tools and frameworks do I need to learn?
What about the statistics and mathematical aspects of machine learning? How essential are they?

Two videos really helped me find the answers to the questions above:

If you don’t want to spend a lot of your time mastering the art of data science, there’s a beautiful video on how to become a data scientist in six months.
What are the questions asked in a data science interview? What are the in-demand skills that you need to master in order to get a data science job? This video on 5 Tips For Getting a Data Science Job is really helpful.

After a lot of research, which included reading countless articles and blogs and discussions with experts, here is my learning plan:

Learn Python

Per the recently conducted Stack Overflow Developer Survey 2018, Python stood out as the most-wanted programming language, meaning that the developers who do not use it yet want to learn it the most. As one of the most widely used general-purpose programming languages, Python finds large application when it comes to data science. Naturally, you get attracted to the best option available, and Python was the one for me. The major reasons why I chose to learn Python over the other programming languages:

Very easy to learn: Python is one of the easiest programming languages to learn. Not only is the syntax clean and easy to understand, even the most complex of data science tasks can be done in a few lines of Python code (the short sketch below gives a flavour of this).
Efficient libraries for Data Science: Python has a vast array of libraries suited to various data science tasks, from scraping data to visualizing and manipulating it. NumPy, SciPy, pandas, matplotlib and Seaborn are some of the libraries worth mentioning here.
Python has terrific libraries for machine learning: Learning a framework or a library which makes machine learning easier to perform is very important. Python has libraries such as scikit-learn and Tensorflow that make machine learning easier and a fun activity.
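To give a flavour of that conciseness, here is the short sketch mentioned above. It is my own illustration rather than something taken from these resources: it trains and evaluates a simple classifier on scikit-learn's bundled iris dataset in roughly a dozen lines.

```python
# A minimal "train and evaluate" flow using scikit-learn's bundled iris data,
# so there is nothing to download. My own illustration, not course material.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                      # features and class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)             # hold out a quarter for testing

model = LogisticRegression(max_iter=1000)              # a simple baseline classifier
model.fit(X_train, y_train)                            # train on the training split
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")  # evaluate on held-out data
```

That is a complete load–train–evaluate loop in about a dozen lines, which is exactly the kind of brevity that drew me to the language.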
To make the most of these libraries, it is important to understand the fundamentals of Python. My colleague and good friend Aaron has put out a list of top 7 Python programming books, which served as a brilliant starting point for understanding the different resources out there to learn Python. The one book that stood out for me was Learn Python Programming - Second Edition, a very good book for starting Python programming from scratch. There is also a neat skill-map on Mapt, where you can progressively build up your knowledge of Python, right from the absolute basics to the most complex concepts. Another handy resource to learn the A-Z of Python is Complete Python Masterclass. This is a slightly long course, but it will take you from the absolute fundamentals to the most advanced aspects of Python programming.

Task Status: Ongoing

Learn the fundamentals of data manipulation

After learning the fundamentals of Python programming, the plan is to head straight to the Python-based libraries for data manipulation, analysis and visualization. Some of the major ones are what we already discussed above, and the plan is to learn them in the following order:

NumPy - Used primarily for numerical computing
pandas - One of the most popular Python packages for data manipulation and analysis
matplotlib - The go-to Python library for data visualization, rivaling the likes of R’s ggplot2
Seaborn - A data visualization library that runs on top of matplotlib, used for creating visually appealing charts, plots and histograms

Some very good resources to learn about all these libraries:

Python Data Analysis
Python for Data Science and Machine Learning - This is a very good course with detailed coverage of the machine learning concepts. Something to learn later.

The aim is to learn these libraries up to a fairly intermediate level, and to be able to manipulate, analyze and visualize any kind of data, including missing, unstructured and time-series data (see the short sketch further below).

Understand the fundamentals of statistics, linear algebra and probability

In order to take a step further and enter the fray of machine learning, the general consensus is to first understand the maths and statistics behind the concepts of machine learning. Implementing them in Python is relatively easier once you get the math right, and that is what I plan to do. I shortlisted some very good resources for this as well:

Statistics for Machine Learning
Stanford University - Machine Learning Course at Coursera

Task Status: Ongoing

Learn Machine Learning (sounds odd, I know)

After understanding the math behind machine learning, the next step is to learn how to perform predictive modeling using popular machine learning algorithms such as linear regression, logistic regression, clustering, and more. Using real-world datasets, the plan is to learn the art of building state-of-the-art machine learning models using Python’s very own scikit-learn library, as well as the popular Tensorflow package. To learn how to do this, the courses I mentioned above should come in handy:

Stanford University - Machine Learning Course at Coursera
Python for Data Science and Machine Learning
Python Machine Learning, Second Edition

Task Status: To be started

Note: During the course of this journey, websites like Stack Overflow and Stack Exchange will be my best friends, along with popular resources such as YouTube.

As I start this journey, I plan to share my experiences and knowledge with you all.
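As a first taste, here is the small sketch promised above. It is my own example (the column name and values are made up for illustration): it builds a short time series with pandas, fills in the missing readings and plots the result with matplotlib.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A tiny, made-up daily time series with a couple of missing readings.
dates = pd.date_range("2018-07-01", periods=7, freq="D")
sales = pd.Series([120, 135, None, 150, None, 160, 170],
                  index=dates, name="units_sold")

print("Missing values:", sales.isna().sum())   # count the gaps
clean = sales.interpolate()                    # fill gaps by linear interpolation
print(clean.describe())                        # quick summary statistics

clean.plot(title="Daily units sold (interpolated)")  # matplotlib plot via pandas
plt.show()
```

Even an example this small touches pandas for the manipulation and matplotlib (via pandas' plot method) for the visualization, with NumPy doing the numerical work underneath.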
Do you think the learning path looks good? Is there anything else that I should include in my learning path? I would really love to hear your comments, suggestions and experiences. Stay tuned for the next post, where I seek answers to questions such as ‘How much of Python should I learn in order to be comfortable with Data Science?’, ‘How much time should I devote per day or week to learn the concepts in Data Science?’ and much more.
Read more
Why is data science important?
9 Data Science Myths Debunked
30 common data science terms explained
article-image-top-five-questions-to-ask-when-evaluating-a-data-monitoring-solution
Guest Contributor
27 Oct 2018
6 min read
Save for later

Top five questions to ask when evaluating a Data Monitoring solution

Guest Contributor
27 Oct 2018
6 min read
Massive changes are happening in the way IT services are consumed and delivered. Cloud-based infrastructure is being tied together and instrumented by DevOps processes, while microservices-driven apps are replacing monolithic architectures. This evolution is driving the need for greater monitoring and better analysis of data than we have ever seen before. The need is compounded by the fact that an application today may be instrumented with the help of sensors and devices that provide users with critical input for making decisions.

Why is there a need for monitoring and analysis?

The placement of sensors on practically every available surface in the material world – from machines to humans – is a reality today. Almost anything that is capable of giving off a measurable metric or recorded event can be instrumented, in the virtual world as well as the physical world, and has a need for monitoring. Metrics involve the consistent measurement of characteristics, such as CPU usage, while events are something that is triggered, such as a temperature reading crossing a threshold. The right instrumentation, observation and analytics are required to create business insight from the myriad of data points coming from these instruments.

In the virtual world, monitoring and controlling the software components that drive business processes is critical. Data monitoring in software is an important aspect of visualizing what systems are doing – what activities are happening, and precisely when – and how well the applications and services are performing.

There is, of course, a business justification for all this monitoring of constant streams of metrics and events data. Companies want to become more data-driven; they want to apply data insights to be better situationally aware of business opportunities and threats. A data-driven organization is able to predict outcomes more effectively than one relying on historical information or on gut instinct. When vast amounts of data points are monitored and analyzed, the organization can find interesting “business moments” in the data. These insights help identify emerging opportunities and competitive advantages.

How to develop a Data monitoring strategy

Establishing an overall IT monitoring strategy that works for everyone across the board is nearly impossible. But it is possible to develop a monitoring strategy which is uniquely tailored to specific IT and business needs. At a high level, organizations can start developing their Data monitoring strategy by asking these five fundamental questions:

#1 Have we considered all stakeholder needs?

One of the more common mistakes DevOps teams make is focusing the monitoring strategy on the needs of just a few stakeholders and not addressing the requirements of stakeholders outside of IT operations, such as line of business (LOB) owners, application developers and owners, and other subgroups within operations, such as network operations (NOC) or communications teams. For example, an app developer may need usage statistics around application performance, while the network operator might be interested in network bandwidth usage by that app’s users.

#2 Will the data capture strategy meet future needs?

Organizations, of course, must focus on the data capture needs of today at the enterprise level, but at the same time must consider the future. Developing a long-term plan helps in future-proofing the overall strategy, since data formats and data exchange protocols always evolve.
The strategy should also consider future needs around ingestion and query volumes. Planning for how much data will be generated, stored and archived will help establish a better long-term plan.

#3 Will the data analytics satisfy my organization’s evolving needs?

Data analysis needs always change over time. Stakeholders will ask for different types of analysis; planning ahead for those needs and opting for a flexible data analysis strategy will help ensure that the solution is able to support them.

#4 Is the presentation layer modular and embeddable?

A flexible user interface that addresses the needs of all stakeholders is important for meeting the organization’s overarching goals. Solutions which deliver configurable dashboards that enable users to specify queries for custom dashboards meet this need for flexibility. Organizations should consider a plug-and-play model which allows users to choose different presentation layers as needed.

#5 Does the architecture enable smart actions?

The ability to detect anomalies and trigger specific actions is a critical part of a monitoring strategy. A flexible and extensible model should be used to meet the notification preferences of diverse user groups. Organizations should consider self-learning models which can be trained to detect undefined anomalies from the collected data. Monitoring solutions which address the broader monitoring needs of the entire enterprise are preferred.

What are purpose-built monitoring platforms?

Devising an overall IT monitoring strategy that meets these needs and fundamental technology requirements is a tall order. But new purpose-built monitoring platforms have been created to deal with today’s new requirements for monitoring and analyzing these specific metrics and events workloads – often called time-series data – and for providing situational awareness to the business. These platforms support ingesting millions of data points per second, can scale both horizontally and vertically, are designed from the ground up to support real-time monitoring and decision making, and have strong machine learning and anomaly detection functions to aid in discovering interesting business moments. In addition, they are resource-aware, applying compression and down-sampling functions to aid in optimal resource utilization, and are built to support faster time to market with minimal dependencies.

With the right strategy in mind, and the right tools in place, organizations can address the evolving monitoring needs of the entire organization.

About the Author

Mark Herring is the CMO of InfluxData. He is a passionate marketeer with a proven track record of generating leads, building pipeline, and building vibrant developer and open source communities. He is a data-driven marketeer with a proven ability to see the forest for the trees, improve performance, and deliver on strategic imperatives. Prior to InfluxData, Herring was vice president of corporate marketing and developer marketing at Hortonworks, where he grew the developer community by over 40x. Herring brings over 20 years of relevant marketing experience from his roles at Software AG, Sun, Oracle, and Forte Software.

TensorFlow announces TensorFlow Data Validation (TFDV) to automate and scale data analysis, validation, and monitoring.
How AI is going to transform the Data Center.
Introducing TimescaleDB 1.0, the first OS time-series database with full SQL support.

article-image-what-is-statistical-analysis-and-why-does-it-matter
Sugandha Lahoti
02 Oct 2018
6 min read
Save for later

What is Statistical Analysis and why does it matter?

Sugandha Lahoti
02 Oct 2018
6 min read
As a data developer, the concept or process of data analysis may be clear in your mind. However, although there are similarities between the art of data analysis and that of statistical analysis, there are important differences to be understood as well. This article is taken from the book Statistics for Data Science by James D. Miller. The book takes you through an entire journey of statistics, from knowing very little to becoming comfortable using various statistical methods for data science tasks. In this article, we've broken things into the following topics:

What is statistical analysis and what are its best practices?
How to establish the nature of data?

What is Statistical analysis?

Some in the study of statistics describe statistical analysis as the part of a statistical project that involves the collection and scrutiny of a data source in an effort to identify trends within the data. With data analysis, the goal is to validate that the data is appropriate for a need, and with statistical analysis, the goal is to make sense of, and draw some inferences from, the data. There is a wide range of possible statistical analysis techniques or approaches that can be considered.

How to perform a successful statistical analysis

It is worthwhile to mention some key points for ensuring a successful (or at least productive) statistical analysis effort.

As soon as you can, decide on your goal or objective. You need to know what the win is, that is, what the problem or idea is that is driving the analysis effort. In addition, you need to make sure that, whatever is driving the analysis, the result obtained must be measurable in some way. This metric or performance indicator must be identified early.
Identify key levers. This means that once you have established your goals and a way to measure performance towards obtaining those goals, you also need to find out what has an effect on the performance towards obtaining each goal.
Conduct a thorough data collection. Typically, the more data the better, but in the absence of quantity, always go with quality.
Clean your data. Make sure your data has been cleaned in a consistent way so that data issues do not impact your conclusions (a short example follows after this list).
Model, model, and model your data. Modeling drives modeling. The more you model your data, the more questions you'll have asked and answered, and the better results you'll have.
Take time to grow your statistical analysis skills. It's always a good idea to continue to evolve your experience and style of statistical analysis. The way to improve is to do it. Another approach is to remodel the data you may have on hand for other projects to hone your skills.
Optimize and repeat. As always, you need to take the time for standardizing, following proven practices, using templates, and testing and documenting your scripts and models, so that you can re-use your best efforts over and over again. You will find that this time is well spent and even your better efforts will improve with use.
Finally, share your work with others! The more eyes, the better the product.
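Here is the short example promised in the list above. It is a minimal sketch of our own rather than one from the book, and the column names and cleaning rules are invented for illustration; the point is simply that one reusable function applied to every extract keeps the cleaning consistent.

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same cleaning rules to every extract so results stay comparable."""
    out = df.copy()
    out["region"] = out["region"].str.strip().str.title()          # normalize text labels
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")   # bad values become NaN
    out = out.dropna(subset=["amount"])                             # drop rows we cannot analyze
    return out.drop_duplicates()

# Example usage with a small, made-up extract.
raw = pd.DataFrame({"region": [" north", "South ", "north"],
                    "amount": ["100", "oops", "100"]})
print(clean_sales(raw))
```

Because every extract passes through the same function, conclusions drawn later are not skewed by one file having been cleaned differently from another.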
Some interesting advice on ensuring success with statistical projects includes the following quote:

"It's a good idea to build a team that allows those with an advanced degree in statistics to focus on data modeling and predictions, while others in the team – qualified infrastructure engineers, software developers and ETL experts – build the necessary data collection infrastructure, data pipeline and data products that enable streaming the data through the models and displaying the results to the business in the form of reports and dashboards."
– G Shapira, 2017

Establishing the nature of data

When asked about the objectives of statistical analysis, one often refers to the process of describing or establishing the nature of a data source. Establishing the nature of something implies gaining an understanding of it. This understanding can be both simple and complex. For example, can we determine the type of each of the variables or components found within our data source; are they quantitative, comparative, or qualitative?

A more advanced statistical analysis aims to identify patterns in data; for example, whether there is a relationship between the variables or whether certain groups are more likely to show certain attributes than others. Exploring the relationships presented in data may appear similar to the idea of identifying a foreign key in a relational database, but in statistics, relationships between the components or variables are based upon correlation and causation. Further, establishing the nature of a data source is also, really, a process of modeling that data source. During modeling, the process always involves asking questions such as the following (in an effort to establish the nature of the data):

What? Some common examples of this (what) are revenue, expenses, shipments, hospital visits, website clicks, and so on. In the example, we are measuring quantities, that is, the amount of product that is being moved (sales).
Why? This (why) will typically depend upon your project's specific objectives, which can vary immensely. For example, we may want to track the growth of a business, the activity on a website, or the evolution of a selected product or market interest. Again, in our current transactional data example, we may want to identify over- and under-performing sales types, and determine whether new or repeat customers provide more or fewer sales.
How? The how will most likely be over a period of time (perhaps a year, month, week, and so on) and then by some other related measure, such as a product, state, region, reseller, and so on. Within our transactional data example, we've focused on the observation of quantities by sale type.

Another way to describe establishing the nature of your data is adding context to it or profiling it. In any case, the objective is to allow the data consumer to better understand the data through visualization. Another motive for adding context or establishing the nature of your data can be to gain a new perspective on the data.

In this article, we explored the purpose and process of statistical analysis and listed the steps involved in a successful statistical analysis. Next, to learn about statistical regression and why it is important to data science, read our book Statistics for Data Science.
Estimating population statistics with Point Estimation.
Why You Need to Know Statistics To Be a Good Data Scientist.
Why choose IBM SPSS Statistics over R for your data analysis project.