Big Data | 50 articles | Tech News, Tutorials & Expert Insights


04 Apr 2018

8 min read

Top 5 programming languages for crunching Big Data effectively


One of the most important decisions that Big Data professionals have to make, especially those who are new to the scene, is choosing the best programming language for big data manipulation and analysis. Understanding the Big Data problem and framing the architecture to solve it is not quite enough these days - the execution needs to be perfect as well, and choosing the right language goes a long way.

The best languages for big data

In this article, we look at 5 of the most popular - not to mention highly effective - programming languages for developing Big Data solutions.

Scala

A beautiful crossover of the object-oriented and functional programming paradigms, Scala is fast and robust, and a popular choice of language for many Big Data professionals. The fact that two of the most popular Big Data processing frameworks, Apache Spark and Apache Kafka, are written largely in Scala tells you a lot about its power. Scala runs on the JVM, which means code written in Scala can be easily used within a Java-based Big Data ecosystem. One significant factor that differentiates Scala from Java, though, is that Scala is a lot less verbose. Logic that takes hundreds of lines of confusing-looking Java code can often be written in fewer than 15 lines of Scala. One negative aspect of Scala is its steep learning curve when compared to languages like Go and Python, and this may put off beginners looking to use it.

Why use Scala for big data?

- Fast and robust
- Suitable for working with Big Data tools like Apache Spark for distributed Big Data processing
- JVM compliant, can be used in a Java-based ecosystem

Python

Python was named one of the fastest growing programming languages of 2018 in the recent Stack Overflow Developer Survey. Its general-purpose nature means it can be used across a broad spectrum of use-cases, and Big Data programming is one major area of application.
Many libraries for data analysis and manipulation that are increasingly used in Big Data frameworks to clean and manipulate large chunks of data, such as pandas, NumPy and SciPy, are all Python-based. Not just that, most popular machine learning and deep learning frameworks, such as scikit-learn, TensorFlow and many more, are also Python-based or driven through Python APIs, and are finding increasing application within the Big Data ecosystem.

One drawback of using Python, and a reason why it is not yet a first-class citizen when it comes to Big Data programming, is that it’s slow. Although Python is very easy to use, Big Data professionals have found systems built with languages such as Java or Scala to be faster and more robust than systems built with Python. However, Python makes up for this limitation with other qualities. As Python is primarily a scripting language, interactive coding and development of analytical solutions for Big Data becomes very easy. Python also integrates well with existing Big Data frameworks such as Apache Hadoop and Apache Spark, allowing you to perform predictive analytics at scale.

Why use Python for big data?

- General-purpose
- Rich libraries for data analysis and machine learning
- Easy to use
- Supports iterative development
- Rich integration with Big Data tools
- Interactive computing through Jupyter notebooks

R

It won’t come as a surprise to many that those who love statistics love R. Popularly called the ‘language of statistics’, R is used to build data models which can be used for effective and accurate data analysis. Powered by a large repository of R packages (CRAN, the Comprehensive R Archive Network), R gives you just about every type of tool for Big Data processing - right from analysis to data visualization. R can be integrated with Apache Hadoop and Apache Spark, among other popular frameworks, for Big Data processing and analytics.
One issue with using R as a programming language for Big Data is that it is not very general-purpose. This means code written in R is often not production-deployable and generally has to be translated to some other programming language such as Python or Java. That said, if your goal is only to build statistical models for Big Data analytics, R is an option you should definitely consider.

Why use R for big data?

- Built for data science
- Support for Hadoop and Spark
- Strong statistical modeling and visualization capabilities
- Support for Jupyter notebooks

Java

Then there’s always good old Java. Some of the traditional Big Data frameworks, such as Apache Hadoop and the tools within its ecosystem, are Java-based and still in use today in many enterprises. Not to mention the fact that Java is the most stable and production-ready language among all the languages we have discussed so far! Using Java to develop your Big Data applications gives you access to a large ecosystem of tools and libraries for interoperability, monitoring and much more, most of which have already been tried and tested.

One major drawback of Java is its verbosity. The fact that you have to write hundreds of lines of code in Java for a task which can be written in barely 15-20 lines of code in Python or Scala can turn off many budding programmers. However, the introduction of lambda functions in Java 8 does make life quite a bit easier. Java also does not support iterative development in the way newer languages like Python do, and this is an area of focus for future Java releases.

Despite the flaws, Java remains a strong contender when it comes to the preferred language for Big Data programming because of its history and the continued reliance on the traditional Big Data tools and frameworks.

Why use Java for big data?
- Traditional Big Data tools and frameworks are written in Java
- Stable and production-ready
- Large ecosystem of tried and tested tools and libraries

Go

Last but not the least, there’s Go - one of the fastest rising programming languages in recent times. Designed by a group of Google engineers who were frustrated with C++, we think Go is a good shout in this list - simply because of the fact that it powers so many tools used in the Big Data infrastructure, including Kubernetes, Docker and many more. Go is fast, easy to learn, and fairly easy to develop applications with, not to mention deploy them. More importantly, as businesses look at building data analysis systems that can operate at scale, Go-based systems are being used to integrate machine learning and parallel processing of data. It is also possible to interface other languages with Go-based systems with relative ease.

Why use Go for big data?

- Fast, easy to use
- Many tools used in the Big Data infrastructure are Go-based
- Efficient distributed computing

There are a few other languages you might want to consider - Julia, SAS and MATLAB being some major ones which are useful in their own right. However, when compared to the languages we talked about above, we thought they fell a bit short in some aspects - be it speed, efficiency, ease of use, documentation, or community support, among other things.

Let’s take a quick look at the comparison table of all the languages we discussed above. Note that we have used the ✓ symbol for the best possible language/s to help you make an informed decision. This is just our view, and that’s not to say that the other languages are any worse!

Scala Python R Java Go
Speed ✓ ✓ ✓
Ease of use ✓ ✓ ✓
Quick Learning curve ✓ ✓
Data Analysis capability ✓ ✓ ✓
General-purpose ✓ ✓ ✓ ✓
Big Data support ✓ ✓ ✓ ✓ ✓
Interfacing with other languages ✓ ✓ ✓
Production-ready ✓ ✓ ✓

So...which language should you choose?
The short answer: it all depends on the use-case you want to develop. If your focus is hardcore data analysis which involves a lot of statistical computing, R would be your go-to language. On the other hand, if you want to develop streaming applications for your Big Data, Scala can be a preferable choice. If you wish to apply Machine Learning to your Big Data and build predictive models, Python will come to your rescue. Lastly, if you plan to build Big Data solutions using just the traditionally-available tools, Java is the language for you.

You also have the option of combining the power of two languages to get a more efficient and powerful solution. For example, you can train your machine learning model in Python and deploy it on Spark in a distributed mode. Ultimately, it all depends on how efficiently your solution can function, and more importantly, how fast and accurate it is.

Which language do you prefer for crunching your Big Data? Do let us know!


Aaron Lazar

23 Apr 2018

5 min read

Why is Hadoop dying?


Hadoop has been the definitive big data platform for some time. The name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful?

Hadoop was essential to the growth of big data

When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to the data, as opposed to bringing data to the compute. Organisations had the opportunity to scale their data without having to worry too much about the cost. It had initial hiccups with security, the complexity of querying and querying speeds, but most of that was taken care of in the long run. Querying speeds remained quite a pain, but that wasn’t the real reason behind Hadoop’s (slow) decline.

As cloud grew, Hadoop started falling

One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. The cloud vendor market was pretty crowded, and each vendor provided its own big data processing services. These services basically did what Hadoop was doing, but in a more efficient and hassle-free way. Customers didn't have to think about administration, security or maintenance in the way they had to with Hadoop.

One person’s big data is another person’s small data

Well, this is clearly a fact. Several organisations that adopted big data technologies without really gauging the amount of data they would actually need to process have suffered. Imagine sitting with 10TB Hadoop clusters when you don’t have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their heavy reliance on Hadoop.
Customers weren’t pleased with Hadoop’s limitations.

Apache Hadoop v Apache Spark

Hadoop is way behind in terms of processing speed. In 2014 Spark took the world by storm. I’m going to let you guess which line in the graph above might be Hadoop, and which might be Spark. Spark was a general purpose, easy to use platform that was built after studying the pitfalls of Hadoop. Spark was not bound to just HDFS (the Hadoop Distributed File System), which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes, a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas where Spark wins are iterative algorithms in machine learning, interactive data mining and data processing, stream processing, sensor data processing, etc.

Machine Learning in Hadoop is not straightforward

Unlike Spark, which ships with MLlib, Hadoop cannot do machine learning unless it is paired with a third-party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of 3 R packages, have grown for ML, but they are still nowhere near the power of the modern day MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe.

Hadoop is not only Hadoop

The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, HBase, Spark, Kudu, Impala, and 20 other products. While folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other.
As for Cloudera and Hortonworks, though the market has seen a downward trend, they’re in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark.

Is Hadoop dying? Perhaps not...

In the long run, it’s not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much improved version of the framework. Some of the most noteworthy features are its improved shell script, more powerful YARN, improved fault tolerance with erasure coding, and many more. Although that hasn’t caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use an alternative like Spark along with another framework from the Hadoop family. So, Hadoop’s not going away anytime soon.
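For readers who have not met MapReduce, the processing model at the heart of the Hadoop family, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is an illustration only: the function names are made up for this example, it does not use any Hadoop API, and a real cluster would run the map and reduce steps on the machines that hold the data.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Toy input, invented for illustration.
docs = ["Hadoop stores data", "Spark processes data", "hadoop and spark"]
counts = word_count(docs)
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, both stages can be spread across many machines, which is what "bringing compute to the data" amounts to in practice.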


Guest Contributor

20 Dec 2017

8 min read

Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions


[box type="info" align="" class="" width=""]We bring you another guest post by Benjamin Rogojan, this time on using logistic regression to help the healthcare sector reduce patient readmissions. Ben's previous post on ensemble methods to optimize machine learning models is also available for a quick read here.[/box]

ER visits are not cheap for any party involved, whether that is the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are due to a lack of intervention for problems such as substance abuse, chronic diseases and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by contributing to the overcrowding of Emergency Departments (EDs).

Research teams at UW and other universities are partnering with companies like KenSci to figure out how to reduce readmission rates. The ability to predict the likelihood of a patient’s readmission allows for targeted intervention, which in turn helps reduce the frequency of readmissions, making the population healthier and hopefully reducing the estimated 41.3 billion USD in healthcare costs for the entire system.

How do they plan to do it? With big data and statistics, of course. A plethora of algorithms are available for data scientists to use to approach this problem. Many possible variables could affect readmission and medical costs. There are also many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have been heavily focused on reducing the readmission rate simply by trying to calculate whether a person would or would not be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have random flare-ups that require immediate attention.
Being able to predict whether a patient will have an ER visit can lead to managing the cause more effectively. One approach taken by the data science team at UW, as well as the Department of Family and Community Medicine at the University of Toronto, was to use logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such, logistic regression has been a useful model in my experience for approaching this problem.

Logistic Regression to predict patient readmissions

Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other algorithms?

Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, etc. It is used for analyzing data that produces a binary classification based on one or more independent variables. This means it produces two clear classifications (Yes or No, 1 or 0, etc). With the example above, the binary classification would be: is the patient readmitted or not? Other examples could be whether to give a customer a loan or not, whether a medical claim is fraudulent or not, or whether a patient has diabetes or not.

Despite its name, logistic regression does not produce the same kind of output as linear regression. There are some similarities: for instance, the model is linear in its inputs, as you might notice in the equation below, which looks very similar to a linear equation. But the final output is based on the log odds.

log(π / (1 − π)) = β0 + x · β

Linear regression and multivariate regression both take one or more independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person’s age or the price an e-commerce site should display to each customer. The output is not limited to only a few discrete classifications.
Logistic regression, by contrast, produces discrete classifications. For instance, an algorithm using logistic regression could be used to classify whether a certain stock price would be above or below $50 a share, whereas linear regression would be used to predict whether a share would be worth $50.01, $50.02, etc.

Logistic regression is a calculation that uses the odds of a certain classification. In the equation above, the symbol you might know as pi (π) actually represents the probability of the outcome, and the ratio π / (1 − π) is the odds. To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear classifier: the decision boundary is where β0 + x · β = 0, which corresponds to p = 0.5, so whenever β0 + x · β is negative, p < 0.5 and we predict Y = 0. By generating coefficients that predict the logit transformation, the method lets us classify for the characteristic of interest.

Now that is a lot of complex math mumbo jumbo. Let’s try to break it down into simpler terms.

Probability vs. Odds

Let’s start with probability. Let’s say a patient has a probability of 0.6 of being readmitted. Then the probability that the patient won’t be readmitted is 0.4. Now, we want to convert this into odds. This is what the formula above is doing. You take 0.6/0.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability was 0.5 for both being readmitted and not being readmitted, then the odds would be 1:1.

The next step in the logistic regression model is to take the odds and get the “log odds”. You do this by putting the 1.5 into the log portion of the equation. With the natural log, which is what logistic regression uses, you get about 0.41 (a base-10 log would give about 0.18).

In logistic regression, we don’t actually know p. That is essentially what we are trying to find and model using various coefficients and input variables. Each input provides a value that changes how much more or less likely an event is to occur. All of these coefficients are used to calculate the log odds.
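The probability, odds and log-odds steps above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names are chosen for this example, and the natural logarithm is assumed because that is the convention in logistic regression.

```python
import math


def odds_from_probability(p):
    """Odds in favour of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural-log odds (the logit) of probability p."""
    return math.log(odds_from_probability(p))


def probability_from_log_odds(z):
    """Inverse of the logit: the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


p_readmit = 0.6
odds = odds_from_probability(p_readmit)       # 0.6 / 0.4
logit = log_odds(p_readmit)                   # ln(1.5)
recovered = probability_from_log_odds(logit)  # back to the probability

print(f"odds      = {odds:.2f}")
print(f"log odds  = {logit:.3f}")
print(f"recovered = {recovered:.2f}")
```

The last function is the reason the model is useful in practice: once the coefficients give you a log-odds value, the logistic function turns it back into a probability between 0 and 1.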
This model can take multiple variables like age, sex, height, etc. and specify how much of an effect each variable has on the odds that an event will occur.

Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer to translating it into action? Some of us like to say the “computers” are the easy part. Personally, I find the hard part to be the “people”. After all, at the end of the day, it comes down to business value. Will an algorithm save money or not? That means it has to be applied in real life. This could take the form of a new initiative, strategy, product recommendation, etc. You need to find the outliers that are worth going after!

For instance, let’s go back to the patient readmission example. The algorithm points out patients with high probabilities of being readmitted. However, if the readmission costs are low, they will probably be ignored, sadly. That is how businesses (including hospitals) look at problems.

Logistic regression is a great tool for binary classification. It is unlike many other algorithms that estimate continuous variables or estimate distributions. This statistical method can be used to classify whether a person is likely to get cancer because of environmental variables like proximity to a highway, smoking habits, etc. It has been used successfully in the medical, financial and insurance industries for a while.

Knowing when to use which algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees. Using logistic regression gives healthcare institutions the opportunity to accurately target at-risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals.
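To make the role of the coefficients concrete, here is a small sketch of how a fitted logistic regression would score a patient. Everything specific in it is an assumption for illustration: the feature names, the intercept and the coefficient values are invented, not taken from the UW work or from any real dataset.

```python
import math

# Hypothetical values, invented for illustration only (not fitted to data).
INTERCEPT = -3.0
COEFFICIENTS = {
    "age_decades": 0.30,            # per 10 years of age
    "prior_er_visits": 0.80,        # per ER visit in the past year
    "has_chronic_condition": 0.50,  # 1 if yes, 0 if no
}


def readmission_probability(patient):
    """Compute the log odds as beta0 + x . beta, then map them to a probability."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


def predict_readmitted(patient, threshold=0.5):
    """Classify as readmitted (True) when the probability reaches the threshold."""
    return readmission_probability(patient) >= threshold


def odds_ratios():
    """exp(beta): the factor by which the odds change per unit increase in a variable."""
    return {name: math.exp(beta) for name, beta in COEFFICIENTS.items()}


low_risk = {"age_decades": 3, "prior_er_visits": 0, "has_chronic_condition": 0}
high_risk = {"age_decades": 7, "prior_er_visits": 2, "has_chronic_condition": 1}

for label, patient in (("low risk", low_risk), ("high risk", high_risk)):
    p = readmission_probability(patient)
    print(f"{label}: p = {p:.3f}, readmitted = {predict_readmitted(patient)}")
```

Reading the coefficients as odds ratios is what lets a hospital say, for example, that each additional prior ER visit multiplies the odds of readmission by a given factor, which is often easier to act on than the raw probability.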
[box type="shadow" align="" class="" width=""] About the Author Benjamin Rogojan Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems both solo as well as with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/box]


Aaron Lazar

13 Feb 2018

8 min read

How to create a strong data science project portfolio that lands you a job


Okay, you’re probably here because you’ve got just a few months to graduate and the projects section of your resume is blank. Or you’re just an inquisitive little nerd scraping the WWW for ways to crack that dream job. Either way, you’re not alone: there are ten thousand others trying to build a great Data Science portfolio to land them a good job. Look no further, we’ll try our best to help you make a portfolio that catches the recruiter’s eye! David “Trent” Salazar’s portfolio is a great example of a well-rounded one, and Sajal Sharma’s is a good example of how one can display a Data Science portfolio on a platform like GitHub.

Companies are on the lookout for employees who can add value to the business. To showcase this on your resume effectively, the first step is to understand the different ways in which you can add value.

4 things you need to show in a data science portfolio

Data science can be broken down into 4 broad areas:

- Obtaining insights from data and presenting them to the business leaders
- Designing an application that directly benefits the customer
- Designing an application or system that directly benefits other teams in the organisation
- Sharing expertise on data science with other teams

You’ll need to ensure that your portfolio covers all or at least most of the above in order to make it through a job selection easily. So let’s see what we can do to make a great portfolio.

Demonstrate that you know what you're doing

The idea is to show the recruiter that you’re capable of performing the critical aspects of Data Science, i.e. importing a data set, cleaning the data, extracting useful information from it using various techniques, and finally visualising the findings and communicating them. Apart from the technical skills, there are a few soft skills that are expected as well: for instance, the ability to communicate and collaborate with others, and the ability to reason and take the initiative when required.
If your project is actually able to communicate these things, you’re in!

Stay focused and be specific

You might know a lot, but rather than throwing all your skills, projects and knowledge in the employer’s face, it’s always better to focus on doing one thing and doing it right. Just as you keep things short and sweet in your resume, you can do the same while building your portfolio. Always remember, the interviewer is looking for specific skills.

Research the data science job market

Find 5-6 jobs that interest you, probably from LinkedIn or Indeed, and go through their descriptions thoroughly. Understand what kind of skills the employer is looking for. For example, it could be classification, machine learning, statistical modeling or regression. Pick up the tools that are required for the job - for example, Python, R, TensorFlow, Hadoop, or whatever might get the job done. If you don’t know how to use a tool, you’ll want to skill up as you work your way through the projects. Also, identify the kind of data they would like you to work on, such as text or numerical data. Once you have this information at hand, start building your project around these skills and tools.

Be a problem solver

Projects that don’t solve actual ‘problems’ won’t stand out in your portfolio. The closer your projects are to the real world, the easier it will be for the recruiter to decide to choose you. This will also showcase your analytical skills and how you’ve applied data science to solve a prevailing problem.

Put at least 3 diverse projects in your data science portfolio

A nice way to create a portfolio is to list 3 good projects that are diverse in nature. Here are some interesting projects to get you started on your portfolio:

Data Cleaning and wrangling

Data Cleaning is one of the most critical tasks that a data scientist performs.
By taking a group of diverse data sets, consolidating them and making sense of them, you’re giving the recruiter confidence that you know how to prep data for analysis. For example, you can take Twitter or WhatsApp data and clean it for analysis. The process is pretty simple: you first find a “dirty” data set, then spot an interesting angle to approach the data from, clean it up and perform analysis on it, and finally present your findings.

Data storytelling

Storytelling showcases not only your ability to draw insight from raw data, but also how well you’re able to convey the insights to others and persuade them. For example, you can use data from the bus system in your country and gather insights to identify which stops incur the most delays, which could then be fixed by changing the routes. Make sure your analysis is descriptive and your code and logic can be followed.

Here’s what you do: first you find a good dataset, then you explore the data and spot correlations in it. Then you visualize it before you start writing up your narrative. Tackle the data from various angles and pick the most interesting one. If it’s interesting to you, it will most probably be interesting to anyone else who’s reviewing it. Break down and explain each step and each code snippet in detail, as if you were describing it to a friend. The idea is to teach the reviewer something new as you run through the analysis.

End to end data science

If you’re more into Machine Learning or algorithm writing, you should do an end-to-end data science project. The project should be capable of taking in data, processing it and finally learning from it, every step of the way. For example, you can pick up fuel pricing data for your city or maybe stock market data. The data needs to be dynamic and updated regularly. The trick for this one is to keep the code simple so that it’s easy to set up and run. You first need to identify a good topic.
Understand that you will not be working with a single ready-made dataset here; rather, you will need to import and parse all the data and bring it together into a single dataset yourself. Next, get the training and test data ready to make predictions. Document your code and other findings and you’re good to go.

Prove you have the data science skill set

If you want to get that job, you’ve got to have the appropriate tools to get the job done. Here’s a list of some of the most popular tools to skill up on:

Data science languages

There are a number of key languages that are essential in data science. It might seem obvious, but making sure they're on your resume and demonstrated in your portfolio is incredibly important. Include things like:

- Python
- R
- Java
- Scala
- SQL

Big Data tools

If you're applying for big data roles, demonstrating your experience with the key technologies is a must. It not only proves you have the skills, but also shows that you have an awareness of what tools can be used to build a big data solution or project. You'll need:

- Hadoop
- Spark
- Hive

Machine learning frameworks

With machine learning so in demand, if you can prove you've used a number of machine learning frameworks, you've already done a lot to impress. Remember, many organizations won't actually know as much about machine learning as you think. In fact, they might even be hiring you with a view to building out this capability. Remember to include:

- TensorFlow
- Caffe2
- Keras
- PyTorch

Data visualisation tools

Data visualization is a crucial component of any data science project. If you can visualize and communicate data effectively, you're immediately demonstrating that you're able to collaborate with others and make your insights accessible and useful to the wider business. Include tools like these in your resume and portfolio:

- D3.js
- Excel charts
- Tableau
- ggplot2

So there you have it. You know what to do to build a decent data science portfolio.
It’s also really worth taking part in competitions and challenges. They will not only help you stay up to date and keep your skills well oiled, but also give you a broader picture of what people are actually working on and which tools they use to solve problems.
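As a starting point for the data cleaning project idea described earlier (find a "dirty" data set, clean it, analyse it, present the findings), here is a minimal sketch using only Python's standard library. The chat-style records, column names and cleaning rules are all made up for this example; a real project would read an exported file and would likely use a library such as pandas.

```python
import csv
import io
from collections import Counter

# A tiny "dirty" export, invented for illustration: stray spaces,
# inconsistent capitalisation and missing fields.
RAW = """user,message,sent_at
 Alice ,Hello there!,2018-02-01
bob,,2018-02-01
ALICE,See you soon,2018-02-02
Bob,Running late,
carol,Hi,2018-02-03
"""


def clean_rows(raw_text):
    """Yield normalised rows, skipping records with any missing field."""
    for row in csv.DictReader(io.StringIO(raw_text)):
        user = (row["user"] or "").strip().lower()
        message = (row["message"] or "").strip()
        sent_at = (row["sent_at"] or "").strip()
        if not (user and message and sent_at):
            continue  # drop incomplete records
        yield {"user": user, "message": message, "sent_at": sent_at}


cleaned = list(clean_rows(RAW))

# A first, very simple finding to present: messages per user.
messages_per_user = Counter(row["user"] for row in cleaned)
print(messages_per_user.most_common())
```

In a portfolio write-up, it helps to state each cleaning rule (here: trim whitespace, lowercase user names, drop incomplete rows) and how many records it affected, so the reviewer can follow your reasoning.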


Amarabha Banerjee

21 Dec 2017

7 min read

Two popular Data Analytics methodologies every data professional should know: TDSP & CRISP-DM


Akram Hussain

31 Oct 2014

3 min read

Python Data Stack


The Python programming language has grown significantly in popularity and importance, both as a general programming language and as one of the most advanced providers of data science tools. There are 6 key libraries every Python analyst should be aware of, and they are:

1 - NumPy

Also known as Numerical Python, NumPy is an open source Python library used for scientific computing. NumPy gives both speed and higher productivity using arrays and matrices. This basically means it's super useful when analyzing basic mathematical data and calculations. This was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPy is that it takes care of all your mathematical problems with useful functions that are cleaner and faster to write than normal Python code. Much of its speed comes from the fact that its core routines are implemented in C.

2 - SciPy

Also known as Scientific Python, SciPy is built on top of NumPy and takes scientific computing to another level. It extends NumPy and gives users tools such as differential equation solvers, special functions, optimizers, and integration routines. SciPy can be viewed as a library that saves time and has predefined complex algorithms that are fast and efficient. However, there is a plethora of SciPy tools, which might confuse users more than help them.

3 - Pandas

Pandas is a key data manipulation and analysis library in Python. Its strengths lie in its ability to provide rich data functions that work amazingly well with structured data. There have been a lot of comparisons between pandas and R packages due to their similarities in data analysis, but the general consensus is that it is very easy for anyone using R to migrate to pandas, as it supposedly combines the best features of R and Python programming all in one.
4 - Matplotlib

Matplotlib is a visualization powerhouse for Python programming, and it offers a large library of customizable tools to help visualize complex datasets. Providing appealing visuals is vital in the fields of research and data analysis. This 2D plotting library is used to produce plots and make them interactive with just a few lines of code. It also offers a range of graphs including histograms, bar charts, error charts, scatter plots, and much more.

5 - scikit-learn

scikit-learn is Python's most comprehensive machine learning library and is built on top of NumPy and SciPy. One of its advantages is its all-in-one approach: it bundles tools for a wide range of machine learning tasks, including supervised and unsupervised learning.

6 - IPython

IPython makes life easier for Python developers working with data. It is an interactive environment, including a web-based notebook, for exploring data with Python programs and equations. The ultimate goal behind IPython is improved efficiency: scientific computation and data analysis can happen side by side, using multiple third-party libraries.

Fatema Patrawala

21 Nov 2019

11 min read

What does a data science team look like?

Until a couple of years ago, people barely knew the term 'data science', which has since evolved into an extremely popular career field. The Harvard Business Review dubbed data scientist the sexiest job of the 21st century, and expert professionals jumped on the 'data is the new oil' bandwagon. According to the Figure Eight Report 2018, which takes the pulse of the data science community in the US, the data science field has changed rapidly over the years. For the 2018 report, they surveyed approximately 240 data scientists and found that machine learning projects have multiplied and that more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs, and the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, data science teams inevitably evolve and differ from one organization to another.

The data science team is responsible for delivering complex projects in which system analysis, software engineering, data engineering, and data science are used to deliver the final solution. To achieve all of this, the team includes not only a data scientist or a data analyst but also other roles such as business analyst, data engineer or architect, and chief data officer. In this post, we differentiate and discuss the various job roles within a data science team, the skill sets required, and the compensation for each of them.

For an in-depth understanding of data science teams, read the book Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps.
Now let's get into the individual data science roles and functions, but first we take a look at the structure of the team. There are three basic team structures to match different stages of AI/ML adoption:

IT centric team structure

At times, hiring a data science team is not an option for a company, and it has to leverage in-house talent. In such situations, it takes advantage of a fully functional in-house IT department. The IT team manages functions like data preparation, model training, user interface creation, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS (machine learning as a service) solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy them. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold. The documentation helps with fast training and early deployment of models even without an expert data scientist on board.

Integrated team structure

Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resources is the most viable option for constant and scalable machine learning operations. Unlike the IT centric approach, the integrated method requires an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques. Additionally, the team leverages a deeper understanding of machine learning tools and libraries, like TensorFlow or Theano, which are aimed at researchers and data science experts.
Specialized data science team

Companies can also have an independent data science department to build all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. This doesn't necessarily mean that all team members need a data science background, but they should have a technology background with certain service management skills. A specialized structure helps in addressing complex data science tasks that include research, the use of multiple ML models tailored to various aspects of decision-making, or multiple ML-backed services. Today's most successful Silicon Valley tech companies operate with specialized data science teams. These teams are also custom-built and wired for specific tasks to achieve different business goals. For example, the team structure at Airbnb is one of the most interesting use cases. In a talk, Martin Daniel, a data scientist at Airbnb, explains how the team emphasizes an experimentation-centric culture and applies machine learning rigorously to address unique product challenges.

Job roles and responsibilities within a data science team

As discussed earlier, there are many roles within a data science team. According to Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis. Type A data scientists are statisticians who make sense of data without necessarily having strong programming knowledge. They perform data cleaning, forecasting, modeling, visualization, etc. Type B stands for building. These individuals use data in production. They are good software engineers with strong programming knowledge and a statistics background. They build recommendation systems, personalization use cases, etc. It is rare, though, that one expert will fit neatly into a single category.
But understanding these data science functions can help make sense of the roles described below.

Chief data officer/Chief analytics officer

The chief data officer (CDO) role has been taking organizations by storm. NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization had appointed a chief data officer. The role involves overseeing a range of data-related functions that may include data management, ensuring data quality, and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, they are often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and decide how a company should treat its data. This makes sense, as analytics provides insights and adds value to the data. Hence, with a CDO+CAO combination, companies can take advantage of a good data strategy and proper data management without losing out on quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually.

Skill sets required: Data science and analytics, programming skills, domain expertise, leadership and visionary abilities.

Data analyst

The data analyst role covers data collection and interpretation activities. The person in this role ensures that collected data is relevant and exhaustive, while also interpreting the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States.
Skill sets required: Programming languages like R, Python, JavaScript, C/C++, and SQL. Critical thinking, data visualization and presentation skills are also good to have.

Data scientist

Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore which problems need to be solved. A data scientist develops machine learning models to make predictions and is well versed in algorithm development and computer science. This person also knows the complete lifecycle of model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data, and using various types of analytics and reporting tools to detect patterns, trends and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709.

Skill sets required: A data scientist requires knowledge of big data platforms and tools like Seahorse powered by Apache Spark, JupyterLab, TensorFlow and MapReduce; programming languages including SQL, Python, Scala and Perl; and statistical computing languages such as R. They should also have cloud computing skills and knowledge of cloud platforms like AWS and Microsoft Azure. To know more, you can also read the post on how to ace a data science interview.

Machine learning engineer

Data scientists are at times confused with machine learning engineers, but machine learning engineer is a distinct role that involves different responsibilities. A machine learning engineer combines software engineering and machine learning modeling skills. This person determines which model to use and what data should be used for each model. Probability and statistics are also their forte.
Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer's salary is $146,085 in the US, and the role is ranked No. 1 on Indeed's Best Jobs in 2019 list.

Skill sets required: Machine learning engineers need expertise in computer science and programming languages like R, Python, Scala, and Java. They also need knowledge of probability, data modelling and evaluation techniques.

Data architects and data engineers

Data architects and data engineers work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer can use to build a digital framework. The data engineering role has recently evolved from the traditional software engineering field. Recent enterprise data management experiments indicate that data-focused software engineers are needed to work alongside data architects to build a strong data architecture. The average salary for a data architect in the US ranges from $122,000 to $129,000 annually, as per a recent LinkedIn survey.

Skill sets required: A data architect or engineer should have a keen interest and experience in programming languages and frameworks such as HTML5, RESTful services, Spark, Python, Hive, Kafka, and CSS. They should have the knowledge and experience to handle database and data processing technologies such as PostgreSQL, MapReduce and MongoDB, and visualization platforms such as Tableau and Spotfire.

Business analyst

A business analyst (BA) basically handles the chief analytics officer's role, but at the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap.
They are responsible for using data analytics to assess processes, determine requirements and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes will be implemented in processes, products, services, software and hardware. They further articulate these ideas and balance them against what is technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed.

Skill sets required: Excellent domain and industry expertise. Good communication and data visualization skills, along with knowledge of business intelligence tools, are also good to have.

Data visualization engineer

This role is not present in every data science team, as some of its responsibilities are covered by either a data analyst or a data architect. Hence, this role is only necessary in the specialized data science team structure. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for your stakeholders. Regardless of the technology, successful data visualization engineers have to understand principles of design, both graphical and, more generally, user-centered design. As per PayScale, the average salary for a data visualization engineer is $98,264.

Skill sets required: A data visualization engineer needs to have rigorous knowledge of data visualization methods and be able to produce various charts and graphs to represent data. Additionally, they must understand the fundamentals of design principles and the visual display of information.

To sum it up, the data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building up the team from scratch and find it hard to figure out where to start.
If you are facing a similar dilemma, check out this book, Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing and sustaining a successful data science team.

How to learn data science: from data mining to machine learning
How to ace a data science interview
Data science vs. machine learning: understanding the difference and what it means today
30 common data science terms explained
9 Data Science Myths Debunked

Amey Varangaonkar

12 Feb 2018

10 min read

15 Useful Python Libraries to make your Data Science tasks Easier

Savia Lobo

20 Nov 2017

6 min read

Looking at the different types of Lookup cache

Amey Varangaonkar

21 Dec 2017

7 min read

How do Data Structures and Data Models differ?

Sunith Shetty

20 Dec 2017

10 min read

When, why and how to use Graph analytics for your big data

Sugandha Lahoti

20 Oct 2017

8 min read

Top 4 chatbot development frameworks for developers

The rise of the bots is nigh! If you can imagine a situation involving a dialog, there is probably a chatbot for that. Just look at the chatbot market: text-based email/SMS bots, voice-based bots, bots for customer support, transaction-based bots, entertainment bots and many others. A large number of enterprises, from startups to established organizations, are seeking to invest in this sector. This has also led to an increase in the number of platforms used for chatbot building. These frameworks incorporate AI techniques along with natural language processing capabilities to assist developers in building and deploying chatbots. Let's start with how a chatbot typically works before diving into some of the frameworks.

Understand: The first step for any chatbot is to understand the user input. This is made possible using pattern matching and intent classification techniques. 'Intents' are the tasks that users might want to perform with a chatbot. Machine learning, NLP and speech recognition techniques are typically used to identify the intent of the message and extract named entities. Entities are the specific pieces of information extracted from the user's response, that is, the content associated with an intent.

Respond: After understanding, the next goal is to generate a response. This is based on the current input message and the context of the conversation. After specifying the intents and entities, a dialog flow is constructed. This is basically the set of replies expected from the chatbot.

Learn: Chatbots use AI techniques such as natural language understanding and pattern recognition to store and distinguish between the contexts of the information provided, and to elicit a suitable response in future replies. This is important because different requests might have different meanings depending on previous requests.
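The understand and respond steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the intents, keywords, order-number format and replies are all hypothetical, and simple keyword overlap stands in for the machine learning and NLP techniques that real frameworks use.

```python
import re

# Hypothetical intents, each defined by a few trigger keywords.
INTENT_KEYWORDS = {
    "track_order": {"track", "order", "package", "delivery"},
    "opening_hours": {"open", "hours", "close"},
}

# Hypothetical entity: an order number such as "AB-1234".
ORDER_ID = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def understand(message):
    """Return (intent, entities) for a user message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    # Score each intent by keyword overlap; fall back when nothing matches.
    scores = {name: len(words & kws) for name, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    intent = best if scores[best] > 0 else "fallback"
    entities = {}
    match = ORDER_ID.search(message)
    if match:
        entities["order_id"] = match.group()
    return intent, entities


def respond(message, context):
    """Pick a reply from the current message plus the conversation context."""
    intent, entities = understand(message)
    context.update(entities)  # remember entities for later turns
    # A bare order number only makes sense if we asked for one earlier.
    if (intent == "fallback" and "order_id" in entities
            and context.get("pending") == "track_order"):
        intent = "track_order"
    if intent == "track_order":
        if "order_id" in context:
            context.pop("pending", None)
            return f"Order {context['order_id']} is on its way."
        context["pending"] = "track_order"
        return "Sure, what is your order number?"
    if intent == "opening_hours":
        return "We are open 9am to 5pm, Monday to Friday."
    return "Sorry, I did not understand that."
```

Calling `respond("Can you track my package?", context)` and then `respond("It is AB-1234", context)` with the same `context` dictionary shows why context matters: the second message matches no intent on its own and is only interpreted correctly because of the earlier request.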
Top chatbot development frameworks

A bot development framework is a set of predefined classes, functions, and utilities that a developer can use to build chatbots more easily and quickly. Frameworks vary in their level of complexity, integration capabilities, and functionality. Let us look at some of the development platforms used for chatbot building.

API.AI

API.AI, a code-based framework with a simple web-based interface, allows users to build engaging voice and text-based conversational apps using a large number of libraries and SDKs, including Android, iOS, Webkit HTML5, Node.js, and a Python API. It also supports nearly 32 one-click platform integrations such as Google, Facebook Messenger, Twitter and Skype, to name a few. API.AI makes use of an agent, a container that transforms natural-language user requests into actionable data. The software tries to find the intent behind a user's reply and matches it to the closest defined intent, or to the default intent if none fits. After intent matching, it executes the actions and responses the developer has defined for that intent. API.AI also makes use of entities. Once the intents and entities are specified, the bot is trained. API.AI's training module tracks each user request and lets developers see how it is parsed and matched to an intent. It also allows developers to correct errors and make changes, thus retraining the bot. API.AI streamlines the entire bot-creating process by letting developers provide the domain-specific knowledge that is unique to a bot's needs, while it handles speech recognition, intent and context management in the backend. Google acquired API.AI in 2016, and the platform has since been renamed Dialogflow.

Microsoft Bot Framework

Microsoft Bot Framework allows building and deployment of chatbots across multiple platforms and services such as web, SMS, non-Microsoft platforms, Office 365, Skype etc. The Bot Framework includes two components: the Bot Builder and the Microsoft Cognitive Services.
The Bot Builder comprises two full-featured SDKs, for the .NET and the Node.js platforms, along with an emulator for testing and debugging. There is also a set of RESTful APIs for building bots in other languages. The SDKs support features for simple and easy interactions between bots. They also come with a large collection of prebuilt sample bots for the developer to choose from. The Microsoft Cognitive Services is a collection of intelligent APIs that simplify a variety of AI tasks, such as allowing the system to understand and interpret the user's needs using natural language in just a few lines of code. These APIs integrate with most modern languages and platforms and constantly improve, learn, and get smarter.

Microsoft created the AI Inner Circle Partner Program to work hand in hand with industry to create AI solutions. Their only partner in the UK is ICS.AI, who build conversational AI solutions for the UK's public sector. ICS are the first choice for many organisations due to their smart solutions that scale and serve to improve services for the general public.

Developers can build bots with the Bot Builder SDK using C# or Node.js. They can then add AI capabilities with Cognitive Services. Finally, they can register the bots on the developer portal, connect them to users across platforms such as Facebook and Microsoft Teams, and deploy them on a cloud such as Microsoft Azure. For a step-by-step guide to chatbot building using Microsoft Bot Framework, you can refer to one of our books on the topic. Sabre Corporation, a customer service provider for travel agencies, has recently announced the development of an AI-powered chatbot that leverages Microsoft Bot Framework and Microsoft Cognitive Services.

Watson Conversation

IBM's Watson Conversation helps build chatbot solutions that understand natural-language input and use machine learning to respond to customers in a way that simulates conversations between humans.
It is built on a neural network of one million Wikipedia words. It offers deployment across a variety of platforms including mobile devices, messaging platforms, and robots. The platform is robust and secure, as IBM allows users to opt out of data sharing. The IBM Watson Tone Analyzer service can help bots understand the tone of the user's input for better management of the experience.

The basic steps to create a chatbot using Watson Conversation are as follows. We first create a workspace, a place for configuring information to maintain separate intents, user examples, entities, and dialogs for each application. One workspace corresponds to one bot. Next, we create intents. Watson Conversation makes use of multiple conditioned responses to distinguish between similar intents. For example, instead of building a specific intent for the location of each place, it creates a general intent "location" and adds an entity to capture the response, such as "location-bedroom" (to the right, near the stairs) or "location-kitchen" (to the left). The third step is entity establishment. This involves grouping entities that might trigger a similar response in the dialog. The dialog flow generated after specifying the intents and entities goes through testing and is then embedded into an application. It is then connected with other services using the Conversation API. Staples, an office supply retailing firm, uses Watson Conversation in their "Easy Systems" to simplify the customer's shopping experience.

CXP Designer and Aspect NLU

Aspect Customer Experience Platform (CXP) is an application lifecycle management tool to build text and voice-based applications such as chatbots. It provides deployment options across multiple communication channels like text, voice, mobile web and social media networks. The Aspect CXP typically includes a CXP designer to build chatbots and the inbuilt Aspect NLU to provide advanced natural language capabilities.
CXP designer works by creating dialog objects to provide a menu of options for the frontend as well as the backend. Menu items for the frontend are used to create intents and modules within those intents. The developer can then modify the labels of those intents and modules manually, or use the Aspect NLU to disambiguate similar questions for successful extraction of meaning and intent. The Aspect NLU includes tools for spelling correction, linguistic lexicons such as nouns and verbs, and options for detecting and extracting common data types such as dates, times, and numbers. It also allows developers to customize the meaning extraction if they want to. CXP designer also allows certain steps in a chatbot to be skipped. For instance, if the user has already provided the tracking id for a particular package, the chatbot will skip the prompt asking for the tracking id again. With Aspect CXP, developers can create and deploy complex chatbots. Radisson Blu Edwardian, a hotel in London, has collaborated with Aspect Software to build an SMS-based AI virtual host.

Conclusion

Another popular chatbot development platform worth mentioning is Facebook Messenger, with over 100,000 monthly active bots but without cross-platform deployment features. The above bot frameworks are typically used by developers to build chatbots from scratch and require some programming skills. However, there has been a rise in automated bot development tools of late. Some of these, such as Chatfuel and Motion AI, typically involve drag-and-drop functionality. With such tools, beginners and non-programmers can create and deploy chatbots within a few minutes. But they lack the extended functionality supported by typical code-based frameworks, such as the flexibility to store data, produce analytics or incorporate customized AI tasks. Every chatbot development system, whether framework or tool, serves a different purpose.
Choosing the right one depends on the type of application to build, organizational needs, and the developer’s expertise.
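The step-skipping behaviour described for CXP designer, where a bot does not ask again for a tracking id it already has, can be illustrated with a minimal slot-filling sketch. This is not Aspect's API: the slot names and prompts below are hypothetical, and the function only shows the general idea.

```python
# Hypothetical slots a parcel-tracking dialog needs, with the prompt for each.
REQUIRED_SLOTS = {
    "tracking_id": "What is your tracking id?",
    "postcode": "What is the delivery postcode?",
}


def next_prompt(filled_slots):
    """Return the prompt for the first missing slot, or None when all are filled."""
    for slot, prompt in REQUIRED_SLOTS.items():
        if not filled_slots.get(slot):
            return prompt
    return None
```

With an empty dictionary the bot starts by asking for the tracking id; once `filled_slots` contains a `tracking_id`, that prompt is skipped and the next missing slot is requested instead.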

Savia Lobo

08 Nov 2017

9 min read

8 Myths about RPA (Robotic Process Automation)

Many say we are on the cusp of the fourth industrial revolution, which promises to blur the lines between the real, virtual and biological worlds. Robotic Process Automation (RPA) is one of the buzzwords surrounding the hype of the fourth industrial revolution. Although poised to be a $6.7 trillion industry by 2025, RPA is shrouded in just as much fear as it is brimming with potential. We have heard time and again how automation can improve productivity, efficiency, and effectiveness while conducting business in transformative ways. We have also heard how automation, and machine-driven automation in particular, can displace humans and thereby lead to a dystopian world. As humans, we make assumptions based on what we see and understand. But sometimes those assumptions become so ingrained that they evolve into myths which many start accepting as facts. Here is a closer look at some of the myths surrounding RPA.

1. RPA means robots will automate processes

The term robot evokes in our minds a picture of a metal humanoid with stiff joints that speaks in a monotone. RPA does mean robotic process automation. But the robot doing the automation is nothing like the ones we are used to seeing in the movies. These are software robots that perform routine processes within organizations. They are often referred to as virtual workers or a digital workforce, complete with their own identity and credentials. They essentially consist of algorithms programmed by RPA developers with the aim of automating mundane business processes. These processes are repetitive, highly structured, fall within a well-defined workflow, consist of a finite set of tasks or steps, and may often be monotonous and labor intensive. Let us consider a real-world example: automating the invoice generation process. The RPA system will run through all the emails in the system and download the PDF files containing details of the relevant transactions.
Then, it would fill a spreadsheet with the details and maintain all the records therein. Later, it would log on to the enterprise system and generate appropriate invoice reports for each entry in the spreadsheet. Once the invoices are created, the system would send a confirmation mail to the relevant stakeholders. Here, the RPA user only specifies the individual tasks that are to be automated, and the system takes care of the rest of the process. So, yes, while it is true that RPA involves robots automating processes, it is a myth that these robots are physical entities or that they can automate all processes.

2. RPA is useful only in industries that rely heavily on software

"Almost anything that a human can do on a PC, the robot can take over without the need for IT department support." - Richard Bell, former Procurement Director at Averda

RPA is software that can be injected into a business process. Traditional industries such as banking and finance, healthcare, and manufacturing, which have significant routine tasks and depend on software for some of their functioning, can benefit from RPA. Loan processing and patient data processing are some examples. RPA, however, cannot help with automating the assembly line in a manufacturing unit or with performing regular tests on patients. Even in industries that maintain daily essential utilities such as cooking gas, electricity, and telephone services, RPA can be put to use for generating automated bills, invoices, meter readings etc. By adopting RPA, businesses, irrespective of the industry they belong to, can achieve significant cost savings, operational efficiency, and higher productivity. To leverage the benefits of RPA, it is more important that users have a clear understanding of business workflow processes and domain knowledge than of the software development life cycle (SDLC). Industry professionals can be easily trained on how to put RPA into practice.
The bottom line: RPA is not limited to industries that rely heavily on software to exist. But it is true that RPA can be used only in situations where some form of software is used to perform tasks manually.

3. RPA will replace humans in most frontline jobs

Many organizations employ a large workforce in frontline roles to do routine tasks such as data entry operations, managing processes, customer support, and IT support. But frontline jobs are just as diverse as the people performing them. Take sales reps, for example. They bring in new business through their expert understanding of the company's products and their potential customer base, coupled with the associated soft skills. Currently, they spend significant time on administrative tasks such as developing and finalizing business contracts, updating the CRM database, and making daily status reports. Imagine the spike in productivity if these tasks could be taken off the plates of sales reps and they could just focus on cultivating relationships and converting leads. By replacing human effort in mundane tasks within frontline roles, RPA can help employees focus on higher value-yielding tasks. In conclusion, RPA will not replace humans in most frontline jobs. It will, however, replace humans in a few roles that are very rule-based and narrow in scope, such as simple data entry operators or basic invoice processing executives. In most frontline roles like sales or customer support, RPA is quite likely to change, at least in some ways, how people see their job responsibilities. Also, the adoption of RPA will generate new job opportunities around the development, maintenance, and sale of RPA-based software.

4. Only large enterprises can afford to deploy RPA

The cost of implementing and maintaining the RPA software and training employees to use it can be quite high.
This can make it an unfavorable business proposition for SMBs with fairly simple organizational processes and cross-departmental considerations. On the other hand, large organizations with higher revenue generation capacity, complex business processes, and a large army of workers can deploy an RPA system to automate high-volume tasks quite easily and recover that cost within a few months. It is obvious that large enterprises will benefit from RPA systems due to the economies of scale offered and faster recovery of investments made. SMBs (small to medium-sized businesses) can also benefit from using RPA to automate their business processes, but only if they treat RPA as a strategic investment whose cost will be recovered over a longer period of, say, 2-4 years.

5. RPA adoption should be owned and driven by the organization's IT department

The RPA team handling the automation process need not be from the IT department. The main role of the IT department is to provide the resources the software needs to function smoothly. An RPA reliability team, which is trained in using RPA tools, consists of business operations professionals rather than IT professionals. In simple terms, RPA is owned not by the IT department but by the whole business, and is driven by the RPA team.

6. RPA is an AI virtual assistant specialized to do a narrow set of tasks

An RPA bot performs a narrow set of tasks based on the given data and instructions. It is a system of rule-based algorithms that can be used to capture, process, and interpret streams of data, trigger appropriate responses, and communicate with other processes. However, it cannot learn on its own - a key trait of an AI system. Advanced AI concepts such as reinforcement learning and deep learning are yet to be incorporated in robotic process automation systems. Thus, an RPA bot is not an AI virtual assistant like, for example, Apple’s Siri.
That said, it is not impractical to think that in the future, these systems will be able to think on their own, decide the best possible way to execute a business process, and learn from their own actions to improve the system.

7. To use the RPA software, one needs to have basic programming skills

Surprisingly, this is not true. Associates who use the RPA system need not have any programming knowledge. They only need to understand how the software works on the front end, and how they can assign tasks to the RPA worker for automation. On the other hand, RPA system developers do require some programming skills, such as knowledge of scripting languages. Today, there are various platforms for developing RPA tools, such as UiPath and Blue Prism, which let RPA developers build these systems with far less hassle and reduce their coding responsibilities even more.

8. RPA software is fully automated and does not require human supervision

This is a big myth. RPA is often misunderstood as a completely automated system. Humans are indeed required to program the RPA bots, to feed them tasks for automation, and to manage them. The automation factor here lies in aggregating and performing various tasks which would otherwise require more than one human to complete. There’s also the efficiency factor which comes into play - RPA systems are fast, and almost completely avoid the faults in the system or the process that are otherwise caused by human error. For such routine tasks, having a digital workforce in place can be far more profitable than recruiting a human workforce.

Conclusion

One of the most talked about areas in terms of technological innovations, RPA is clearly still in its early days and is surrounded by a lot of myths. However, there’s little doubt that its adoption will take off rapidly as RPA systems become more scalable, more accurate, and faster to deploy.
AI-, cognitive-, and analytics-driven RPA will take it up a notch or two, and help businesses improve their processes even more by taking dull, repetitive tasks away from people. Hype can get ahead of reality, as we've seen quite a few times - but RPA is an area definitely worth keeping an eye on despite all the hype.


Guest Contributor

23 Nov 2017

8 min read

Why you should learn Scikit-learn

Today, machine learning in Python has become almost synonymous with scikit-learn. The "Big Bang" moment for scikit-learn was in 2007, when a gentleman named David Cournapeau decided to write this project as part of Google Summer of Code 2007. Let's take a moment to thank him. Matthieu Brucher later came on board and developed it further as part of his thesis. From that point on, sklearn never looked back. In 2010, the prestigious French research organization INRIA took ownership of the project, with great developers like Gael Varoquaux, Alexandre Gramfort et al. starting work on it. Here's the oldest pull request I could find in sklearn’s repository. The title says "we're getting there"! From there to today, where sklearn receives funding and support from Google, Telecom ParisTech, and Columbia University among others, it surely must’ve been quite a journey.

Sklearn is an open source library which uses the BSD license. It is widely used in industry as well as in academia. It is built on NumPy, SciPy, and Matplotlib, while also having wrappers around various popular libraries such as LIBSVM. Sklearn can be used “out of the box” after installation.

Can I trust scikit-learn?

Scikit-learn, or sklearn, is a very active open source project with brilliant maintainers. It is used worldwide by top companies such as Spotify, booking.com, and the like. The fact that it is open source, where anyone can contribute, might make you question the integrity of the code, but from the little experience I have contributing to sklearn, let me tell you that only very high-quality code gets merged. All pull requests have to be approved by at least two core maintainers of the project. Every contribution goes through multiple iterations. While this can be time-consuming for all the parties involved, such regulations ensure sklearn’s compliance with the industry standard at all times. You don’t just build a library that’s been awarded the “best open source library” overnight!

How can I use scikit-learn?
Sklearn can be used for a wide variety of use-cases ranging from image classification to music recommendation to classical data modeling.

Scikit-learn in various industries:

In the image classification domain, sklearn’s implementation of K-Means along with PCA has been used very successfully in the past to group handwritten digits. Sklearn has also been used for face recognition using SVM with PCA. Image segmentation tasks, such as detecting red blood corpuscles or segmenting the popular Lena image into sections, can be done using sklearn.

A lot of us here use Spotify or Netflix and are awestruck by their recommendations. Recommendation engines started off with the collaborative filtering algorithm. It basically says “if people like me like something, I’ll also most probably like that.” To find users with similar tastes, a KNN algorithm can be used, which is available in sklearn. You can find a good demonstration of how it is used for music recommendation here.

Classical data modeling can be bolstered using sklearn. Most people generally start their Kaggle competitive journeys with the Titanic challenge. One of the better tutorials out there on starting out is by Dataquest, and it generally acts as a good introduction on how to use pandas and sklearn (a lethal combination!) for data science. It uses the robust Logistic Regression, Random Forest, and ensembling modules to guide the user. You will be able to experience the user-friendliness of sklearn first hand while completing this tutorial. Sklearn has made machine learning literally a matter of importing a package.

Sklearn also helps in anomaly detection for highly imbalanced datasets (99.9% to 0.1% in credit card fraud detection) through a host of tools like EllipticEnvelope and OneClassSVM. In this regard, the recently merged IsolationForest algorithm works especially well on higher-dimensional datasets and has very high performance.
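The anomaly detection workflow described above can be sketched in a few lines. This is only a minimal illustration on synthetic data, not a fraud model: the generated data, the 1% contamination value, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic, illustrative data: 990 "normal" rows plus 10 far-away outliers.
rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(990, 5))
outliers = rng.normal(loc=8.0, scale=1.0, size=(10, 5))
X = np.vstack([normal, outliers])

# contamination is our assumed share of anomalies in the data (1% here).
detector = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = detector.fit_predict(X)        # +1 = inlier, -1 = anomaly
scores = detector.decision_function(X)  # lower score = more anomalous

flagged = np.where(labels == -1)[0]
print("Number of rows flagged as anomalies:", len(flagged))
```

On real data the contamination value is rarely known in advance, so it is usually treated as a parameter to tune rather than a fact about the dataset.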
Other than that, sklearn has implementations of some widely used algorithms such as linear regression, decision trees, SVM, and Multi-Layer Perceptrons (neural networks), to name a few. It has around 39 models in the “linear models” module itself! Happy scrolling here! Most of these algorithms can run very fast compared to raw Python code, since they are implemented in Cython and use NumPy and SciPy (which in turn use C) for low-level computations.

How is sklearn different from TensorFlow/MLlib?

TensorFlow is a popular library for implementing deep learning algorithms (since it can utilize GPUs). But while it can also be used to implement machine learning algorithms, the process can be arduous. For implementing logistic regression in TensorFlow, you will first have to “build” the logistic regression algorithm using a computational graph approach. Scikit-learn, on the other hand, provides the same algorithm out of the box, with the limitation that the computation has to be done in memory. Here's a good example of how logistic regression is done in TensorFlow.

Apache Spark’s MLlib, on the other hand, consists of algorithms which can be used out of the box just like in sklearn; however, it is generally used when the ML task is to be performed in a distributed setting. If your dataset fits into RAM, sklearn would be a better choice for the task. If the dataset is massive, most people generally prototype locally on a small subset of the dataset using sklearn. Once prototyping and experimentation are done, they deploy in the cluster using MLlib.

Some sklearn must-knows

Machine learning problems are commonly grouped into three kinds: supervised learning, unsupervised learning, and reinforcement learning (ahem, AlphaGo). Scikit-learn covers the first two; it does not ship reinforcement learning algorithms.

Unsupervised learning happens when one doesn’t have ‘y’ labels in their dataset. Dimensionality reduction and clustering are typical examples.
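As a rough sketch of those two unsupervised tasks, the snippet below reduces scikit-learn's bundled handwritten digits dataset with PCA and then clusters the result with K-Means. The choice of 10 components and 10 clusters, as well as the variable names, are assumptions made for illustration; the labels are loaded but never used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# The labels are discarded on purpose: this is the unsupervised setting.
X, _ = load_digits(return_X_y=True)  # 1797 images, 64 pixel features each

# Dimensionality reduction: project the 64 features onto 10 components.
pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)

# Clustering: group the reduced points into 10 clusters.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_reduced)

variance_kept = pca.explained_variance_ratio_.sum()
print("Reduced shape:", X_reduced.shape)
print("Fraction of variance retained:", variance_kept)
```

The cluster ids are arbitrary integers, so they will not line up with the actual digit values without an extra mapping step.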
Scikit-learn has implementations of variations of Principal Component Analysis such as SparsePCA, KernelPCA, and IncrementalPCA, among others.

Supervised learning covers problems such as spam detection, rent prediction, etc. In these problems, the ‘y’ label for the dataset is present. Models such as linear regression, random forest, AdaBoost, etc. are implemented in sklearn.

    from sklearn.linear_model import LogisticRegression

    # train_X, train_y and test_X are assumed to be already-prepared arrays
    clf = LogisticRegression().fit(train_X, train_y)
    preds = clf.predict(test_X)

Model evaluation and analysis

Cross-validation, grid search for parameter selection, and prediction evaluation can be done using the model_selection and metrics modules, which implement functions such as cross_val_score and f1_score respectively, among others. They can be used as such:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import f1_score

    cross_val_avg = np.mean(cross_val_score(clf, train_X, train_y, scoring='f1'))
    # tune your parameters for better cross_val_score

    # for model results on a certain classification problem
    f_measure = f1_score(test_y, preds)

Model Saving

Simply pickle your model using pickle.dump and it is ready to be distributed and deployed! Hence a whole machine learning pipeline can be built easily using sklearn.

Finishing Remarks

There are many good books out there talking about machine learning, but in the context of Python, Sebastian Raschka (one of the core developers on sklearn) recently released his book, “Python Machine Learning”, and it’s in great demand. Another great blog you could follow is Erik Bernhardsson’s blog. Along with writing about machine learning, he also discusses software development and other interesting ideas. Do subscribe to the scikit-learn mailing list as well. There are some very interesting questions posted there and a lot of learnings to take home.
The machine learning subreddit also collates information from a lot of different sources and is thus a good place to find useful information.

Scikit-learn has revolutionized the machine learning world by making it accessible to everyone. Machine learning is not like black magic anymore. If you use scikit-learn and like it, do consider contributing to sklearn. There is a huge backlog of open issues and PRs on the sklearn GitHub page. Scikit-learn needs contributors! Have a look at this page to start contributing. Contributing to a library is easily the best way to learn it!

About the Author

Devashish Deshpande started his foray into data science and machine learning in 2015 with an online course, when the question of how machines can learn started intriguing him. He pursued more online courses as well as courses in data science during his undergrad. In order to gain practical knowledge, he started contributing to open source projects, beginning with a small pull request in scikit-learn. He then did a summer project with Gensim and delivered workshops and talks at PyCon France and India in 2016. Currently, Devashish works in the data science team at belong.co, India. Here's the link to his GitHub profile.


Sunith Shetty

20 Dec 2017

12 min read

Understanding Sentiment Analysis and other key NLP concepts

