Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon

Tech Guides - Big Data

50 Articles
article-image-top-5-programming-languages-big-data
Amey Varangaonkar
04 Apr 2018
8 min read
Save for later

Top 5 programming languages for crunching Big Data effectively

Amey Varangaonkar
04 Apr 2018
8 min read
One of the most important decisions that Big Data professionals have to make, especially the ones who are new to the scene or are just starting out, is choosing the best programming languages for big data manipulation and analysis. Understanding the Big Data problem and framing the architecture to solve it is not quite enough these days - the execution needs to be perfect as well, and choosing the right language goes a long way. The best languages for big data In this article, we look at the 5 of the most popularly used - not to mention highly effective - programming languages for developing Big Data solutions. Scala A beautiful crossover of the object-oriented and functional programming paradigms, Scala is fast and robust, and a popular choice of language for many Big Data professionals.The fact that two of the most popular Big Data processing frameworks in Apache Spark and Apache Kafka have been built on top of Scala tells you everything you need to know about the power of Scala. Scala runs on the JVM, which means the codes written in Scala can be easily used within a Java-based Big Data ecosystem. One significant factor that differentiates Scala from Java, though, is that Scala is a lot less verbose in comparison. You can write 100s of lines of confusing-looking Java code in less than 15 lines in Scala. One negative aspect of Scala, though, is its steep learning curve when compared to languages like Go and Python, and this may put off beginners looking to use it. Why use Scala for big data? Fast and robust Suitable for working with Big Data tools like Apache Spark for distributed Big Data processing JVM compliant, can be used in a Java-based ecosystem Python Python has been declared as one of the fastest growing programming languages in 2018 as per the recently held Stack Overflow Developer Survey. Its general-purpose nature means it can be used across a broad spectrum of use-cases, and Big Data programming is one major area of application. Many libraries for data analysis and manipulation which are increasingly being used in a Big Data framework to clean and manipulate large chunks of data, such as pandas, NumPy, SciPy - are all Python-based. Not just that, most popular machine learning and deep learning frameworks such as scikit-learn, Tensorflow and many more, are also written in Python and are finding increasing application within the Big Data ecosystem. One drawback of using Python, and a reason why it is not a first-class citizen when it comes to Big Data programming yet, is that it’s slow. Although very easy to use, Big Data professionals have found systems built with languages such as Java or Scala faster and more robust to use than the systems built with Python. However, Python makes up for this limitation with other qualities. As Python is primarily a scripting language, interactive coding and development of analytical solutions for Big Data becomes very easy. Python can integrate effortlessly with the existing Big Data frameworks such as Apache Hadoop and Apache Spark, allowing you to perform predictive analytics at scale without any problem. Why use Python for big data? General-purpose Rich libraries for data analysis and machine learning Easy to use Supports iterative development Rich integration with Big Data tools Interactive computing through Jupyter notebooks R It won’t come as a surprise to many that those who love statistics, love R. The ‘language of statistics’ as it is popularly called as, R is used to build data models which can be used for effective and accurate data analysis. Powered by a large repository of R packages (CRAN, also called as Comprehensive R Archive Network), with R you have just about every type of tool to accomplish any task in Big Data processing - right from analysis to data visualization. R can be integrated seamlessly with Apache Hadoop and Apache Spark, among other popular frameworks, for Big Data processing and analytics. One issue with using R as a programming language for Big Data is that it is not very general-purpose. It means the code written in R is not production-deployable and generally has to be translated to some other programming language such as Python or Java. That said, if your goal is to only build statistical models for Big Data analytics, R is an option you should definitely consider. Why use R for big data? Built for data science Support for Hadoop and Spark Strong statistical modeling and visualization capabilities Support for Jupyter notebooks Java Last, but not the least, there’s always the good old Java. Some of the traditional Big Data frameworks such as Apache Hadoop and all the tools within its ecosystem are all Java-based, and still in use today in many enterprises. Not to mention the fact that Java is the most stable and production-ready language among all the languages we have discussed so far! Using Java to develop your Big Data applications gives you the ability to use a large ecosystem of tools and libraries for interoperability, monitoring and much more, most of which have already been tried and tested. One major drawback of Java is its verbosity. The fact that you have to write hundreds of lines of codes in Java for a task which can written in barely 15-20 lines of code in Python or Scala, can turnoff many budding programmers. However, the introduction of lambda functions in Java 8 does make life quite easier. Java also does not support iterative development unlike newer languages like Python, and this is an area of focus for the future Java releases. Despite the flaws, Java remains a strong contender when it comes to the preferred language for Big Data programming because of its history and the continued reliance on the traditional Big Data tools and frameworks. Why use Java for big data? Traditional Big Data tools and frameworks are written in Java Stable and production-ready Large ecosystem of tried and tested tools and libraries Go Last but not the least, there’s Go - one of the fastest rising programming languages in recent times. Designed by a group of Google engineers who were frustrated with C++, we think Go is a good shout in this list - simply because of the fact that it powers so many tools used in the Big Data infrastructure, including Kubernetes, Docker and many more. Go is fast, easy to learn, and fairly easy to develop applications with, not to mention deploy them. More importantly, as businesses look at building data analysis systems that can operate at scale, Go-based systems are being used to integrate machine learning and parallel processing of data. It is also possible to interface other languages with Go-based systems with relative ease. Why use Go for big data? Fast, easy to use Many tools used in the Big Data infrastructure are Go-based Efficient distributed computing There are a few other languages you might want to consider - Julia, SAS and MATLAB being some major ones which are useful in their own right. However, when compared to the languages we talked about above, we thought they fell a bit short in some aspects - be it speed, efficiency, ease of use, documentation, or community support, among other things. Let’s take a quick look at the comparison table of all the languages we discussed above. Note that we have used the ✓ symbol for the best possible language/s to help you make an informed decision. This is just our view, and that’s not to say that the other languages are any worse! Scala Python R Java Go Speed ✓ ✓ ✓ Ease of use ✓ ✓ ✓ Quick Learning curve ✓ ✓ Data Analysis capability ✓ ✓ ✓ General-purpose ✓ ✓ ✓ ✓ Big Data support ✓ ✓ ✓ ✓ ✓ Interfacing with other languages ✓ ✓ ✓ Production-ready ✓ ✓ ✓ So...which language should you choose? To answer the question in short - it all depends on the use-case you want to develop. If your focus is hardcore data analysis which involves a lot of statistical computing, R would be your go-to language. On the other hand, if you want to develop streaming applications for your Big Data, Scala can be a preferable choice. If you wish to use Machine Learning to leverage your Big Data and build predictive models, Python will come to your rescue. Lastly, if you plan to build Big Data solutions using just the traditionally-available tools, Java is the language for you. You also have the option of combining the power of two languages to get a more efficient and powerful solution. For example, you can train your machine learning model in Python and deploy it on Spark in a distributed mode. Ultimately, it all depends on how efficiently your solution can function, and more importantly, how fast and accurate it is. Which language do you prefer for crunching your Big Data? Do let us know!
Read more
  • 0
  • 1
  • 17264

article-image-why-is-hadoop-dying
Aaron Lazar
23 Apr 2018
5 min read
Save for later

Why is Hadoop dying?

Aaron Lazar
23 Apr 2018
5 min read
Hadoop has been the definitive big data platform for some time. The name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why is it? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful? Hadoop was essential to the growth of big data When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to data, as against bringing data to compute. Organisations had the opportunity to scale their data without having to worry too much about the cost. It obviously had initial hiccups with security, the complexity of querying and querying speeds, but all that was taken care off, in the long run. Still, although querying speeds remained quite a pain, however that wasn’t the real reason behind Hadoop dying (slowly). As cloud grew, Hadoop started falling One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. There cloud vendor market was pretty crowded, and each of them provided their own big data processing services. These services all basically did what Hadoop was doing. But they also did it in an even more efficient and hassle-free way. Customers didn't have to think about administration, security or maintenance in the way they had to with Hadoop. One person’s big data is another person’s small data Well, this is clearly a fact. Several organisations that used big data technologies without really gauging the amount of data they actually would need to process, have suffered. Imagine sitting with 10TB Hadoop clusters when you don’t have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their massive use of Hadoop. Customers weren’t pleased with nature of Hadoop’s limitations. Apache Hadoop v Apache Spark Hadoop processing is way behind in terms of processing speed. In 2014 Spark took the world by storm. I’m going to let you guess which line in the graph above might be Hadoop, and which might be Spark. Spark was a general purpose, easy to use platform that was built after studying the pitfalls of Hadoop. Spark was not bound to just the HDFS (Hadoop Distributed File System) which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes; a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas Spark wins are Iterative Algorithms in Machine Learning, Interactive Data Mining and Data Processing, Stream processing, Sensor data processing, etc. Machine Learning in Hadoop is not straightforward Unlike MLlib in Spark, Machine Learning is not possible in Hadoop unless tied with a 3rd party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of 3 R packages, have grown for ML, but it still is nowhere comparable to the power of the modern day MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe. Hadoop is not only Hadoop The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, Hbase, Spark, Kudu, Impala, and 20 other products. While e folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other. As with Cloudera and Hortonworks, though the market has seen a downward trend, they’re in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark. Is Hadoop dying? Perhaps not... In the long run, it’s not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much improved version of the framework. Some of the most noteworthy features are its improved shell script, more powerful YARN, improved fault tolerance with erasure coding, and many more. Although, that hasn’t caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use another alternative like Spark along with another framework from the Hadoop family. So, Hadoop’s not going away anytime soon. Read More Pandas is an effective tool to explore and analyze data - Interview insights  
Read more
  • 0
  • 1
  • 13946

article-image-healthcare-analytics-logistic-regression-to-reduce-patient-readmissions
Guest Contributor
20 Dec 2017
8 min read
Save for later

Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions

Guest Contributor
20 Dec 2017
8 min read
[box type="info" align="" class="" width=""]We bring to you another guest post by Benjamin Rojogan on Logistic regression to aid healthcare sector in reducing patient readmission. Ben's previous post on ensemble methods to optimize machine learning models is also available for a quick read here.[/box] ER visits are not cheap for any party involved. Whether this be the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are due to lack of intervention for problems such as substance abuse, chronic diseases and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by playing a role in the overflowing of Emergency Departments (EDs). Research teams at UW and other universities are partnering with companies like Kensci to figure out how to approach the problem of reducing readmission rates. The ability to predict the likelihood of a patient’s readmission will allow for targeted intervention which in turn will help reduce the frequency of readmissions. Thus making the population healthier and hopefully reducing the estimated 41.3 billion USD healthcare costs for the entire system. How do they plan to do it? With big data and statistics, of course. A plethora of algorithms are available for data scientists to use to approach this problem. Many possible variables could affect the readmission and medical costs. Also, there are also many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have been heavily focused on reducing the readmission rate simply by trying to calculate whether a person would or would not be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have random flare ups that require immediate attention. Being able to predict if a patient will have an ER visit can lead to managing the cause more effectively. One approach taken by the data science team at UW as well as the Department of Family and Community Medicine at the University of Toronto was to utilize logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such logistic regression has been a useful model in my experience to approach this problem. Logistic Regression to predict patient readmissions Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other data algorithms? Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, etc. It is used for analyzing data that produces a binary classification based on one or many independent variables. This means, it produces two clear classifications (Yes or No, 1 or 0, etc). With the example above, the binary classification would be: is the patient readmitted or not? Other examples of this could be whether to give a customer a loan or not, whether a medical claim is fraud or not, whether a patient has diabetes or not. Despite its name, logistic regression does not provide the same output like linear regression (per se). There are some similarities, for instance, the linear model is somewhat consistent as you might notice in the equation below where you see what is very similar to a linear equation. But the final output is based on the log odds. Linear regression and multivariate regression both take one to many independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person’s age or the cost of a product an e-commerce should display to each customer. The output is not limited to only a few discrete classifications. Whereas logistic regression produces discrete classifiers. For instance, an algorithm using logistic regression could be used to classify whether or not a certain stock price would be either >$50 a share or <$50 a share. Linear regression would be used to predict if a stock share would be worth $50.01, $50.02….etc. Logistic regression is a calculation that uses the odds of a certain classification. In the equation above, the symbol you might know as pi actually represents the odds or probability. To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear classifier, a boundary that when the coefficients β0 + x · β has a p value that is p < 0.5 then Y = 0. By generating coefficients that help predict the logit transformation, the method allows to classify for the characteristic of interest. Now that is a lot of complex math mumbo jumbo. Let’s try to break it down into simpler terms. Probability vs. Odds Let’s start with probability. Let’s say a patient has the probability of 0.6 of being readmitted. Then the probability that the patient won’t be readmitted is .4. Now, we want to take this and convert it into odds. This is what the formula above is doing. You would take .6/.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability was .5 for both being readmitted and not being readmitted, then the odds would be 1:1. Now the next step in the logistic regression model would be to take the odds and get the “Log odds”. You do this by taking the 1.5 and put it into the log portion of the equation. Now you will get .18(rounded). In logistic regression, we don’t actually know p. That is what we are trying to essentially find and model using various coefficients and input variables. Each input provides a value that changes how much more likely an event will or will not occur. All of these coefficients are used to calculate the log odds. This model can take multiple variables like age, sex, height, etc. and specify how much of an effect each variable has on the odds an event will occur. Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer and translate it into action. Some of us like to say the “computers” are the easy part. Personally I find the hard part to be the “people”. After all, at the end of the day, it comes down to business value. Will an algorithm save money or not? That means it has to be applied in real life. This could take the form of a new initiative, strategy, product recommendation, etc. You need to find the outliers that are worth going after! For instance, if we go back to the patient readmission example again. The algorithm points out patients with high probabilities of being readmitted. However if the readmission costs are low, they will probably be ignored..sadly. That is how businesses (including hospitals) look at problems. Logistic regression is a great tool for binary classification. It is unlike many other algorithms that estimate continuous variables or estimate distributions. This statistical method can be utilized to classify whether a person will be likely to get cancer because of environmental variables like proximity to a highway, smoking habits, etc? This method has been used effectively in the medical, financial and insurance industry successfully for a while. Knowing when to use what algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees. Using logistic regression provides the opportunity for healthcare institutions to accurately target at risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals. [box type="shadow" align="" class="" width=""] About the Author Benjamin Rogojan Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems both solo as well as with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/box]
Read more
  • 0
  • 0
  • 13412

article-image-create-strong-data-science-project-portfolio-lands-job
Aaron Lazar
13 Feb 2018
8 min read
Save for later

How to create a strong data science project portfolio that lands you a job

Aaron Lazar
13 Feb 2018
8 min read
Okay, you’re probably here because you’ve got just a few months to graduate and the projects section of your resume is blank. Or you’re just an inquisitive little nerd scraping the WWW for ways to crack that dream job. Either way, you’re not alone and there are ten thousand others trying to build a great Data Science portfolio to land them a good job. Look no further, we’ll try our best to help you on how to make a portfolio that catches the recruiter’s eye! David “Trent” Salazar‘s portfolio is a great example of a wholesome one and Sajal Sharma’s, is a good example of how one can display their Data Science Portfolios on a platform like Github. Companies are on the lookout for employees who can add value to the business. To showcase this on your resume effectively, the first step is to understand the different ways in which you can add value. 4 things you need to show in a data science portfolio Data science can be broken down into 4 broad areas: Obtaining insights from data and presenting them to the business leaders Designing an application that directly benefits the customer Designing an application or system that directly benefits other teams in the organisation Sharing expertise on data science with other teams You’ll need to ensure that your portfolio portrays all or at least most of the above, in order to easily make it through a job selection. So let’s see what we can do to make a great portfolio. Demonstrate that you know what you're doing So the idea is to show the recruiter that you’re capable of performing the critical aspects of Data Science, i.e. import a data set, clean the data, extract useful information from the data using various techniques, and finally visualise the findings and communicate them. Apart from the technical skills, there are a few soft skills that are expected as well. For instance, the ability to communicate and collaborate with others, the ability to reason and take the initiative when required. If your project is actually able to communicate these things, you’re in! Stay focused and be specific You might know a lot, but rather than throwing all your skills, projects and knowledge in the employer’s face, it’s always better to be focused on doing something and doing it right. Just as you’d do in your resume, keeping things short and sweet, you can implement this while building your portfolio too. Always remember, the interviewer is looking for specific skills. Research the data science job market Find 5-6 jobs, probably from Linkedin or Indeed, that interest you and go through their descriptions thoroughly. Understand what kind of skills the employer is looking for. For example, it could be classification, machine learning, statistical modeling or regression. Pick up the tools that are required for the job - for example, Python, R, TensorFlow, Hadoop, or whatever might get the job done. If you don’t know how to use that tool, you’ll want to skill-up as you work your way through the projects. Also, identify the kind of data that they would like you to be working on, like text or numerical, etc. Now, once you have this information at hand, start building your project around these skills and tools. Be a problem solver Working on projects that are not actual ‘problems’ that you’re solving, won’t stand out in your portfolio. The closer your projects are to the real-world, the easier it will be for the recruiter to make their decision to choose you. This will also showcase your analytical skills and how you’ve applied data science to solve a prevailing problem. Put at least 3 diverse projects in your data science portfolio A nice way to create a portfolio is to list 3 good projects that are diverse in nature. Here are some interesting projects to get you started on your portfolio: Data Cleaning and wrangling Data Cleaning is one of the most critical tasks that a data scientist performs. By taking a group of diverse data sets, consolidating and making sense of them, you’re giving the recruiter confidence that you know how to prep them for analysis. For example, you can take Twitter or Whatsapp data and clean it for analysis. The process is pretty simple; you first find a “dirty” data set, then spot an interesting angle to approach the data from, clean it up and perform analysis on it, and finally present your findings. Data storytelling Storytelling showcases not only your ability to draw insight from raw data, but it also reveals how well you’re able to convey the insights to others and persuade them. For example, you can use data from the bus system in your country and gather insights to identify which stops incur the most delays. This could be fixed by changing their route. Make sure your analysis is descriptive and your code and logic can be followed. Here’s what you do; first you find a good dataset, then you explore the data and spot correlations in the data. Then you visualize it before you start writing up your narrative. Tackle the data from various angles and pick up the most interesting one. If it’s interesting to you, it will most probably be interesting to anyone else who’s reviewing it. Break down and explain each step in detail, each code snippet, as if you were describing it to a friend. The idea is to teach the reviewer something new as you run through the analysis. End to end data science If you’re more into Machine Learning, or algorithm writing, you should do an end-to-end data science project. The project should be capable of taking in data, processing it and finally learning from it, every step of the way. For example, you can pick up fuel pricing data for your city or maybe stock market data. The data needs to be dynamic and updated regularly. The trick for this one is to keep the code simple so that it’s easy to set up and run. You first need to identify a good topic. Understand here that we will not be working with a single dataset, rather you will need to import and parse all the data and bring it under a single dataset yourself. Next, get the training and test data ready to make predictions. Document your code and other findings and you’re good to go. Prove you have the data science skill set If you want to get that job, you’ve got to have the appropriate tools to get the job done. Here’s a list of some of the most popular tools with a link to the right material for you to skill-up: Data science languages There's a number of key languages in data science that are essential. It might seem obvious, but making sure they're on your resume and demonstrated in your portfolio is incredibly important. Include things like: Python R Java Scala SQL Big Data tools If you're applying for big data roles, demonstrating your experience with the key technologies is a must. It not only proves you have the skills, but also shows that you have an awareness of what tools can be used to build a big data solution or project. You'll need: Hadoop, Spark Hive Machine learning frameworks With machine learning so in demand, if you can prove you've used a number of machine learning frameworks, you've already done a lot to impress. Remember, many organizations won't actually know as much about machine learning as you think. In fact, they might even be hiring you with a view to building out this capability. Remember to include: TensorFlow Caffe2 Keras PyTorch Data visualisation tools Data visualization is a crucial component of any data science project. If you can visualize and communicate data effectively, you're immediately demonstrating you're able to collaborate with others and make your insights accessible and useful to the wider business. Include tools like these in your resume and portfolio:  D3.js Excel chart  Tableau  ggplot2 So there you have it. You know what to do to build a decent data science portfolio. It’s really worth attending competitions and challenges. It will not only help you keep up to data and well oiled with your skills, but also give you a broader picture of what people are actually working on and with what tools they’re able to solve problems.
Read more
  • 0
  • 2
  • 11775

article-image-two-popular-data-analytics-methodologies-every-data-professional-should-know-tdsp-crisp-dm
Amarabha Banerjee
21 Dec 2017
7 min read
Save for later

Two popular Data Analytics methodologies every data professional should know: TDSP & CRISP-DM

Amarabha Banerjee
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This is a book excerpt taken from Advanced Analytics with R and Tableau authored by Jen Stirrup & Ruben Oliva Ramos. This book will help you make quick, cogent, and data driven decisions for your business using advanced analytical techniques on Tableau and R.[/box] Today we explore popular data analytics methods such as Microsoft TDSP Process and the CRISP- DM methodology. Introduction There is an increasing amount of data in the world, and in our databases. The data deluge is not going to go away anytime soon! Businesses risk wasting the useful business value of information contained in databases, unless they are able to excise useful knowledge from the data. It can be hard to know how to get started. Fortunately, there are a number of frameworks in data science that help us to work our way through an analytics project. Processes such as Microsoft Team Data Science Process (TDSP) and CRISP-DM position analytics as a repeatable process that is part of a bigger vision. Why are they important? The Microsoft TDSP Process and the CRISP-DM frameworks are frameworks for analytics projects that lead to standardized delivery for organizations, both large and small. In this chapter, we will look at these frameworks in more detail, and see how they can inform our own analytics projects and drive collaboration between teams. How can we have the analysis shaped so that it follows a pattern so that data cleansing is included? Industry standard methodologies for analytics There are a few main methodologies: the Microsoft TDSP Process and the CRISP-DM Methodology. Ultimately, they are all setting out to achieve the same objectives as an analytics framework. There are differences, of course, and these are highlighted here. CRISP-DM and TDSP focus on the business value and the results derived from analytics projects. Both of these methodologies are described in the following sections. CRISP-DM One common methodology is the CRISP-DM methodology (the modeling agency). The Cross Industry Standard Process for Data Mining or (CRISP-DM) model as it is known, is a process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions. The process is loosely divided into six main phases. The phases can be seen in the following diagram: Initially, the process starts with a business idea and a general consideration of the data. Each stage is briefly discussed in the following sections. Business understanding/data understanding The first phase looks at the machine learning solution from a business standpoint, rather than a technical standpoint. The business idea is defined, and a draft project plan is generated. Once the business idea is defined, the data understanding phase focuses on data collection and familiarity. At this point, missing data may be identified, or initial insights may be revealed. This process feeds back to the business understanding phase. CRISP-DM model — data preparation In this stage, data will be cleansed and transformed, and it will be shaped ready for the modeling phase. CRISP-DM — modeling phase In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the data preparation phase in order to correct any unexpected issues. CRISP-DM — evaluation The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the business understanding phase. Otherwise, we may have built models that do not answer the business question. CRISP-DM — deployment The models are published so that the customer can make use of them. This is not the end of the story, however. CRISP-DM — process restarted We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated. CRISP-DM summary CRISP-DM is the most commonly used framework for implementing machine learning projects specifically, and it applies to analytics projects as well. It has a good focus on the business understanding piece. However, one major drawback is that the model no longer seems to be actively maintained. The official site, CRISP-DM.org, is no longer being maintained. Furthermore, the framework itself has not been updated on issues on working with new technologies, such as big data. Big data technologies means that there can be additional effort spend in the data understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of big data sources. Team Data Science Process The TDSP process model provides a dynamic framework to machine learning solutions that have been through a robust process of planning, producing, constructing, testing, and deploying models. Here is an example of the TDSP process: The process is loosely divided into four main phases: Business Understanding Data Acquisition and Understanding Modeling Deployment The phases are described in the following paragraphs. Business understanding The Business understanding process starts with a business idea, which is solved with a machine learning solution. The business idea is defined from the business perspective, and possible scenarios are identified and evaluated. Ultimately, a project plan is generated for delivering the solution. Data acquisition and understanding Following on from the business understanding phase is the data acquisition and understanding phase, which concentrates on familiarity and fact-finding about the data. The process itself is not completely linear; the output of the data acquisition and understanding phase can feed back to the business understanding phase, for example. At this point, some of the essential technical pieces start to appear, such as connecting to data, and the integration of multiple data sources. From the user's perspective, there may be actions arising from this effort. For example, it may be noted that there is missing data from the dataset, which requires further investigation before the project proceeds further. Modeling In the modeling phase of the TDSP process, the R model is created, built, and verified against the original business question. In light of the business question, the model needs to make sense. It should also add business value, for example, by performing better than the existing solution that was in place prior to the new R model. This stage also involves examining key metrics in evaluating our R models, which need to be tested to ensure that the models meet the original business objectives set out in the initial business understanding phase. Deployment R models are published to production, once they are proven to be a fit solution to the original business question. This phase involves the creation of a strategy for ongoing review of the R model's performance as well as a monitoring and maintenance plan. It is recommended to carry out a recurrent evaluation of the deployed models. The models will live in a fluid, dynamic world of data and, over time, this environment will impact their efficacy. The TDSP process is a cycle rather than a linear process, and it does not finish, even if the model is deployed. It is comprised of a clear structure for you to follow throughout the Data Science process, and it facilitates teamwork and collaboration along the way. TDSP Summary The data science unicorn does not exist; that is, the person who is equally skilled in all areas of data science, right across the board. In order to ensure successful projects where each team player contributes according to their skill set, the Team Data Science Summary is a team-oriented solution that emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups, which can include open source technology. If you liked our post, please be sure to check out Advanced Analytics with R and Tableau that shows how to apply various data analytics techniques in R and Tableau across the different stages of a data science project highlighted in this article.  
Read more
  • 0
  • 0
  • 11456

article-image-python-data-stack
Akram Hussain
31 Oct 2014
3 min read
Save for later

Python Data Stack

Akram Hussain
31 Oct 2014
3 min read
The Python programming language has grown significantly in popularity and importance, both as a general programming language and as one of the most advanced providers of data science tools. There are 6 key libraries every Python analyst should be aware of, and they are: 1 - NumPY NumPY: Also known as Numerical Python, NumPY is an open source Python library used for scientific computing. NumPy gives both speed and higher productivity using arrays and metrics. This basically means it's super useful when analyzing basic mathematical data and calculations. This was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPY is that it takes care of all your mathematical problems with useful functions that are cleaner and faster to write than normal Python code. This is all thanks to its similarities with the C language. 2 - SciPY SciPY: Also known as Scientific Python, is built on top of NumPy. SciPy takes scientific computing to another level. It’s an advanced form of NumPy and allows users to carry out functions such as differential equation solvers, special functions, optimizers, and integrations. SciPY can be viewed as a library that saves time and has predefined complex algorithms that are fast and efficient. However, there are a plethora of SciPY tools that might confuse users more than help them. 3 - Pandas Pandas is a key data manipulation and analysis library in Python. Pandas strengths lie in its ability to provide rich data functions that work amazingly well with structured data. There have been a lot of comparisons between pandas and R packages due to their similarities in data analysis, but the general consensus is that it is very easy for anyone using R to migrate to pandas as it supposedly executes the best features of R and Python programming all in one. 4 - Matplotlib Matplotlib is a visualization powerhouse for Python programming, and it offers a large library of customizable tools to help visualize complex datasets. Providing appealing visuals is vital in the fields of research and data analysis. Python’s 2D plotting library is used to produce plots and make them interactive with just a few lines of code. The plotting library additionally offers a range of graphs including histograms, bar charts, error charts, scatter plots, and much more. 5 - scikit-learn scikit-learn is Python’s most comprehensive machine learning library and is built on top of NumPy and SciPy. One of the advantages of scikit-learn is the all in one resource approach it takes, which contains various tools to carry out machine learning tasks, such as supervised and unsupervised learning. 6 - IPython IPython makes life easier for Python developers working with data. It’s a great interactive web notebook that provides an environment for exploration with prewritten Python programs and equations. The ultimate goal behind IPython is improved efficiency thanks to high performance, by allowing scientific computation and data analysis to happen concurrently using multiple third-party libraries. Continue learning Python with a fun (and potentially lucrative!) way to use decision trees. Read on to find out more.
Read more
  • 0
  • 0
  • 10429
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-what-does-a-data-science-team-look-like
Fatema Patrawala
21 Nov 2019
11 min read
Save for later

What does a data science team look like?

Fatema Patrawala
21 Nov 2019
11 min read
Until a couple of years ago, people barely knew the term 'data science' which has now evolved into an extremely popular career field. The Harvard Business Review dubbed data scientist within the data science team as the sexiest job of the 21st century and expert professionals jumped on the data is the new oil bandwagon. As per the Figure Eight Report 2018, which takes the pulse of the data science community in the US, a lot has changed rapidly in the data science field over the years. For the 2018 report, they surveyed approximately 240 data scientists and found out that machine learning projects have multiplied and more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs. And the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, it is evident for data science teams to evolve and change among various organizations. The data science team is responsible for delivering complex projects where system analysis, software engineering, data engineering, and data science is used to deliver the final solution. To achieve all of this, the team does not only have a data scientist or a data analyst but also includes other roles like business analyst, data engineer or architect, and chief data officer. In this post, we will differentiate and discuss various job roles within a data science team, skill sets required and the compensation benefit for each one of them. For an in-depth understanding of data science teams, read the book, Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps.  Now let's get into understanding individual data science roles and functions, but before that we take a look at the structure of the team.There are three basic team structures to match different stages of AI/ML adoption: IT centric team structure At times for companies hiring a data science team is not an option, and they have to leverage in-house talent. During such situations, they take advantage of the fully functional in-house IT department. The IT team manages functions like data preparation, training models, creating user interfaces, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold. The documentation helps in fast training and early deployment of models even without an expert data scientists on board. Integrated team structure Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resource is the most viable option for constant and scalable machine learning operations. Unlike the IT centric approach, the integrated method requires having an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques. Additionally, the team leverages deeper understanding of machine learning tools and libraries – like TensorFlow or Theano which are specifically for researchers and data science experts. Specialized data science team Companies can also have an independent data science department to build an all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. It doesn't necessarily mean that all team members should have a data science background, but they should have technology background with certain service management skills. A specialized structure model aids in addressing complex data science tasks that include research, use of multiple ML models tailored to various aspects of decision-making, or multiple ML backed services. Today's most successful Silicon Valley tech operates with specialized data science teams. Additionally they are custom-built and wired for specific tasks to achieve different business goals. For example, the team structure at Airbnb is one of the most interesting use cases. Martin Daniel, a data scientist at Airbnb in this talk explains how the team emphasizes on having an experimentation-centric culture and apply machine learning rigorously to address unique product challenges. Job roles and responsibilities within data science team As discussed earlier, there are many roles within a data science team. As per Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis. Individuals involved in Type A are statisticians that make sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc. Type B stands for building. These individuals use data in production. They're good software engineers with strong programming knowledge and statistics background. They build recommendation systems, personalization use cases, etc. Though it is rare that one expert will fit into a single category. But understanding these data science functions can help make sense of the roles described further. Chief data officer/Chief analytics officer The chief data officer (CDO) role has been taking organizations by storm. A recent NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization appointed a chief data officer. The role of chief data officer involves overseeing a range of data-related functions that may include data management, ensuring data quality and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, it is often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and how a company should treat its data. It does make sense as analytics provide insights and value to the data. Hence, with a CDO+CAO combination companies can take advantage of a good data strategy and proper data management without losing on quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually. Skill sets required: Data science and analytics, programming skills, domain expertise, leadership and visionary abilities are required. Data analyst The data analyst role implies proper data collection and interpretation activities. The person in this job role will ensure that collected data is relevant and exhaustive while also interpreting the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States. Skill sets required: Programming languages like R, Python, JavaScript, C/C++, SQL. With this critical thinking, data visualization and presentation skills will be good to have. Data scientist Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore what problems are needed to be solved. A data scientist is an individual who develops machine learning models to make predictions and is well versed in algorithm development and computer science. This person will also know the complete lifecycle of the model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data, using various types of analytics and reporting tools to detect patterns, trends and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709. Skills set required: A data scientist will require knowledge of big data platforms and tools like  Seahorse powered by Apache Spark, JupyterLab, TensorFlow and MapReduce; and programming languages that include SQL, Python, Scala and Perl; and statistical computing languages, such as R. They should also have cloud computing capabilities and knowledge of various cloud platforms like AWS, Microsoft Azure etc.You can also read this post on how to ace a data science interview to know more. Machine learning engineer At times a data scientist is confused with machine learning engineers, but a machine learning engineer is a distinct role that involves different responsibilities. A machine learning engineer is someone who is responsible for combining software engineering and machine modeling skills. This person determines which model to use and what data should be used for each model. Probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer's salary is $146,085 in the US, and is ranked No.1 on the Indeed's Best Jobs in 2019 list. Skill sets required: Machine learning engineers will be required to have expertise in computer science and programming languages like R, Python, Scala, Java etc. They would also be required to have probability techniques, data modelling and evaluation techniques. Data architects and data engineers The data architects and data engineers work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer can use to build a digital framework. The data engineering role has recently evolved from the traditional software-engineering field.  Recent enterprise data management experiments indicate that the data-focused software engineers are needed to work along with the data architects to build a strong data architecture. Average salary for a data architect in the US ranges from $1,22,000 to $1,29, 000 annually as per a recent LinkedIn survey. Skill sets required: A data architect or an engineer should have a keen interest and experience in programming languages frameworks like HTML5, RESTful services, Spark, Python, Hive, Kafka, and CSS etc. They should have the required knowledge and experience to handle database technologies such as PostgreSQL, MapReduce and MongoDB and visualization platforms such as; Tableau, Spotfire etc. Business analyst A business analyst (BA) basically handles Chief analytics officer's role but on the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap. They are responsible for using data analytics to assess processes, determine requirements and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes will be implemented to processes, products, services, software and hardware. They further articulate these ideas and balance them against technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed. Skill sets required: Excellent domain and industry expertise will be required. With this good communication as well as data visualization skills and knowledge of business intelligence tools will be good to have. Data visualization engineer This specific role is not present in each of the data science teams as some of the responsibilities are realized by either a data analyst or a data architect. Hence, this role is only necessary for a specialized data science model. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for your stakeholders. Regardless of the technology, successful data visualization engineers have to understand principles of design, both graphical and more generally user-centered design. As per Payscale, the average salary for a data visualization engineer is $98,264. Skill sets required: A data visualization engineer need to have rigorous knowledge of data visualization methods and be able to produce various charts and graphs to represent data. Additionally they must understand the fundamentals of design principles and visual display of information. To sum it up, a data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building up the team from scratch and find it hard to figure where to start from. If you are facing a similar dilemma, check out this book, Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing and sustaining a successful data science team. How to learn data science: from data mining to machine learning How to ace a data science interview Data science vs. machine learning: understanding the difference and what it means today 30 common data science terms explained 9 Data Science Myths Debunked
Read more
  • 0
  • 0
  • 10156

article-image-15-useful-python-libraries-to-make-your-data-science-tasks-easier
Amey Varangaonkar
12 Feb 2018
10 min read
Save for later

15 Useful Python Libraries to make your Data Science tasks Easier

Amey Varangaonkar
12 Feb 2018
10 min read
Python has become a big hit in the Data Science community over the last five years. So much so that it is slowly taking over R - the ‘lingua franca of statistics’ - as the preferred choice of tool for many. The recently published Stack Overflow Developer Survey 2018 suggests Python is the next big programming language, and its adoption in the industry is only going to increase. Python’s rise has been staggering, but not really surprising. Its general-purpose nature, coupled with the efficiency and ease of use make it easier for you to build your data science solutions without any hassle. You also have a rich suite of Python libraries available at your disposal for all your Data Science-related tasks - from basic web scraping to something as complex as training deep learning models. In this article, we take a look at some of the most popular and widely used Python libraries and their application areas. Web Scraping Web scraping is a popular information extraction technique from the web using the HTTP protocol, with the help of a web browser. The two most commonly used tools for web scraping are, unsurprisingly, Python-based. 1.Beautiful Soup Beautiful Soup is a popular Python library for extracting information out of the HTML and XML files. It provides a unique, easy way to navigate, search and modify the parsed data, potentially saving you hours of needless work. It works with both the versions of Python, i.e. 2.7 and 3.x and is very easy to use. Check out our latest tutorial on how to scrape web page using the Beautiful Soup. [box type="info" align="" class="" width=""]Editor's Tip: If you’re new to the concept of web scraping, Beautiful Soup should be your go-to library. You can learn more about how to use this library more efficiently in our book Python Web Scraping Cookbook [/box] 2.Scrapy Scrapy is a free, open source framework written in Python. Although developed for web scraping, it can also be used as a general web crawler and extract data using different APIs. Following the ‘Don’t Repeat Yourself’ philosophy of frameworks such as Django, Scrapy includes a set of self-contained crawlers, with each of them following specific instructions with a specific objective. [box type="info" align="" class="" width=""]Editor’s tip: To learn how to use Scrapy for your scraping projects, our book Python Web Scraping, Second Edition is definitely worth checking out. [/box] Scientific Computation and Data Analysis Arguably the most common data science tasks, Python proves to be of great worth to data scientists by providing unique libraries for data manipulation and analysis, as well as mathematical computation. 3. NumPy NumPy is the most popular library for scientific computing in Python and is a part of the larger Python stack for scientific computation called SciPy (discussed below). Apart from its uses in linear algebra and other mathematical functions, it can also be used as a multi-dimensional container, or array, of generic data with arbitrary data types. NumPy integrates seamlessly languages such as C/C++ and because of its support for multiple data types, it works well with a variety of databases as well. 4. SciPy SciPy is a Python-based framework containing open source libraries for mathematics, scientific computation and data analysis.  The SciPy library is a collection of algorithms and tools for advanced mathematical computations, statistics and much more. The SciPy stack consists of the following libraries: NumPy - Python package for numerical computation SciPy - One of the core packages of the SciPy stack for signal processing, optimization and advanced statistics matplotlib - Popular Python library for data visualization SymPy - Library for symbolic mathematics and algebra pandas - Python library for data manipulation and analysis iPython -  Interactive console to run Python-based code 5. pandas pandas is a widely used Python package providing data structures and tools for effective data manipulation and analysis. It is a popularly used tool for Quantitative Analysis and finds a lot of application in algorithmic trading and risk analysis. With a large community of dedicated users, pandas is regularly updated to get new API changes, performance updates and bug fixes. This is one library you definitely need to work with to truly realize its power. [box type="info" align="" class="" width=""]Editor's Tip: To get a more hands-on understanding of how to effectively use pandas for data analysis, make sure you check out our highly popular title pandas Cookbook.[/box] Machine Learning and Deep Learning Python trumps all other languages when it comes to implementing efficient machine learning and deep learning models, simply by virtue of its diverse, effective and easy to use set of libraries. It is worth having a look at the experts’ take on why Python is great for machine learning and Artificial Intelligence. In this section, we see some of the most popular and commonly used Python libraries for machine learning and deep learning: 6. Scikit-learn scikit-learn is the most popular Python library for data mining, analysis and machine learning. It is built using the capabilities of NumPy, SciPy and matplotlib, and is commercially usable. You can implement a variety of machine learning techniques such as classification, regression, clustering and more, using scikit-learn. It is very easy to install and has a clean, slick documentation for anyone looking to get started with it. [box type="info" align="" class="" width=""]Editor’s tip: To understand how to use scikit-learn in your machine learning projects, our bestselling book Python Machine Learning, Second Edition is all you need. If you’re looking to specifically master scikit-learn, Mastering Machine Learning with scikit-learn will prove to be a very useful resource. Check it out! [/box] 7. Tensorflow Tensorflow is the popular machine learning library everyone seems to be talking about today. It is a Python-based framework for effective machine learning and deep learning using multiple CPUs or GPUs. Backed by Google, it was initially developed by the research team of Google Brain, and is the widely used framework in the world for machine intelligence. It enjoys the support of a large community of active users and is finding widespread application for advanced machine learning across a multitude of industrial domains - from manufacturing and retail to healthcare and smart cars. If you are interested to know more about Tensorflow, you can quickly check out the tutorial here. [box type="info" align="" class="" width=""]Editor's Tip: Tensorflow being the most popular framework for machine learning and deep learning, it is one library you should definitely master. Check out the following books to skill up quickly! Machine Learning with TensorFlow 1.x TensorFlow Machine Learning Cookbook Deep Learning with TensorFlow Tensorflow 1.x Deep Learning Cookbook Mastering Tensorflow 1.x [/box] 8. Keras Keras is a Python-based neural networks API, and offers a simplified interface to train and deploy your deep learning models with ease. It has support for a variety of deep learning frameworks such as Tensorflow, Deeplearning4j and CNTK. Keras is very user-friendly, follows a modular approach and supports both CPU and GPU-based computations. If you want to make the deep learning process simpler and effective, this library is definitely worth checking out! [box type="info" align="" class="" width=""]Editor's Tip: If you’re looking for a resource that teaches you how to use Keras effectively, our trending book Deep Learning with Keras will be of great help to you! [/box] 9. PyTorch One of the more recent additions to Python deep learning family is PyTorch, a neural network modeling library with strong GPU support. Although still in a beta stage, this project is backed by bigwigs such as Facebook and Twitter. PyTorch builds on the architecture of Torch, another popular deep library, to enable more efficient tensor computation and implementation of dynamic neural networks. [box type="info" align="" class="" width=""]Editor's Tip: Here is Deep Learning with PyTorch to get you started with this amazing tool. [/box] Natural Language Processing Natural Language Processing pertains to designing of systems that process, interpret and analyze human language, spoken or written. Python offers unique libraries for performing a variety of tasks such as working with structured and unstructured text, predictive analytics and much more. 10. NLTK NLTK is a popular Python library for language processing. It offers easy to use interfaces for a variety of NLP tasks such as text classification, tokenization, text parsing, semantic reasoning and much more. It is an open source, community-driven project, and has support for both Python 2 and Python 3. 11. SpaCy SpaCy is another library for advanced natural language processing, based on Python and Cython. It has an extensive support for various deep learning libraries and frameworks such as Tensorflow and PyTorch. With SpaCy, you can build complex statistical models for NLP with relative ease. SpaCy is easy to install and use, and proves to be of great help when it comes to large-scale extracting and analyzing of textual information. [box type="info" align="" class="" width=""]Editor's Tip: To know more about how these libraries are used for natural language processing, make sure you check out the book Natural Language Processing with Python Cookbook [/box] Data Visualization Data visualization is a popularly used Data Science technique for visually analysing and communicating information and valuable business insights through graphs, charts, dashboards and reports. Python offers a lot of popular libraries for effective data storytelling. Some of them are listed below: 12. matplotlib matplotlib is the most popular Python library for data visualization which allows for enterprise-grade 2D and 3D plotting. With matplotlib, you can build different kinds of visualizations such as histograms, bar charts, scatter plots and much more, with just a few lines of code. The popularity of matplotlib rivals that of R’s highly acclaimed ggplot2, and deciding which library is better has been a hot topic for debate, for many years now. Matplotlib runs seamlessly on all Python consoles, including iPython and Jupyter notebooks, giving you all the necessary tools to create and share your data visualizations with others. [box type="info" align="" class="" width=""]Editor's Tip: Get started with matplotlib today, with the help of Matplotlib 2.x By Example [/box] 13. Seaborn Seaborn is a Python-based data visualization library, which finds its roots in matplotlib. Apart from offering attractive and insightful data visualizations, seaborn also offers strong support for other Python libraries such as NumPy and pandas. Per the official seaborn page: “If matplotlib “tries to make easy things easy and hard things possible”, seaborn tries to make a well-defined set of hard things easy too.” 14. Bokeh Bokeh is an interactive data visualization library based on Python. It aims to provide D3.js style elegant graphics and visualizations and runs primarily on modern web browsers. Apart from the ability to create a wide variety of visualizations, Bokeh also supports large-scale interactivity and visualizations of real-time datasets. 15. Plotly Plotly is a popularly used Python library which is used across the world for making publication-quality plots and graphs. With Plotly, you can build interactive dashboards, scatter plots, histograms, candlestick charts, heat maps, and a whole host of other data visualizations with ease. With superior interactivity, deployment and publication capabilities, Plotly is used across different domains, majorly finance and geospatial industries for effective data storytelling. So there you have it! Python has an extensive suite of libraries for every data science related task, each equipped with unique features to make the task fast and hassle-free. While there are a lot more Python libraries out there, we cherry-picked these 15 libraries based on their popularity, usefulness and the value they bring to the table. Also, the extensive community support for Python means you can get help for any kind of problem you might come across while using these tools. It's time now for you to go out there and crunch some data with some of these Python powered libraries!
Read more
  • 0
  • 0
  • 9052

article-image-looking-different-types-lookup-cache
Savia Lobo
20 Nov 2017
6 min read
Save for later

Looking at the different types of Lookup cache

Savia Lobo
20 Nov 2017
6 min read
[box type="note" align="" class="" width=""]The following is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. We walk through the various types of lookup cache based on how a cache is defined in this article.[/box] Cache is the temporary memory that is created when you execute a process. It is created automatically when a process starts and is deleted automatically once the process is complete. The amount of cache memory is decided based on the property you define at the transformation level or session level. You usually set the property as default, so as required, it can increase the size of the cache. If the size required for caching the data is more than the cache size defined, the process fails with the overflow error. There are different types of caches available. Building the Lookup Cache - Sequential or Concurrent You can define the session property to create the cache either sequentially or concurrently. Sequential cache When you select to create the cache sequentially, Integration Service caches the data in a row-wise manner as the records enter the lookup transformation. When the first record enters the lookup transformation, lookup cache gets created and stores the matching record from the lookup table or file in the cache. This way, the cache stores only the matching data. It helps in saving the cache space by not storing unnecessary data. Concurrent cache When you select to create cache concurrently, Integration service does not wait for the data to flow from the source; it first caches complete data. Once the caching is complete, it allows the data to flow from the source. When you select a concurrent cache, the performance enhances as compared to sequential cache since the scanning happens internally using the data stored in the cache. Persistent cache - the permanent one You can configure the cache to permanently save the data. By default, the cache is created as non-persistent, that is, the cache will be deleted once the session run is complete. If the lookup table or file does not change across the session runs, you can use the existing persistent cache. Suppose you have a process that is scheduled to run every day and you are using lookup transformation to lookup on the reference table that which is not supposed to change for six months. When you use non-persistent cache every day, the same data will be stored in the cache; this will waste time and space every day. If you select to create a persistent cache, the integration service makes the cache permanent in the form of a file in the $PMCacheDir location. So, you save the time every day, creating and deleting the cache memory. When the data in the lookup table changes, you need to rebuild the cache. You can define the condition in the session task to rebuild the cache by overwriting the existing cache. To rebuild the cache, you need to check the rebuild option on the session property. Sharing the cache - named or unnamed You can enhance the performance and save the cache memory by sharing the cache if there are multiple lookup transformations used in a mapping. If you have the same structure for both the lookup transformations, sharing the cache will help in enhancing the performance by creating the cache only once. This way, we avoid creating the cache multiple times, which in turn, enhances the performance. You can share the cache--either named or unnamed Sharing unnamed cache If you have multiple lookup transformations used in a single mapping, you can share the unnamed cache. Since the lookup transformations are present in the same mapping, naming the cache is not mandatory. Integration service creates the cache while processing the first record in first lookup transformation and shares the cache with other lookups in the mapping. Sharing named cache You can share the named cache with multiple lookup transformations in the same mapping or in another mapping. Since the cache is named, you can assign the same cache using the name in the other mapping. When you process the first mapping with lookup transformation, it saves the cache in the defined cache directory and with a defined cache file name. When you process the second mapping, it searches for the same location and cache file and uses the data. If the Integration service does not find the mentioned cache file, it creates the new cache. If you run multiple sessions simultaneously that use the same cache file, Integration service processes both the sessions successfully only if the lookup transformation is configured for read-only from the cache. If there is a scenario when both lookup transformations are trying to update the cache file or a scenario where one lookup is trying to read the cache file and other is trying to update the cache, the session will fail as there is conflict in the processing. Sharing the cache helps in enhancing the performance by utilizing the cache created. This way we save the processing time and repository space by not storing the same data multiple times for lookup transformations. Modifying cache - static or dynamic When you create a cache, you can configure them to be static or dynamic. Static cache A cache is said to be static if it does not change with the changes happening in the lookup table. The static cache is not synchronized with the lookup table. By default, Integration service creates a static cache. The Lookup cache is created as soon as the first record enters the lookup transformation. Integration service does not update the cache while it is processing the data. Dynamic cache A cache is said to be dynamic if it changes with the changes happening in the lookup table. The static cache is synchronized with the lookup table. You can choose from the lookup transformation properties to make the cache dynamic. Lookup cache is created as soon as the first record enters the lookup transformation. Integration service keeps on updating the cache while it is processing the data. The Integration service marks the record as an insert for the new row inserted in the dynamic cache. For the record that is updated, it marks the record as an update in the cache. For every record that doesn't change, the Integration service marks it as unchanged. You use the dynamic cache while you process the slowly changing dimension tables. For every record inserted in the target, the record will be inserted in the cache. For every record updated in the target, the record will be updated in the cache. A similar process happens for the deleted and rejected records.
Read more
  • 0
  • 0
  • 8595

article-image-how-do-data-structures-and-data-models-differ
Amey Varangaonkar
21 Dec 2017
7 min read
Save for later

How do Data Structures and Data Models differ?

Amey Varangaonkar
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Statistics for Data Science, authored by James D. Miller. The book presents interesting techniques through which you can leverage the power of statistics for data manipulation and analysis.[/box] In this article, we will be zooming the spotlight on data structures and data models, and also understanding the difference between both. Data structures Data developers will agree that whenever one is working with large amounts of data, the organization of that data is imperative. If that data is not organized effectively, it will be very difficult to perform any task on that data, or at least be able to perform the task in an efficient manner. If the data is organized effectively, then practically any operation can be performed easily on that data. A data or database developer will then organize the data into what is known as data structures. Following image is a simple binary tree, where the data is organized efficiently by structuring it: A data structure can be defined as a method of organizing large amounts of data more efficiently so that any operation on that data becomes easy. Data structures are created in such a way as to implement one or more particular abstract data type (ADT), which in turn will stipulate what operations can be performed on the data structure, as well as the computational complexity of those operations. [box type="info" align="" class="" width=""]In the field of statistics, an ADT is a model for data types where a data type is defined by its behavior from the point of view (POV) of users of that data, explicitly showing the possible values, the possible operations on data of this type, and the behavior of all of these operations.[/box] Database design is then the process of using the defined data structures to produce a detailed data model, which will become the database. This data model must contain all of the required logical and physical design selections, as well as the physical storage parameters needed to produce a design in a Data Definition Language (DDL), which can then be used to create an actual database. [box type="info" align="" class="" width=""]There are varying degrees of the data model, for example, a fully attributed data model would also contain detailed attributes for each entity in the model.[/box] So, is a data structure a data model? No, a data structure is used to create a data model. Is this data model the same as data models used in statistics? Let's see in the next section. Data models You will find that statistical data models are at the heart of statistical analytics. In the simplest terms, a statistical data model is defined as the following: A representation of a state, process, or system that we want to understand and reason about In the scope of the previous definition, the data or database developer might agree that in theory or in concept, one could use the same terms to define a financial reporting database, as it is designed to contain business transactions and is arranged in data structures that allow business analysts to efficiently review the data, so that they can understand or reason about particular interests they may have concerning the business. Data scientists develop statistical data models so that they can draw inferences from them and, more importantly, make predictions about a topic of concern. Data developers develop databases so that they can similarly draw inferences from them and, more importantly, make predictions about a topic of concern (although perhaps in some organizations, databases are more focused on past and current events (transactions) than on forward-thinking ones (predictions)). Statistical data models come in a multitude of different formats and flavours (as do databases). These models can be equations linking quantities that we can observe or measure; they can also be simply, sets of rules. Databases can be designed or formatted to simplify the entering of online transactions—say, in an order entry system—or for financial reporting when the accounting department must generate a balance sheet, income statement, or profit and loss statement for shareholders. [box type="info" align="" class="" width=""]I found this example of a simple statistical data model: Newton's Second Law of Motion, which states that the net sum of force acting on an object causes the object to accelerate in the direction of the force applied, and at a rate proportional to the resulting magnitude of the force and inversely proportional to the object's mass.[/box] What's the difference? Where or how does the reader find the difference between a data structure or database and a statistical model? At a high level, as we speculated in previous sections, one can conclude that a data structure/database is practically the same thing as a statistical data model, as shown in the following image: At a high level, as we speculated in previous sections, one can conclude that a data structure/database is practically the same thing as a statistical data model. When we take the time to drill deeper into the topic, you should consider the following key points: Although both the data structure/database and the statistical model could be said to represent a set of assumptions, the statistical model typically will be found to be much more keenly focused on a particular set of assumptions concerning the generation of some sample data, and similar data from a larger population, while the data structure/database more often than not will be more broadly based A statistical model is often in a rather idealized form, while the data structure/database may be less perfect in the pursuit of a specific assumption Both a data structure/database and a statistical model are built around relationships between variables The data structure/database relationship may focus on answering certain questions, such as: What are the total orders for specific customers? What are the total orders for a specific customer who has purchased from a certain salesperson? Which customer has placed the most orders? Statistical model relationships are usually very simple, and focused on proving certain questions: Females are shorter than males by a fixed amount Body mass is proportional to height The probability that any given person will partake in a certain sport is a function of age, sex, and socioeconomic status Data structures/databases are all about the act of summarizing data based on relationships between variables Relationships The relationships between variables in a statistical model may be found to be much more complicated than simply straightforward to recognize and understand. An illustration of this is awareness of effect statistics. An effect statistic is one that shows or displays a difference in value to one that is associated with a difference related to one or more other variables. Can you image the SQL query statements you'd use to establish a relationship between two database variables based upon one or more effect statistic? On this point, you may find that a data structure/database usually aims to characterize relationships between variables, while with statistical models, the data scientist looks to fit the model to prove a point or make a statement about the population in the model. That is, a data scientist endeavors to make a statement about the accuracy of an estimate of the effect statistic(s) describing the model! One more note of interest is that both a data structure/database and a statistical model can be seen as tools or vehicles that aim to generalize a population; a database uses SQL to aggregate or summarize data, and a statistical model summarizes its data using effect statistics. The above argument presented the notion that data structures/databases and statistical data models are, in many ways, very similar. If you found this excerpt to be useful, check out the book Statistics for Data Science, which demonstrates different statistical techniques for implementing various data science tasks such as pre-processing, mining, and analysis.  
Read more
  • 0
  • 0
  • 7829
article-image-when-why-and-how-to-use-graph-analytics-for-your-big-data
Sunith Shetty
20 Dec 2017
10 min read
Save for later

When, why and how to use Graph analytics for your big data

Sunith Shetty
20 Dec 2017
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform real-time streaming analytics on big data using machine learning algorithms and power of Java. [/box] From the article given below, you will learn why graph analytics is a favourable choice in order to analyze complex datasets. Graph analytics Vs Relational Databases The biggest advantage to using graphs is you can analyze these graphs and use them for analyzing complex datasets. You might ask what is so special about graph analytics that we can’t do by relational databases. Let’s try to understand this using an example, suppose we want to analyze your friends network on Facebook and pull information about your friends such as their name, their birth date, their recent likes, and so on. If Facebook had a relational database, then this would mean firing a query on some table using the foreign key of the user requesting this info. From the perspective of relational database, this first level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram). The query to get this becomes more and more complicated from a relational database perspective but this is a trivial task on a graph or graphical database (such as Neo4j). Graphs are extremely good on operations where you want to pull information from one end of the node to another, where the other node lies after a lot of joins and hops. As such, graph analytics is good for certain use cases (but not for all use cases, relational database are still good on many other use cases): As you can see, the preceding diagram depicts a huge social network (though the preceding diagram might just be depicting a network of a few friends only). The dots represent actual people in a social network. So if somebody asks to pick one user on the left-most side of the diagram and see and follow host connections to the right-most side and pull the friends at the say 10th level or more, this is something very difficult to do in a normal relational database and doing it and maintaining it could easily go out of hand. There are four particular use cases where graph analytics is extremely useful and used frequently (though there are plenty more use cases too): Path analytics: As the name suggests, this analytics approach is used to figure out the paths as you traverse along the nodes of a graph. There are many fields where this can be used—simplest being road networks and figuring out details such as shortest path between cities, or in flight analytics to figure out the shortest time taking flight or direct flights. Connectivity analytics: As the name suggests, this approach outlines how the nodes within a graph are connected to each other. So using this you can figure out how many edges are flowing into a node and how many are flowing out of the node. This kind of information is very useful in analysis. For example, in a social network if there is a person who receives just one message but gives out say ten messages within his network then this person can be used to market his favorite products as he is very good in responding to messages. Community Analytics: Some graphs on big data are huge. But within these huge graphs there might be nodes that are very close to each other and are almost stacked in a cluster of their own. This is useful information as based on this you can extract out communities from your data. For example, in a social network if there are people who are part of some community, say marathon runners, then they can be clubbed into a single community and further tracked. Centrality Analytics: This kind of analytical approach is useful in finding central nodes in a network or graph. This is useful in figuring out sources that are single handedly connected to many other sources. It is helpful in figuring out influential people in a social network, or a central computer in a computer network. From the perspective of this article, we will be covering some of these use cases in our sample case studies and for this we will be using a library on Apache Spark called GraphFrames. GraphFrames GraphX library is advanced and performs well on massive graphs, but, unfortunately, it’s currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for dataframe (now dataset) based graphs. It contains a lot of methods that are direct wrappers over the underlying sparkx methods. As such it provides similar functionality as GraphX except that GraphX acts on the Spark SRDD and GraphFrame works on the dataframe so GraphFrame is more user friendly (as dataframes are simpler to use). All the advantages of firing Spark SQL queries, joining datasets, filtering queries are all supported on this. To understand GraphFrames and representing massive big data graphs, we will take small baby steps first by building some simple programs using GraphFrames before building full-fledged case studies. First, let’s see how to build a graph using Spark and GraphFrames on some sample dataset. Building a graph using GraphFrames Consider that you have as simple graph as shown next. This graph depicts four people Kai, John, Tina, and Alex and the relation they share whether they follow each other or are friends. We will now try to represent this basic graph using the GraphFrame library on top of Apache Spark and in the meantime, we will also start learning the GraphFrame API. Since GraphFrame is a module on top of Spark, let’s first build the Spark configuration and spark sql context for brevity: SparkConfconf= ... JavaSparkContextsc= ... SQLContextsqlContext= ... We will now build the JavaRDD object that will contain the data for our vertices or the people Kai, John, Alex, and Tina in this small network. We will create some sample data using the RowFactory class of Spark API and provide the attributes (ID of the person, and their name and age) that we need per row of the data: JavaRDD<Row>verRow = sc.parallelize(Arrays.asList(RowFactory.create(101L,”Kai”,27),        RowFactory.create(201L,”John”,45),   RowFactory.create(301L,”Alex”,32),        RowFactory.create(401L,”Tina”,23))); Next we will define the structure or schema of the attributes used to build the data. The ID of the person is of type long and the name of the person is a string, and the age of the person is an integer as shown next in the code: List<StructField>verFields = newArrayList<StructField>();  verFields.add(DataTypes.createStructField(“id”,DataTypes.LongType, true));  verFields.add(DataTypes.createStructField(“name”,DataTypes.StringType, true));      verFields.add(DataTypes.createStructField(“age”,DataTypes.IntegerType, true)); Now, let’s build the sample data for the relations between these people and this can basically be represented as the edges of the graph later. This data item of relationship will have the IDs of the persons that are connected together and the type of relationship they share (that is friends or followers). Again we will use the Spark provided RowFactory and build some sample data per row and create the JavaRDD with this data: JavaRDD<Row>edgRow = sc.parallelize(Arrays.asList(        RowFactory.create(101L,301L,”Friends”),        RowFactory.create(101L,401L,”Friends”),        RowFactory.create(401L,201L,”Follow”),        RowFactory.create(301L,201L,”Follow”),        RowFactory.create(201L,101L,”Follow”))); Again, define the schema of the attributes added as part of the edges earlier. This schema is later used in building the dataset for the edges. The attributes passed are the source ID of the node, destination ID of the other node, as well as the relationType, which is a string: List<StructField>EdgFields = newArrayList<StructField>();    EdgFields.add(DataTypes.createStructField(“src”,DataTypes.LongType,true));  EdgFields.add(DataTypes.createStructField(“dst”,DataTypes.LongType,true));  EdgFields.add(DataTypes.createStructField(“relationType”,DataTypes.StringType,true)); Using the schemas that we have defined for the vertices and edges, let’s now build the actual dataset for the vertices and the edges. For this, first create the StructType  object that holds the schema details for the vertices and the edges data and using this structure and the actual data we will next build the dataset of the verticles (verDF) and the dataset for the edges (edgDF): StructTypeverSchema = DataTypes.createStructType(verFields); StructTypeedgSchema = DataTypes.createStructType(EdgFields);    Dataset<Row>verDF = sqlContext.createDataFrame(verRow, verSchema); Dataset<Row>edgDF = sqlContext.createDataFrame(edgRow, edgSchema); Finally, we will now use the vertices and the edges dataset and pass it as a parameter to the GraphFrame constructor and build the GraphFrame instance: GraphFrameg = newGraphFrame(verDF,edgDF); Time has now come to see some mild analytics on the graph we just created. Let’s first visualize our data for the graphs; let’s see the data on the vertices. For this, we will invoke the vertices method on the GraphFrame instance and invoke the standard show method on the generated vertices dataset (GraphFrame would generate a new dataset when the vertices method is invoked). g.vertices().show(); This would print the output as follows: Let’s also see the data on the edges: g.edges().show(); This would print as the output as follows: Let’s also see the number of edges and the number of vertices: System.out.println(“Number of Vertices : “ + g.vertices().count()); System.out.println(“Number of Edges : “ + g.edges().count()); This would print the result as follows: Number of Vertices : 4 Number of Edges : 5 GraphFrame has a handy method to find all the indegrees (out degree or degree) g.inDegrees().show(); This would print the in degrees of all the vertices as shown next: Finally, let’s see one more small thing on this simple graph. As GraphFrames work on the datasets, all the dataset handy methods such as filtering, map, and so on can be applied on them. We will use the filter method and run it on the vertices dataset to figure out the people in the graph with age greater than thirty: g.vertices().filter(“age > 30”).show(); This would print the result as follows: From this post, we learned about graph analytics. We saw how graphs can be built from massive big datasets in order to derive quick insights. You will understand when to implement graph analytics or relational database based on the growing challenges in your organization. To know more about preparing and refining big data and to perform smart data analytics using machine learning algorithms you can refer to the book Big Data Analytics with Java.
Read more
  • 0
  • 0
  • 7789

article-image-top-4-chatbot-development-frameworks-developers
Sugandha Lahoti
20 Oct 2017
8 min read
Save for later

Top 4 chatbot development frameworks for developers

Sugandha Lahoti
20 Oct 2017
8 min read
The rise of the bots is nigh! If you can imagine a situation involving a dialog, there is probably a chatbot for that. Just look at the chatbot market - text-based email/SMS bots, voice-based bots, bots for customer support, transaction-based bots, entertainment bots and many others. A large number of enterprises, from startups to established organizations, are seeking to invest in this sector. This has also led to an increase in the number of platforms used for chatbot building. These frameworks incorporate AI techniques along with natural language processing capabilities to assist developers in building and deploying chatbots. Let’s start with how a chatbot typically works before diving into some of the frameworks. Understand: The first step for any chatbot is to understand the user input. This is made possible using pattern matching and intent classification techniques. ‘Intents’ are the tasks that users might want to perform with a chatbot. Machine learning, NLP and speech recognition techniques are typically used to identify the intent of the message and extract named entities. Entities are the specific pieces of information extracted from the user’s response i.e. the content associated with an intent. Respond: After understanding, the next goal is to generate a response. This is based on the current input message and the context of the conversation. After specifying the intents and entities, a dialog flow is constructed. This is basically the replies/feedback expected from a chatbot. Learn: Chatbots use AI techniques such as natural language understanding and pattern recognition to store and distinguish between the context of the information provided, and elicit a suitable response for future replies. This is important because different requests might have different meanings depending on previous requests. Top chatbot development frameworks A bot development framework is a set of predefined classes, functions, and utilities that a developer can use to build chatbots easier and faster. They vary in the level of complexity, integration capabilities, and functionalities. Let us look at some of the development platforms utilized for chatbot building. API.AI API.AI, a code based framework with a simple web-based interface, allows users to build engaging voice and text-based conversational apps using a large number of libraries and SDKs including Android, iOS, Webkit HTML5, Node.js, and Python API. It also supports nearly 32 one-click platform integrations such as Google, Facebook Messenger, Twitter and Skype to name a few. API.AI makes use of an agent - a container that transforms natural language based user requests into actionable data. The software tries to find the intent behind a user’s reply and matches it to the default or the closest match. After intent matching, it executes the actions and responses the developer has defined for that intent. API.AI also makes use of entities. Once the intents and entities are specified, the bot is trained. API.AI’s training module efficiently tracks each user’s request and lets developers see how they are parsed and matched to an intent. It also allows for correction of any errors and change requests thus retraining the bot. API.AI streamlines the entire bot-creating process by helping developers provide domain-specific knowledge that is unique to a bot’s needs while working on speech recognition, intent and context management in the backend. Google has recently partnered with API.AI to help them build conversational tools like Apple’s Siri. Microsoft Bot Framework Microsoft Bot Framework allows building and deployment of chatbots across multiple platforms and services such as web, SMS, non-Microsoft platforms, Office 365, Skype etc. The Bot Framework includes two components - The Bot Builder and the Microsoft Cognitive Services. The Bot Builder comprises of two full-featured SDKs - for the.NET and the Node.js platforms along with an emulator for testing and debugging. There’s also a set of RESTful APIs for building code in other languages. The SDKs support features for simple and easy interactions between bots. They also have a large collection of prebuilt sample bots for the developer to choose from. The Microsoft Cognitive Services is a collection of intelligent APIs that simplify a variety of AI tasks such as allowing the system to understand and interpret the user's needs using natural language in just a few lines of code. These APIs allow integration to most modern languages and platforms and constantly improve, learn, and get smarter. Microsoft created the AI Inner Circle Partner Program to work hand in hand with industry to create AI solutions. Their only partner in the UK is ICS.AI who build conversational AI solutions for the UK's public sector. ICS are the first choice for many organisations due to their smart solutions that scale and serve to improve services for the general public. Developers can build bots in the Bot Builder SDK using C# or Node.js. They can then add AI capabilities with Cognitive Services. Finally, they can register the bots on the developer portal, connecting it to users across platforms such as Facebook and Microsoft Teams and also deploy it on the cloud like Microsoft Azure. For a step-by-step guide for chatbot building using Microsoft Bot Framework, you can refer to one of our books on the topic. Sabre Corporation, a customer service provider for travel agencies, have recently announced the development of an AI-powered chatbot that leverages Microsoft Bot Framework and Microsoft Cognitive Services. Watson Conversation IBM’s Watson Conversation helps build chatbot solutions that understand natural-language input and use machine learning to respond to customers in a way that simulates conversations between humans. It is built on a neural network of one million Wikipedia words. It offers deployment across a variety of platforms including mobile devices, messaging platforms, and robots. The platform is robust and secure as IBM allows users to opt out of data sharing. The IBM Watson Tone Analyzer service can help bots understand the tone of the user’s input for better management of the experience. The basic steps to create a chatbot using Watson Conversation are as follows. We first create a workspace - a place for configuring information to maintain separate intents, user examples, entities, and dialogues for each application. One workspace corresponds to one bot. Next, we create Intents. Watson Conversation makes use of multiple conditioned responses to distinguish between similar intents. For example, instead of building specific intents for locations of different places, it creates a general intent “location” and adds an entity to capture the response, like the “location- bedroom” - to the right, near the stairs, “location-kitchen”- to the left. The third step is entity establishment. This involves grouping entities that might trigger a similar response in the dialog. The dialog flow, thus generated after specifying the intents and entities, goes through testing followed by embedding this into an application. It is then connected with other services by using the conversation API. Staples, an office supply retailing firm, uses Watson Conversation in their “Easy Systems” to simplify the customer’s shopping experience. CXP Designer and Aspect NLU Aspect Customer Experience Platform is an application lifecycle management tool to build text and voice-based applications such as chatbots. It provides deployment options across multiple communication channels like text, voice, mobile web and social media networks. The Aspect CXP typically includes a CXP designer to build chatbots and the inbuilt Aspect NLU to provide advanced natural language capabilities. CXP designer works by creating dialog objects to provide a menu of options for frontend as well as backend. Menu items for the frontend are used to create intents and modules within those intents. The developer can then modify labels (of those intents and modules) manually or use the Aspect NLU to disambiguate similar questions for successful extraction of meaning and intent. The Aspect NLU includes tools for spelling correction, linguistic lexicons such as nouns, verbs etc. and options for detecting and extracting common data types such as date, time, numbers, etc. It also allows a developer to modify the meaning extraction based on how they want it if they want it! CXP designer also allows skipping of certain steps in chatbots. For instance, if the user has already provided their tracking id for a particular package, the chatbot will skip the prompt of asking them the tracking id again. With Aspect CXP, developers can create and deploy complex chatbots. Radisson Blu Edwardian, a hotel in London, has collaborated with Aspect software to build an SMS based, AI virtual host. Conclusion Another popular chatbot development platform worth mentioning is the Facebook messenger with over 100,000 monthly active bots, but without cross-platform deployment features. The above bot frameworks are typically used by developers to build chatbots from scratch and require some programming skills. However, there has been a rise in automated bot development tools of late. Some of these include Chatfuel and Motion AI and typically involve drag and drop functionalities. With such tools, beginners and non-programmers can create and deploy chatbots within few minutes. But, they lack the extended functionalities supported by typical code based frameworks such as the flexibility to store data, produce analytics or incorporate customized AI tasks. Every chatbot development system, whether framework or tool, serves a different purpose. Choosing the right one depends on the type of application to build, organizational needs, and the developer’s expertise.
Read more
  • 0
  • 0
  • 7690

article-image-8-myths-rpa-robotic-process-automation
Savia Lobo
08 Nov 2017
9 min read
Save for later

8 Myths about RPA (Robotic Process Automation)

Savia Lobo
08 Nov 2017
9 min read
Many say we are on the cusp of the fourth industrial revolution that promises to blur the lines between the real, virtual and the biological worlds. Amongst many trends, Robotic Process Automation (RPA) is also one of those buzzwords surrounding the hype of the fourth industrial revolution. Although poised to be a $6.7 trillion industry by 2025, RPA is shrouded in just as much fear as it is brimming with potential. We have heard time and again how automation can improve productivity, efficiency, and effectiveness while conducting business in transformative ways. We have also heard how automation and machine-driven automation, in particular, can displace humans and thereby lead to a dystopian world. As humans, we make assumptions based on what we see and understand. But sometimes those assumptions become so ingrained that they evolve into myths which many start accepting as facts. Here is a closer look at some of the myths surrounding RPA. [dropcap]1[/dropcap] RPA means robots will automate processes The term robot evokes in our minds a picture of a metal humanoid with stiff joints that speaks in a monotone. RPA does mean robotic process automation. But the robot doing the automation is nothing like the ones we are used to seeing in the movies. These are software robots that perform routine processes within organizations. They are often referred to as virtual workers/digital workforce complete with their own identity and credentials. They essentially consist of algorithms programmed by RPA developers with an aim to automate mundane business processes. These processes are repetitive, highly structured, fall within a well-defined workflow, consist of a finite set of tasks/steps and may often be monotonous and labor intensive. Let us consider a real-world example here - Automating the invoice generation process. The RPA system will run through all the emails in the system, and download the pdf files containing details of the relevant transactions. Then, it would fill a spreadsheet with the details and maintain all the records therein. Later, it would log on to the enterprise system and generate appropriate invoice reports for each detail in the spreadsheet. Once the invoices are created, the system would then send a confirmation mail to the relevant stakeholders. Here, the RPA user will only specify the individual tasks that are to be automated, and the system will take care of the rest of the process. So, yes, while it is true that RPA involves robots automating processes, it is a myth that these robots are physical entities or that they can automate all processes. [dropcap]2[/dropcap] RPA is useful only in industries that rely heavily on software “Almost anything that a human can do on a PC, the robot can take over without the need for IT department support.” - Richard Bell, former Procurement Director at Averda RPA is a software which can be injected into a business process. Traditional industries such as banking and finance, healthcare, manufacturing etc that have significant tasks that are routine and depend on software for some of their functioning can benefit from RPA. Loan processing and patient data processing are some examples. RPA, however, cannot help with automating the assembly line in a manufacturing unit or with performing regular tests on patients. Even in industries that maintain daily essential utilities such as cooking gas, electricity, telephone services etc RPA can be put to use for generating automated bills, invoices, meter-readings etc. By adopting RPA, businesses irrespective of the industry they belong to can achieve significant cost savings, operational efficiency, and higher productivity. To leverage the benefits of RPA, rather than understanding the SDLC process, it is important that users have a clear understanding of business workflow processes and domain knowledge. Industry professionals can be easily trained on how to put RPA into practice. The bottom line - RPA is not limited to industries that rely heavily on software to exist. But it is true that RPA can be used only in situations where some form of software is used to perform tasks manually. [dropcap]3[/dropcap] RPA will replace humans in most frontline jobs Many organizations employ a large workforce in frontline roles to do routine tasks such as data entry operations, managing processes, customer support, IT support etc. But frontline jobs are just as diverse as the people performing them. Take sales reps for example. They bring new business through their expert understanding of the company’s products, their potential customer base coupled with the associated soft skills. Currently, they spend significant time on administrative tasks such as developing and finalizing business contracts, updating the CRM database, making daily status reports etc. Imagine the spike in productivity if these aspects could be taken off the plates of sales reps and they could just focus on cultivating relationships and converting leads. By replacing human efforts in mundane tasks within frontline roles, RPA can help employees focus on higher value-yielding tasks. In conclusion, RPA will not replace humans in most frontline jobs. It will, however, replace humans in a few roles that are very rule-based and narrow in scope such as simple data entry operators or basic invoice processing executives. In most frontline roles like sales or customer support, RPA is quite likely to change significantly at least in some ways how one sees their job responsibilities. Also, the adoption of RPA will generate new job opportunities around the development, maintenance, and sale of RPA based software. [dropcap]4[/dropcap] Only large enterprises can afford to deploy RPA The cost of implementing and maintaining the RPA software and training employees to use it can be quite high. This can make it an unfavorable business proposition for SMBs with fairly simple organizational processes and cross-departmental considerations. On the other hand, large organizations with higher revenue generation capacity, complex business processes, and a large army of workers can deploy an RPA system to automate high-volume tasks quite easily and recover that cost within a few months.   It is obvious that large enterprises will benefit from RPA systems due to the economies of scale offered and faster recovery of investments made. SMBs (Small to medium-sized businesses) can also benefit from RPA to automate their business processes. But this is possible only if they look at RPA as a strategic investment whose cost will be recovered over a longer time period of say 2-4 years. [dropcap]5[/dropcap] RPA adoption should be owned and driven by the organization's IT department The RPA team handling the automation process need not be from the IT department. The main role of the IT department is providing necessary resources for the software to function smoothly. An RPA reliability team which is trained in using RPA tools does not include IT professionals but rather business operations professionals. In simple terms, RPA is not owned by the IT department but by the whole business and is driven by the RPA team. [dropcap]6[/dropcap] RPA is an AI virtual assistant specialized to do a narrow set of tasks An RPA bot performs a narrow set of tasks based on the given data and instructions. It is a system of rule-based algorithms which can be used to capture, process and interpret streams of data, trigger appropriate responses and communicate with other processes. However, it cannot learn on its own - a key trait of an AI system. Advanced AI concepts such as reinforcement learning and deep learning are yet to be incorporated in robotic process automation systems. Thus, an RPA bot is not an AI virtual assistant, like Apple’s Siri, for example. That said, it is not impractical to think that in the future, these systems will be able to think on their own, decide the best possible way to execute a business process and learn from its own actions to improve the system. [dropcap]7[/dropcap] To use the RPA software, one needs to have basic programming skills Surprisingly, this is not true. Associates who use the RPA system need not have any programming knowledge. They only need to understand how the software works on the front-end, and how they can assign tasks to the RPA worker for automation. On the other hand, RPA system developers do require some programming skills, such as knowledge of scripting languages. Today, there are various platforms for developing RPA tools such as UIPath, Blueprism and more, which empower RPA developers to build these systems without any hassle, reducing their coding responsibilities even more. [dropcap]8[/dropcap] RPA software is fully automated and does not require human supervision This is a big myth. RPA is often misunderstood as a completely automated system. Humans are indeed required to program the RPA bots, to feed them tasks for automation and to manage them. The automation factor here lies in aggregating and performing various tasks which otherwise would require more than one human to complete. There’s also the efficiency factor which comes into play - the RPA systems are fast, and almost completely avoid faults in the system or the process that are otherwise caused due to human error. Having a digital workforce in place is far more profitable than recruiting human workforce. Conclusion One of the most talked about areas in terms of technological innovations, RPA is clearly still in its early days and is surrounded by a lot of myths. However, there’s little doubt that its adoption will take off rapidly as RPA systems become more scalable, more accurate and deploy faster. AI, cognitive, and Analytics-driven RPA will take it up a notch or two, and help the businesses improve their processes even more by taking away dull, repetitive tasks from the people. Hype can get ahead of the reality, as we've seen quite a few times - but RPA is an area definitely worth keeping an eye on despite all the hype.
Read more
  • 0
  • 0
  • 6199
article-image-learn-scikit-learn
Guest Contributor
23 Nov 2017
8 min read
Save for later

Why you should learn Scikit-learn

Guest Contributor
23 Nov 2017
8 min read
Today, machine learning in Python has become almost synonymous with scikit-learn. The "Big Bang" moment for scikit-learn was in 2007 when a gentleman named David Cournapeau decided to write this project as part of Google Summer of Code 2007. Let's take a moment to thank him. Matthieu Brucher later came on board and developed it further as part of his thesis. From that point on, sklearn never looked back. In 2010, the prestigious French research organization INRIA took ownership of the project with great developers like Gael Varoquaux, Alexandre Gramfort et al. starting work on it. Here's the oldest pull request I could find in sklearn’s repository. The title says "we're getting there"! Starting from there to today where sklearn receives funding and support from Google, Telecom ParisTech and Columbia University among others, it surely must’ve been quite a journey. Sklearn is an open source library which uses the BSD license. It is widely used in industry as well as in academia. It is built on Numpy, Scipy and Matplotlib while also having wrappers around various popular libraries such LIBSVM. Sklearn can be used “out of the box” after installation. Can I trust scikit-learn? Scikit-learn, or sklearn, is a very active open source project having brilliant maintainers. It is used worldwide by top companies such as Spotify, booking.com and the like. That it is open source where anyone can contribute might make you question the integrity of the code, but from the little experience I have contributing to sklearn, let me tell you only very high-quality code gets merged. All pull requests have to be affirmed by at least two core maintainers of the project. Every code goes through multiple iterations. While this can be time-consuming for all the parties involved, such regulations ensure sklearn’s compliance with the industry standard at all times. You don’t just build a library that’s been awarded the “best open source library” overnight! How can I use scikit-learn? Sklearn can be used for a wide variety of use-cases ranging from image classification to music recommendation to classical data modeling. Scikit-learn in various industries: In the Image classification domain, Sklearn’s implementation of K-Means along with PCA has been used for handwritten digit classification very successfully in the past. Sklearn has also been used for facial/ faces recognition using SVM with PCA. Image segmentation tasks such as detecting Red Blood Corpuscles or segmenting the popular Lena image into sections can be done using sklearn. A lot of us here use Spotify or Netflix and are awestruck by their recommendations. Recommendation engines started off with the collaborative filtering algorithm. It basically says “if people like me like something, I’ll also most probably like that.” To find out users with similar tastes, a KNN algorithm can be used which is available in sklearn. You can find a good demonstration of how it is used for music recommendation here. Classical data modeling can be bolstered using sklearn. Most people generally start their kaggle competitive journeys with the titanic challenge. One of the better tutorials out there on starting out is by dataquest and generally acts as a good introduction on how to use pandas and sklearn (a lethal combination!) for data science. It uses the robust Logistic Regression, Random Forest and the Ensembling modules to guide the user. You will be able to experience the user-friendliness of sklearn first hand while completing this tutorial. Sklearn has made machine learning literally a matter of importing a package. Sklearn also helps in Anomaly detection for highly imbalanced datasets (99.9% to 0.1% in credit card fraud detection) through a host of tools like EllipticEnvelope and OneClassSVM. In this regard, the recently merged IsolationForest algorithm especially works well in higher dimensional sets and has very high performance. Other than that, sklearn has implementations of some widely used algorithms such as linear regression, decision trees, SVM and Multi Layer Perceptrons (Neural Networks) to name a few. It has around 39 models in the “linear models” module itself! Happy scrolling here! Most of these algorithms can run very fast compared to raw python code since they are implemented in Cython and use Numpy and Scipy (which in-turn use C) for low-level computations. How is sklearn different from TensorFlow/MLllib? TensorFlow is a popular library to implement deep learning algorithms (since it can utilize GPUs). But while it can also be used to implement machine learning algorithms, the process can be arduous. For implementing logistic regression in TensorFlow, you will first have to “build” the logistic regression algorithm using a computational graph approach. Scikit-learn, on the other hand, provides the same algorithm out of the box however with the limitation that it has to be done in memory. Here's a good example of how LogisticRegression is done in Tensorflow. Apache Spark’s MLlib, on the other hand, consists of algorithms which can be used out of the box just like in Sklearn, however, it is generally used when the ML task is to be performed in a distributed setting. If your dataset fits into RAM, Sklearn would be a better choice for the task. If the dataset is massive, most people generally prototype on a small subset of the dataset locally using Sklearn. Once prototyping and experimentation are done, they deploy in the cluster using MLlib. Some sklearn must-knows Scikit-learn can be used for three different kinds of problems in machine learning namely supervised learning, unsupervised learning and reinforcement learning (ahem AlphaGo). Unsupervised learning happens when one doesn’t have ‘y’ labels in their dataset. Dimensionality reduction and clustering are typical examples. Scikit-learn has implementations of variations of the Principal Component Analysis such as SparsePCA, KernelPCA, and IncrementalPCA among others. Supervised learning covers problems such as spam detection, rent prediction etc. In these problems, the ‘y’ tag for the dataset is present. Models such as Linear regression, random forest, adaboost etc. are implemented in sklearn. From sklearn.linear_models import LogisticRegression Clf = LogisticRegression().fit(train_X, train_y) Preds = Clf.predict(test_X) Model evaluation and analysis Cross-validation, grid search for parameter selection and prediction evaluation can be done using the Model Selection and Metrics module which implements functions such as cross_val_score and f1_score respectively among others. They can be used as such: Import numpy as np From model_selection import cross_val_score From sklearn.metrics import f1_score Cross_val_avg = np.mean(cross_val_score(clf, train_X, train_y, scoring=’f1’)) # tune your parameters for better cross_val_score # for model results on a certain classification problem F_measure = f1_score(test_y, preds) Model Saving Simply pickle your model using pickle.save and it is ready to be distributed and deployed! Hence a whole machine learning pipeline can be built easily using sklearn. Finishing Remarks There are many good books out there talking about machine learning, but in context to Python,  Sebastian Raschka`s  (one of the core developers on sklearn) recently released his book titled “ Python Machine Learning” and it’s in great demand. Another great blog you could follow is Erik Bernhardsson’s blog. Along with writing about machine learning, he also discusses software development and other interesting ideas. Do subscribe to the scikit-learn mailing list as well. There are some very interesting questions posted there and a lot of learnings to take home. The machine learning subreddit also collates information from a lot of different sources and is thus a good place to find useful information. Scikit-learn has revolutionized the machine learning world by making it accessible to everyone. Machine learning is not like black magic anymore. If you use scikit-learn and like it, do consider contributing to sklearn. There is a huge clutter of open issues and PRs on the sklearn GitHub page. Scikit-learn needs contributors! Have a look at this page to start contributing. Contributing to a library is easily the best way to learn it! [author title="About the Author"]Devashish Deshpande started his foray into data science and machine learning in 2015 with an online course when the question of how machines can learn started intriguing him. He pursued more online courses as well as courses in data science during his undergrad. In order to gain practical knowledge he started contributing to open source projects beginning with a small pull request in Scikit-Learn. He then did a summer project with Gensim and delivered workshops and talks at PyCon France and India in 2016. Currently, Devashish works in the data science team at belong.co, India. Here's the link to his GitHub profile.[/author]
Read more
  • 0
  • 0
  • 5488

article-image-understanding-sentiment-analysis-and-other-key-nlp-concepts
Sunith Shetty
20 Dec 2017
12 min read
Save for later

Understanding Sentiment Analysis and other key NLP concepts

Sunith Shetty
20 Dec 2017
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. This book will help you learn to perform big data analytics tasks using machine learning concepts such as clustering, recommending products, data segmentation and more. [/box] With this post, you will learn what is sentiment analysis and how it is used to analyze emotions associated within the text. You will also learn key NLP concepts such as Tokenization, stemming among others and how they are used for sentiment analysis. What is sentiment analysis? One of the forms of text analysis is sentimental analysis. As the name suggests this technique is used to figure out the sentiment or emotion associated with the underlying text. So if you have a piece of text and you want to understand what kind of emotion it conveys, for example, anger, love, hate, positive, negative, and so on you can use the technique sentimental analysis. Sentimental analysis is used in various places, for example: To analyze the reviews of a product whether they are positive or negative This can be especially useful to predict how successful your new product is by analyzing user feedback To analyze the reviews of a movie to check if it's a hit or a flop Detecting the use of bad language (such as heated language, negative remarks, and so on) in forums, emails, and social media To analyze the content of tweets or information on other social media to check if a political party campaign was successful or not  Thus, sentimental analysis is a useful technique, but before we see the code for our sample sentimental analysis example, let's understand some of the concepts needed to solve this problem. [box type="shadow" align="" class="" width=""]For working on a sentimental analysis problem we will be using some techniques from natural language processing and we will be explaining some of those concepts.[/box] Concepts for sentimental analysis Before we dive into the fully-fledged problem of analyzing the sentiment behind text, we must understand some concepts from the NLP (Natural Language Processing) perspective. We will explain these concepts now. Tokenization From the perspective of machine learning one of the most important tasks is feature extraction and feature selection. When the data is plain text then we need some way to extract the information out of it. We use a technique called tokenization where the text content is pulled and tokens or words are extracted from it. The token can be a single word or a group of words too. There are various ways to extract the tokens, as follows: By using regular expressions: Regular expressions can be applied to textual content to extract words or tokens from it. By using a pre-trained model: Apache Spark ships with a pre-trained model (machine learning model) that is trained to pull tokens from a text. You can apply this model to a piece of text and it will return the predicted results as a set of tokens. To understand a tokenizer using an example, let's see a simple sentence as follows: Sentence: "The movie was awesome with nice songs" Once you extract tokens from it you will get an array of strings as follows: Tokens: ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs'] [box type="shadow" align="" class="" width=""]The type of tokens you extract depends on the type of tokens you are interested in. Here we extracted single tokens, but tokens can also be a group of words, for example, 'very nice', 'not good', 'too bad', and so on.[/box] Stop words removal Not all the words present in the text are important. Some words are common words used in the English language that are important for the purpose of maintaining the grammar correctly, but from conveying the information perspective or emotion perspective they might not be important at all, for example, common words such as is, was, were, the, and so. To remove these words there are again some common techniques that you can use from natural language processing, such as: Store stop words in a file or dictionary and compare your extracted tokens with the words in this dictionary or file. If they match simply ignore them. Use a pre-trained machine learning model that has been taught to remove stop words. Apache Spark ships with one such model in the Spark feature package. Let's try to understand stop words removal using an example: Sentence: "The movie was awesome" From the sentence we can see that common words with no special meaning to convey are the and was. So after applying the stop words removal program to this data you will get: After stop words removal: [ 'movie', 'awesome', 'nice', 'songs'] [box type="shadow" align="" class="" width=""]In the preceding sentence, the stop words the, was, and with are removed.[/box] Stemming Stemming is the process of reducing a word to its base or root form. For example, look at the set of words shown here: car, cars, car's, cars' From our perspective of sentimental analysis, we are only interested in the main words or the main word that it refers to. The reason for this is that the underlying meaning of the word in any case is the same. So whether we pick car's or cars we are referring to a car only. Hence the stem or root word for the previous set of words will be: car, cars, car's, cars' => car (stem or root word) For English words again you can again use a pre-trained model and apply it to a set of data for figuring out the stem word. Of course there are more complex and better ways (for example, you can retrain the model with more data), or you have to totally use a different model or technique if you are dealing with languages other than English. Diving into stemming in detail is beyond the scope of this book and we would encourage readers to check out some documentation on natural language processing from Wikipedia and the Stanford nlp website. [box type="shadow" align="" class="" width=""]To keep the sentimental analysis example in this book simple we will not be doing stemming of our tokens, but we will urge the readers to try the same to get better predictive results.[/box] N-grams Sometimes a single word conveys the meaning of context, other times a group of words can convey a better meaning. For example, 'happy' is a word in itself that conveys happiness, but 'not happy' changes the picture completely and 'not happy' is the exact opposite of 'happy'. If we are extracting only single words then in the example shown before, that is 'not happy', then 'not' and 'happy' would be two separate words and the entire sentence might be selected as positive by the classifier However, if the classifier picks the bi-grams (that is, two words in one token) in this case then it would be trained with 'not happy' and it would classify similar sentences with 'not happy' in it as 'negative'. Therefore, for training our models we can either use a uni-gram or a bi-gram where we have two words per token or as the name suggest an n-gram where we have 'n' words per token, it all depends upon which token set trains our model well and it improves its predictive results accuracy. To see examples of n-grams refer to the following table:   Sentence The movie was awesome with nice songs Uni-gram ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs'] Bi-grams ['The movie', 'was awesome', 'with nice', 'songs'] Tri-grams ['The movie was', 'awesome with nice', 'songs']   For the purpose of this case study we will be only looking at unigrams to keep our example simple. By now we know how to extract words from text and remove the unwanted words, but how do we measure the importance of words or the sentiment that originates from them? For this there are a few popular approaches and we will now discuss two such approaches. Term presence and term frequency Term presence just means that if the term is present we mark the value as 1 or else 0. Later we build a matrix out of it where the rows represent the words and columns represent each sentence. This matrix is later used to do text analysis by feeding its content to a classifier. Term Frequency, as the name suggests, just depicts the count or occurrences of the word or tokens within the document. Let's refer to the example in the following table where we find term frequency:   Sentence The movie was awesome with nice songs and nice dialogues. Tokens (Unigrams only for now) ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs', 'and', 'dialogues'] Term Frequency ['The = 1', 'movie = 1', 'was = 1', 'awesome = 1', 'with = 1', 'nice = 2', 'songs = 1', 'dialogues = 1']   As seen in the preceding table, the word 'nice' is repeated twice in the preceding sentence and hence it will get more weight in determining the opinion shown by the sentence. Bland term frequency is not a precise approach for the following reasons: There could be some redundant irrelevant words, for example, the, it, and they that might have a big frequency or count and they might impact the training of the model There could be some important rare words that could convey the sentiment regarding the document yet their frequency might be low and hence they might not be inclusive for greater impact on the training of the model Due to this reason, a better approach of TF-IDF is chosen as shown in the next sections. TF-IDF TF-IDF stands for Term Frequency and Inverse Document Frequency and in simple terms it means the importance of a term to a document. It works using two simple steps as follows: It counts the number of terms in the document, so the higher the number of terms the greater the importance of this term to the document. Counting just the frequency of words in a document is not a very precise way to find the importance of the words. The simple reason for this is there could be too many stop words and their count is high so their importance might get elevated above the importance of real good words. To fix this, TF-IDF checks for the availability of these stop words in other documents as well. If the words appear in other documents as well in large numbers that means these words could be grammatical words such as they, for, is, and so on, and TF-IDF decreases the importance or weight of such stop words. Let's try to understand TF-IDF using the following figure: As seen in the preceding figure, doc-1, doc-2, and so on are the documents from which we extract the tokens or words and then from those words we calculate the TF-IDFs. Words that are stop words or regular words such as for , is, and so on, have low TF-IDFs, while words that are rare such as 'awesome movie' have higher TF-IDFs. TF-IDF is the product of Term Frequency and Inverse document frequency. Both of them are explained here: Term Frequency: This is nothing but the count of the occurrences of the words in the document. There are other ways of measuring this, but the simplistic approach is to just count the occurrences of the tokens. The simple formula for its calculation is:      Term Frequency = Frequency count of the tokens Inverse Document Frequency: This is the measure of how much information the word provides. It scales up the weight of the words that are rare and scales down the weight of highly occurring words. The formula for inverse document frequency is: TF-IDF: TF-IDF is a simple multiplication of the Term Frequency and the Inverse Document Frequency. Hence: This simple technique is very popular and it is used in a lot of places for text analysis. Next let's look into another simple approach called bag of words that is used in text analytics too. Bag of words As the name suggests, bag of words uses a simple approach whereby we first extract the words or tokens from the text and then push them in a bag (imaginary set) and the main point about this is that the words are stored in the bag without any particular order. Thus the mere presence of a word in the bag is of main importance and the order of the occurrence of the word in the sentence as well as its grammatical context carries no value. Since the bag of words gives no importance to the order of words you can use the TF-IDFs of all the words in the bag and put them in a vector and later train a classifier (naïve bayes or any other model) with it. Once trained, the model can now be fed with vectors of new data to predict on its sentiment. Summing it up, we have got you well versed with sentiment analysis techniques and NLP concepts in order to apply sentimental analysis. If you want to implement machine learning algorithms to carry out predictive analytics and real-time streaming analytics you can refer to the book Big Data Analytics with Java.    
Read more
  • 0
  • 0
  • 5263