Big Data | 50 articles | Tech News, Tutorials & Expert Insights

21 Nov 2019

11 min read

What does a data science team look like?

21 Nov 2019

Until a couple of years ago, people barely knew the term 'data science' which has now evolved into an extremely popular career field. The Harvard Business Review dubbed data scientist within the data science team as the sexiest job of the 21st century and expert professionals jumped on the data is the new oil bandwagon. As per the Figure Eight Report 2018, which takes the pulse of the data science community in the US, a lot has changed rapidly in the data science field over the years. For the 2018 report, they surveyed approximately 240 data scientists and found out that machine learning projects have multiplied and more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs. And the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, it is evident for data science teams to evolve and change among various organizations. The data science team is responsible for delivering complex projects where system analysis, software engineering, data engineering, and data science is used to deliver the final solution. To achieve all of this, the team does not only have a data scientist or a data analyst but also includes other roles like business analyst, data engineer or architect, and chief data officer. In this post, we will differentiate and discuss various job roles within a data science team, skill sets required and the compensation benefit for each one of them. For an in-depth understanding of data science teams, read the book, Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps. Now let's get into understanding individual data science roles and functions, but before that we take a look at the structure of the team.There are three basic team structures to match different stages of AI/ML adoption: IT centric team structure At times for companies hiring a data science team is not an option, and they have to leverage in-house talent. During such situations, they take advantage of the fully functional in-house IT department. The IT team manages functions like data preparation, training models, creating user interfaces, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold. The documentation helps in fast training and early deployment of models even without an expert data scientists on board. Integrated team structure Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resource is the most viable option for constant and scalable machine learning operations. Unlike the IT centric approach, the integrated method requires having an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques. Additionally, the team leverages deeper understanding of machine learning tools and libraries – like TensorFlow or Theano which are specifically for researchers and data science experts. Specialized data science team Companies can also have an independent data science department to build an all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. It doesn't necessarily mean that all team members should have a data science background, but they should have technology background with certain service management skills. A specialized structure model aids in addressing complex data science tasks that include research, use of multiple ML models tailored to various aspects of decision-making, or multiple ML backed services. Today's most successful Silicon Valley tech operates with specialized data science teams. Additionally they are custom-built and wired for specific tasks to achieve different business goals. For example, the team structure at Airbnb is one of the most interesting use cases. Martin Daniel, a data scientist at Airbnb in this talk explains how the team emphasizes on having an experimentation-centric culture and apply machine learning rigorously to address unique product challenges. Job roles and responsibilities within data science team As discussed earlier, there are many roles within a data science team. As per Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis. Individuals involved in Type A are statisticians that make sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc. Type B stands for building. These individuals use data in production. They're good software engineers with strong programming knowledge and statistics background. They build recommendation systems, personalization use cases, etc. Though it is rare that one expert will fit into a single category. But understanding these data science functions can help make sense of the roles described further. Chief data officer/Chief analytics officer The chief data officer (CDO) role has been taking organizations by storm. A recent NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization appointed a chief data officer. The role of chief data officer involves overseeing a range of data-related functions that may include data management, ensuring data quality and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, it is often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and how a company should treat its data. It does make sense as analytics provide insights and value to the data. Hence, with a CDO+CAO combination companies can take advantage of a good data strategy and proper data management without losing on quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually. Skill sets required: Data science and analytics, programming skills, domain expertise, leadership and visionary abilities are required. Data analyst The data analyst role implies proper data collection and interpretation activities. The person in this job role will ensure that collected data is relevant and exhaustive while also interpreting the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States. Skill sets required: Programming languages like R, Python, JavaScript, C/C++, SQL. With this critical thinking, data visualization and presentation skills will be good to have. Data scientist Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore what problems are needed to be solved. A data scientist is an individual who develops machine learning models to make predictions and is well versed in algorithm development and computer science. This person will also know the complete lifecycle of the model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data, using various types of analytics and reporting tools to detect patterns, trends and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709. Skills set required: A data scientist will require knowledge of big data platforms and tools like Seahorse powered by Apache Spark, JupyterLab, TensorFlow and MapReduce; and programming languages that include SQL, Python, Scala and Perl; and statistical computing languages, such as R. They should also have cloud computing capabilities and knowledge of various cloud platforms like AWS, Microsoft Azure etc.You can also read this post on how to ace a data science interview to know more. Machine learning engineer At times a data scientist is confused with machine learning engineers, but a machine learning engineer is a distinct role that involves different responsibilities. A machine learning engineer is someone who is responsible for combining software engineering and machine modeling skills. This person determines which model to use and what data should be used for each model. Probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer's salary is $146,085 in the US, and is ranked No.1 on the Indeed's Best Jobs in 2019 list. Skill sets required: Machine learning engineers will be required to have expertise in computer science and programming languages like R, Python, Scala, Java etc. They would also be required to have probability techniques, data modelling and evaluation techniques. Data architects and data engineers The data architects and data engineers work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer can use to build a digital framework. The data engineering role has recently evolved from the traditional software-engineering field. Recent enterprise data management experiments indicate that the data-focused software engineers are needed to work along with the data architects to build a strong data architecture. Average salary for a data architect in the US ranges from $1,22,000 to $1,29, 000 annually as per a recent LinkedIn survey. Skill sets required: A data architect or an engineer should have a keen interest and experience in programming languages frameworks like HTML5, RESTful services, Spark, Python, Hive, Kafka, and CSS etc. They should have the required knowledge and experience to handle database technologies such as PostgreSQL, MapReduce and MongoDB and visualization platforms such as; Tableau, Spotfire etc. Business analyst A business analyst (BA) basically handles Chief analytics officer's role but on the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap. They are responsible for using data analytics to assess processes, determine requirements and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes will be implemented to processes, products, services, software and hardware. They further articulate these ideas and balance them against technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed. Skill sets required: Excellent domain and industry expertise will be required. With this good communication as well as data visualization skills and knowledge of business intelligence tools will be good to have. Data visualization engineer This specific role is not present in each of the data science teams as some of the responsibilities are realized by either a data analyst or a data architect. Hence, this role is only necessary for a specialized data science model. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for your stakeholders. Regardless of the technology, successful data visualization engineers have to understand principles of design, both graphical and more generally user-centered design. As per Payscale, the average salary for a data visualization engineer is $98,264. Skill sets required: A data visualization engineer need to have rigorous knowledge of data visualization methods and be able to produce various charts and graphs to represent data. Additionally they must understand the fundamentals of design principles and visual display of information. To sum it up, a data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building up the team from scratch and find it hard to figure where to start from. If you are facing a similar dilemma, check out this book, Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing and sustaining a successful data science team. How to learn data science: from data mining to machine learning How to ace a data science interview Data science vs. machine learning: understanding the difference and what it means today 30 common data science terms explained 9 Data Science Myths Debunked

0
0
13283

Guest Contributor

08 Sep 2018

8 min read

How to secure your crypto currency

Guest Contributor

08 Sep 2018

8 min read

Managing and earning cryptocurrency is a lot of hassle and losing it is a lot like losing yourself. While security of this blockchain based currency is a major concern, here is what you can do to secure your crypto fortune. With the ever fluctuating crypto-rates, every time, it’s now or never. While Bitcoin climbed up to $17,900 in the past, the digital currency frenzy is always in-trend and its security is crucial. No crypto geek wants to lose their currency due to malicious activities, negligence or any other reason. Before we delve into securing our crypto currencies, lets discuss the structure and strategy of this crypto vault that ensures the absolute security of a blockchain based digital currency. Why blockchains are secure, at least, in theory Below are the three core elements that contribute in making blockchain a fool proof digital technology. Public key cryptography Hashing Digital signatures Public Key Cryptography This cryptography involves two distinctive keys i.e., private and public keys. Both keys decrypt and encrypt data asymmetrically. Both have simultaneous dependency of data which is encrypted by a private key and can only be decrypted with the public key. Similarly, data decrypted by public key can only be decrypted by a private key. Various cryptography schemes including TLS (Transport Layer Security protocol) and SSL (Secure Sockets Layer) have this system at its core. The strategy works well with you putting in your public key into the world of blockchain and keeping your private key confidential, not revealing it on any platform or place. Hashing Also called a digest, the hash of a message gets calculated on the basis of the contents of a message. The hashing algorithm generates a hash that is created deterministically. Data of an arbitrary length acts an input to the hashing algorithm. The outcome of this complex process is known as a calculated amount of hash with a predefined length. Due to its deterministic nature, the input and output are the same. Considering mathematical calculations, it’s easy to convert a message into hash but when it comes to obtaining an original message from hash, it is tediously difficult. Digital Signatures A digital signature is an encrypted form of hash of a message and is an outcome of a private key. Anyone who has the access to the public key can break into the digital signature by decrypting it and this can be used to get the original hash. Anyone who can read the message can calculate the hash of a message on its own. The independently calculated hash can be compared with the decrypted hash to ensure both the hashes are the same. If they both match, it is a confirmation that the message remains unaltered from creation to reception. Additionally, it is a sign of a relating private key digitally signing the message. A hash is extracted from a message and if a message gets altered, it will produce a different type of hash. Note that it is complex to reverse the process to find the message of a hash but it’s easy to compute the hash of a message. A hash that is encrypted by a private key is known as digital signature. Anyone having a public key can decrypt a digital signature and they have the ability to compare the digital signature with a calculated hash of the message. If the value of an original message is active and the message is signed by the entity having the private key, it means that the hashes are identical. What are Crypto wallets and transactions Every crypto-wallet is a combined collection of single or more wallets. A crypto-wallet is a private key and it can create a public key too. By using a public key, a public wallet address can be easily created. This makes a cryptocurrency wallet a set of private keys. To enable sharing wallet address with the public, they are converted into QR codes eliminated the need to maintain secrecy. One can always show QR codes to the world without any hesitation and anyone can send cryptocurrency using that wallet address. However, a cryptocurrency transaction needs a private key and currency sent into a wallet is owned by the owner of the wallet. In order to transact using cryptocurrency, a transaction is created that is public information. A transaction of crypto currency is a collection of information a blockchain needs. The only needed data for a transaction is the destination wallet’s address and the desired amount to be transferred. While anyone can transact in cryptocurrency, the transactions are only permitted by the blockchain if it is assured by multiple members in the network. A transaction should be digitally signed by a private key in order to get a valid status or else, it would be treated as invalid. In other words, one signs a transaction with the private key and then it gets to the blockchain. Once the blockchain accepts the key by confirming the public key data, it gets included in the blockchain that validates the transaction. Why you should guard your private key An attack on your private key is an attempt to steal your cryptocurrency. By using your private keys, an attacker attempts to digitally sign transactions from your wallet address to their address. Moreover, an attacker can destroy your private keys thus ending your access to your crypto wallet. What are some risk factors involved in owning a crypto wallet Before we move on to creating a security wall around our crypto currency, it is important to know from whom we are protecting our digital currency or who can prove to be a threat for our crypto wallets. If you lose the access to your crypto currency, you have lost it all as there isn’t any ledger with a centralized authority and once you lose the access, you can't regain it by any means. Since a crypto wallet is paired by a private and public key, losing the private key means losing your wallet. In other words, you don’t own any cryptocurrency. This is the very first and foremost threat. The next in line threat is what we hear often. Attackers, hackers or attempters who want to gain access to our cryptocurrency. The malfunctions may be opportunist or they may have their private intentions. Threats for your cryptocurrency Opportunist hackers are low profile attackers who get access to your laptop for transacting money to their public wallet address. Opportunist hackers doesn’t attack or target a person specifically, but if they get access to your crypto currency, they won’t shy away from taking your digital cash. Dedicated attackers, on the other hand, target single handedly or they may be in a group of hackers who work together for a sole purpose that is – stealing cryptocurrency. Their targets include every individual, crypto trader or even a crypto exchange. They initiate phishing campaigns and before executing the attack, they get well-versed with their target by conducting a pre-research. Level 2 attackers go for a broader approach and write malicious code that may steal private keys from a system if it gets attacked or infected. Another kind of hackers are backed by nation states. They are a collective group of people with top level coordination and established financials. They are motivated by gaining access to finances or their will. The crypto currency attacks by Lazarus Group, backed by the North Korea, are an example. How to Protect Your crypto wallet Regardless of the kind of threat, it is you and your private key that needs to be secured. Here’s how to ensure maximum security of your cryptocurrency. Throw away your access keys and you will lose your cryptocurrency forever. Obviously, you won’t do it ever and since the aforementioned thought came into your mind after reading the phrase, here are some other ways to secure your cryptocurrency fortune. Go through the complete password recovery process. This means going through the process of forgetting the password and creating a multi-factor token. These measures should be taken while setting up a new hosted wallet or else, be prepared to lose it all. No matter how fast the tech world progresses, basics will remain the same. You should have a printed paper backup of your keys and they should be placed in a secure location such as a bank’s locker or in a personal safe vault. Don’t forget to wipe out the printer’s memory after you are done with printing as printed files can be restored and re used to hack your digital money. Do not keeps those keys with you nor should you be hiding those keys in a closet that can get damaged due to fire, theft, etc. If your wallet has multi-signature enabled on it and has two public or private keys for the authorization of transactions, make it to three keys. While the third key will be controlled by an entrusted party, it will help you in the absence of a second person. About Author Tahha Ashraf is a Digital Content Producer at Cubix, a mobile app development company. He is a Certified Hubspot inbound and content marketer. He loves talking about brands, tech, blockchain and content marketing. Along with writing for the online fraternity on a variety of topics, he is fond of creativity and writes poetry in his free time. Cryptocurrency-based firm, Tron acquires BitTorrent Can Cryptocurrency establish a new economic world order? Akon is planning to create a cryptocurrency city in Senegal

0
0
4921

Sunith Shetty

10 Aug 2018

8 min read

Budget and Demand Forecasting using Markov model in SAS [Tutorial]

Sunith Shetty

10 Aug 2018

8 min read

0
1
7001

article-image-6-reasons-to-choose-mysql-8-for-designing-database-solutions

Amey Varangaonkar

08 May 2018

4 min read

6 reasons to choose MySQL 8 for designing database solutions

Amey Varangaonkar

08 May 2018

4 min read

Whether you are a standalone developer or an enterprise consultant, you would obviously choose a database that provides good benefits and results when compared to other related products. MySQL 8 provides numerous advantages as the first choice in this competitive market. It has various powerful features available that make it a more comprehensive database. Today we will go through the benefits of using MySQL as the preferred database solution: [box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator’s Guide, co-authored by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. This book presents step-by-step techniques on managing, monitoring and securing the MySQL database without any hassle.[/box] Security The first thing that comes to mind is securing data because nowadays data has become precious and can impact business continuity if legal obligations are not met; in fact, it can be so bad that it can close down your business in no time. MySQL is the most secure and reliable database management system used by many well-known enterprises such as Facebook, Twitter, and Wikipedia. It really provides a good security layer that protects sensitive information from intruders. MySQL gives access control management so that granting and revoking required access from the user is easy. Roles can also be defined with a list of permissions that can be granted or revoked for the user. All user passwords are stored in an encrypted format using plugin-specific algorithms. Scalability Day by day, the mountain of data is growing because of extensive use of technology in numerous ways. Because of this, load average is going through the roof. In some cases, it is unpredictable that data cannot exceed up to some limit or number of users will not go out of bounds. Scalable databases would be a preferable solution so that, at any point, we can meet unexpected demands to scale. MySQL is a rewarding database system for its scalability, which can scale horizontally and vertically; in terms of data, spreading database and load of application queries across multiple MySQL servers is quite feasible. It is pretty easy to add horsepower to the MySQL cluster to handle the load. An open source relational database management system MySQL is an open source database management system that makes debugging, upgrading, and enhancing the functionality fast and easy. You can view the source and make the changes accordingly and use it in your own way. You can also distribute an extended version of MySQL, but you will need to have a license for this. High performance MySQL gives high-speed transaction processing with optimal speed. It can cache the results, which boosts read performance. Replication and clustering make the system scalable for more concurrency and manages the heavy workload. Database indexes also accelerate the performance of SELECT query statements for substantial amount of data. To enhance performance, MySQL 8 has included indexes in performance schema to speed up data retrieval. High availability Today, in the world of competitive marketing, an organization's key point is to have their system up and running. Any failure or downtime directly impacts business and revenue; hence, high availability is a factor that cannot be overlooked. MySQL is quite reliable and has constant availability using cluster and replication configurations. Cluster servers instantly handle failures and manage the failover part to keep your system available almost all the time. If one server gets down, it will redirect the user's request to another node and perform the requested operation. Cross-platform capabilities MySQL provides cross-platform flexibility that can run on various platforms such as Windows, Linux, Solaris, OS2, and so on. It has great API support for the all major languages, which makes it very easy to integrate with languages such as PHP, C++, Perl, Python, Java, and so on. It is also part of the Linux Apache MySQL PHP (LAMP) server that is used worldwide for web applications. That’s it then! We discussed few important reasons of MySQL being the most popular relational database in the world and is widely adopted across many enterprises. If you want to learn more about MySQL’s administrative features, make sure to check out the book MySQL 8 Administrator’s Guide today! 12 most common MySQL errors you should be aware of Top 10 MySQL 8 performance benchmarking aspects to know

0
0
5734

Aaron Lazar

23 Apr 2018

5 min read

Why is Hadoop dying?

Aaron Lazar

23 Apr 2018

5 min read

Hadoop has been the definitive big data platform for some time. The name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why is it? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful? Hadoop was essential to the growth of big data When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to data, as against bringing data to compute. Organisations had the opportunity to scale their data without having to worry too much about the cost. It obviously had initial hiccups with security, the complexity of querying and querying speeds, but all that was taken care off, in the long run. Still, although querying speeds remained quite a pain, however that wasn’t the real reason behind Hadoop dying (slowly). As cloud grew, Hadoop started falling One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. There cloud vendor market was pretty crowded, and each of them provided their own big data processing services. These services all basically did what Hadoop was doing. But they also did it in an even more efficient and hassle-free way. Customers didn't have to think about administration, security or maintenance in the way they had to with Hadoop. One person’s big data is another person’s small data Well, this is clearly a fact. Several organisations that used big data technologies without really gauging the amount of data they actually would need to process, have suffered. Imagine sitting with 10TB Hadoop clusters when you don’t have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their massive use of Hadoop. Customers weren’t pleased with nature of Hadoop’s limitations. Apache Hadoop v Apache Spark Hadoop processing is way behind in terms of processing speed. In 2014 Spark took the world by storm. I’m going to let you guess which line in the graph above might be Hadoop, and which might be Spark. Spark was a general purpose, easy to use platform that was built after studying the pitfalls of Hadoop. Spark was not bound to just the HDFS (Hadoop Distributed File System) which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes; a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas Spark wins are Iterative Algorithms in Machine Learning, Interactive Data Mining and Data Processing, Stream processing, Sensor data processing, etc. Machine Learning in Hadoop is not straightforward Unlike MLlib in Spark, Machine Learning is not possible in Hadoop unless tied with a 3rd party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of 3 R packages, have grown for ML, but it still is nowhere comparable to the power of the modern day MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe. Hadoop is not only Hadoop The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, Hbase, Spark, Kudu, Impala, and 20 other products. While e folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other. As with Cloudera and Hortonworks, though the market has seen a downward trend, they’re in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark. Is Hadoop dying? Perhaps not... In the long run, it’s not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much improved version of the framework. Some of the most noteworthy features are its improved shell script, more powerful YARN, improved fault tolerance with erasure coding, and many more. Although, that hasn’t caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use another alternative like Spark along with another framework from the Hadoop family. So, Hadoop’s not going away anytime soon. Read More Pandas is an effective tool to explore and analyze data - Interview insights

0
1
18259

article-image-analyzing-the-chief-data-officer-role

Aaron Lazar

13 Apr 2018

9 min read

GDPR is pushing the Chief Data Officer role center stage

Aaron Lazar

13 Apr 2018

9 min read

0
0
7121

article-image-top-5-programming-languages-big-data

Amey Varangaonkar

04 Apr 2018

8 min read

Top 5 programming languages for crunching Big Data effectively

Amey Varangaonkar

04 Apr 2018

8 min read

One of the most important decisions that Big Data professionals have to make, especially the ones who are new to the scene or are just starting out, is choosing the best programming languages for big data manipulation and analysis. Understanding the Big Data problem and framing the architecture to solve it is not quite enough these days - the execution needs to be perfect as well, and choosing the right language goes a long way. The best languages for big data In this article, we look at the 5 of the most popularly used - not to mention highly effective - programming languages for developing Big Data solutions. Scala A beautiful crossover of the object-oriented and functional programming paradigms, Scala is fast and robust, and a popular choice of language for many Big Data professionals.The fact that two of the most popular Big Data processing frameworks in Apache Spark and Apache Kafka have been built on top of Scala tells you everything you need to know about the power of Scala. Scala runs on the JVM, which means the codes written in Scala can be easily used within a Java-based Big Data ecosystem. One significant factor that differentiates Scala from Java, though, is that Scala is a lot less verbose in comparison. You can write 100s of lines of confusing-looking Java code in less than 15 lines in Scala. One negative aspect of Scala, though, is its steep learning curve when compared to languages like Go and Python, and this may put off beginners looking to use it. Why use Scala for big data? Fast and robust Suitable for working with Big Data tools like Apache Spark for distributed Big Data processing JVM compliant, can be used in a Java-based ecosystem Python Python has been declared as one of the fastest growing programming languages in 2018 as per the recently held Stack Overflow Developer Survey. Its general-purpose nature means it can be used across a broad spectrum of use-cases, and Big Data programming is one major area of application. Many libraries for data analysis and manipulation which are increasingly being used in a Big Data framework to clean and manipulate large chunks of data, such as pandas, NumPy, SciPy - are all Python-based. Not just that, most popular machine learning and deep learning frameworks such as scikit-learn, Tensorflow and many more, are also written in Python and are finding increasing application within the Big Data ecosystem. One drawback of using Python, and a reason why it is not a first-class citizen when it comes to Big Data programming yet, is that it’s slow. Although very easy to use, Big Data professionals have found systems built with languages such as Java or Scala faster and more robust to use than the systems built with Python. However, Python makes up for this limitation with other qualities. As Python is primarily a scripting language, interactive coding and development of analytical solutions for Big Data becomes very easy. Python can integrate effortlessly with the existing Big Data frameworks such as Apache Hadoop and Apache Spark, allowing you to perform predictive analytics at scale without any problem. Why use Python for big data? General-purpose Rich libraries for data analysis and machine learning Easy to use Supports iterative development Rich integration with Big Data tools Interactive computing through Jupyter notebooks R It won’t come as a surprise to many that those who love statistics, love R. The ‘language of statistics’ as it is popularly called as, R is used to build data models which can be used for effective and accurate data analysis. Powered by a large repository of R packages (CRAN, also called as Comprehensive R Archive Network), with R you have just about every type of tool to accomplish any task in Big Data processing - right from analysis to data visualization. R can be integrated seamlessly with Apache Hadoop and Apache Spark, among other popular frameworks, for Big Data processing and analytics. One issue with using R as a programming language for Big Data is that it is not very general-purpose. It means the code written in R is not production-deployable and generally has to be translated to some other programming language such as Python or Java. That said, if your goal is to only build statistical models for Big Data analytics, R is an option you should definitely consider. Why use R for big data? Built for data science Support for Hadoop and Spark Strong statistical modeling and visualization capabilities Support for Jupyter notebooks Java Last, but not the least, there’s always the good old Java. Some of the traditional Big Data frameworks such as Apache Hadoop and all the tools within its ecosystem are all Java-based, and still in use today in many enterprises. Not to mention the fact that Java is the most stable and production-ready language among all the languages we have discussed so far! Using Java to develop your Big Data applications gives you the ability to use a large ecosystem of tools and libraries for interoperability, monitoring and much more, most of which have already been tried and tested. One major drawback of Java is its verbosity. The fact that you have to write hundreds of lines of codes in Java for a task which can written in barely 15-20 lines of code in Python or Scala, can turnoff many budding programmers. However, the introduction of lambda functions in Java 8 does make life quite easier. Java also does not support iterative development unlike newer languages like Python, and this is an area of focus for the future Java releases. Despite the flaws, Java remains a strong contender when it comes to the preferred language for Big Data programming because of its history and the continued reliance on the traditional Big Data tools and frameworks. Why use Java for big data? Traditional Big Data tools and frameworks are written in Java Stable and production-ready Large ecosystem of tried and tested tools and libraries Go Last but not the least, there’s Go - one of the fastest rising programming languages in recent times. Designed by a group of Google engineers who were frustrated with C++, we think Go is a good shout in this list - simply because of the fact that it powers so many tools used in the Big Data infrastructure, including Kubernetes, Docker and many more. Go is fast, easy to learn, and fairly easy to develop applications with, not to mention deploy them. More importantly, as businesses look at building data analysis systems that can operate at scale, Go-based systems are being used to integrate machine learning and parallel processing of data. It is also possible to interface other languages with Go-based systems with relative ease. Why use Go for big data? Fast, easy to use Many tools used in the Big Data infrastructure are Go-based Efficient distributed computing There are a few other languages you might want to consider - Julia, SAS and MATLAB being some major ones which are useful in their own right. However, when compared to the languages we talked about above, we thought they fell a bit short in some aspects - be it speed, efficiency, ease of use, documentation, or community support, among other things. Let’s take a quick look at the comparison table of all the languages we discussed above. Note that we have used the ✓ symbol for the best possible language/s to help you make an informed decision. This is just our view, and that’s not to say that the other languages are any worse! Scala Python R Java Go Speed ✓ ✓ ✓ Ease of use ✓ ✓ ✓ Quick Learning curve ✓ ✓ Data Analysis capability ✓ ✓ ✓ General-purpose ✓ ✓ ✓ ✓ Big Data support ✓ ✓ ✓ ✓ ✓ Interfacing with other languages ✓ ✓ ✓ Production-ready ✓ ✓ ✓ So...which language should you choose? To answer the question in short - it all depends on the use-case you want to develop. If your focus is hardcore data analysis which involves a lot of statistical computing, R would be your go-to language. On the other hand, if you want to develop streaming applications for your Big Data, Scala can be a preferable choice. If you wish to use Machine Learning to leverage your Big Data and build predictive models, Python will come to your rescue. Lastly, if you plan to build Big Data solutions using just the traditionally-available tools, Java is the language for you. You also have the option of combining the power of two languages to get a more efficient and powerful solution. For example, you can train your machine learning model in Python and deploy it on Spark in a distributed mode. Ultimately, it all depends on how efficiently your solution can function, and more importantly, how fast and accurate it is. Which language do you prefer for crunching your Big Data? Do let us know!

0
1
19384

article-image-create-strong-data-science-project-portfolio-lands-job

Aaron Lazar

13 Feb 2018

8 min read

How to create a strong data science project portfolio that lands you a job

Aaron Lazar

13 Feb 2018

8 min read

Okay, you’re probably here because you’ve got just a few months to graduate and the projects section of your resume is blank. Or you’re just an inquisitive little nerd scraping the WWW for ways to crack that dream job. Either way, you’re not alone and there are ten thousand others trying to build a great Data Science portfolio to land them a good job. Look no further, we’ll try our best to help you on how to make a portfolio that catches the recruiter’s eye! David “Trent” Salazar‘s portfolio is a great example of a wholesome one and Sajal Sharma’s, is a good example of how one can display their Data Science Portfolios on a platform like Github. Companies are on the lookout for employees who can add value to the business. To showcase this on your resume effectively, the first step is to understand the different ways in which you can add value. 4 things you need to show in a data science portfolio Data science can be broken down into 4 broad areas: Obtaining insights from data and presenting them to the business leaders Designing an application that directly benefits the customer Designing an application or system that directly benefits other teams in the organisation Sharing expertise on data science with other teams You’ll need to ensure that your portfolio portrays all or at least most of the above, in order to easily make it through a job selection. So let’s see what we can do to make a great portfolio. Demonstrate that you know what you're doing So the idea is to show the recruiter that you’re capable of performing the critical aspects of Data Science, i.e. import a data set, clean the data, extract useful information from the data using various techniques, and finally visualise the findings and communicate them. Apart from the technical skills, there are a few soft skills that are expected as well. For instance, the ability to communicate and collaborate with others, the ability to reason and take the initiative when required. If your project is actually able to communicate these things, you’re in! Stay focused and be specific You might know a lot, but rather than throwing all your skills, projects and knowledge in the employer’s face, it’s always better to be focused on doing something and doing it right. Just as you’d do in your resume, keeping things short and sweet, you can implement this while building your portfolio too. Always remember, the interviewer is looking for specific skills. Research the data science job market Find 5-6 jobs, probably from Linkedin or Indeed, that interest you and go through their descriptions thoroughly. Understand what kind of skills the employer is looking for. For example, it could be classification, machine learning, statistical modeling or regression. Pick up the tools that are required for the job - for example, Python, R, TensorFlow, Hadoop, or whatever might get the job done. If you don’t know how to use that tool, you’ll want to skill-up as you work your way through the projects. Also, identify the kind of data that they would like you to be working on, like text or numerical, etc. Now, once you have this information at hand, start building your project around these skills and tools. Be a problem solver Working on projects that are not actual ‘problems’ that you’re solving, won’t stand out in your portfolio. The closer your projects are to the real-world, the easier it will be for the recruiter to make their decision to choose you. This will also showcase your analytical skills and how you’ve applied data science to solve a prevailing problem. Put at least 3 diverse projects in your data science portfolio A nice way to create a portfolio is to list 3 good projects that are diverse in nature. Here are some interesting projects to get you started on your portfolio: Data Cleaning and wrangling Data Cleaning is one of the most critical tasks that a data scientist performs. By taking a group of diverse data sets, consolidating and making sense of them, you’re giving the recruiter confidence that you know how to prep them for analysis. For example, you can take Twitter or Whatsapp data and clean it for analysis. The process is pretty simple; you first find a “dirty” data set, then spot an interesting angle to approach the data from, clean it up and perform analysis on it, and finally present your findings. Data storytelling Storytelling showcases not only your ability to draw insight from raw data, but it also reveals how well you’re able to convey the insights to others and persuade them. For example, you can use data from the bus system in your country and gather insights to identify which stops incur the most delays. This could be fixed by changing their route. Make sure your analysis is descriptive and your code and logic can be followed. Here’s what you do; first you find a good dataset, then you explore the data and spot correlations in the data. Then you visualize it before you start writing up your narrative. Tackle the data from various angles and pick up the most interesting one. If it’s interesting to you, it will most probably be interesting to anyone else who’s reviewing it. Break down and explain each step in detail, each code snippet, as if you were describing it to a friend. The idea is to teach the reviewer something new as you run through the analysis. End to end data science If you’re more into Machine Learning, or algorithm writing, you should do an end-to-end data science project. The project should be capable of taking in data, processing it and finally learning from it, every step of the way. For example, you can pick up fuel pricing data for your city or maybe stock market data. The data needs to be dynamic and updated regularly. The trick for this one is to keep the code simple so that it’s easy to set up and run. You first need to identify a good topic. Understand here that we will not be working with a single dataset, rather you will need to import and parse all the data and bring it under a single dataset yourself. Next, get the training and test data ready to make predictions. Document your code and other findings and you’re good to go. Prove you have the data science skill set If you want to get that job, you’ve got to have the appropriate tools to get the job done. Here’s a list of some of the most popular tools with a link to the right material for you to skill-up: Data science languages There's a number of key languages in data science that are essential. It might seem obvious, but making sure they're on your resume and demonstrated in your portfolio is incredibly important. Include things like: Python R Java Scala SQL Big Data tools If you're applying for big data roles, demonstrating your experience with the key technologies is a must. It not only proves you have the skills, but also shows that you have an awareness of what tools can be used to build a big data solution or project. You'll need: Hadoop, Spark Hive Machine learning frameworks With machine learning so in demand, if you can prove you've used a number of machine learning frameworks, you've already done a lot to impress. Remember, many organizations won't actually know as much about machine learning as you think. In fact, they might even be hiring you with a view to building out this capability. Remember to include: TensorFlow Caffe2 Keras PyTorch Data visualisation tools Data visualization is a crucial component of any data science project. If you can visualize and communicate data effectively, you're immediately demonstrating you're able to collaborate with others and make your insights accessible and useful to the wider business. Include tools like these in your resume and portfolio: D3.js Excel chart Tableau ggplot2 So there you have it. You know what to do to build a decent data science portfolio. It’s really worth attending competitions and challenges. It will not only help you keep up to data and well oiled with your skills, but also give you a broader picture of what people are actually working on and with what tools they’re able to solve problems.

0
2
13201

article-image-15-useful-python-libraries-to-make-your-data-science-tasks-easier

Amey Varangaonkar

12 Feb 2018

10 min read

15 Useful Python Libraries to make your Data Science tasks Easier

Amey Varangaonkar

12 Feb 2018

10 min read

0
0
11745

article-image-10-to-dos-for-industrial-internet-architects

Aaron Lazar

24 Jan 2018

4 min read

10 To-dos for Industrial Internet Architects

Aaron Lazar

24 Jan 2018

4 min read

[box type="note" align="" class="" width=""]This is a guest post by Robert Stackowiak, a technology business strategist at the Microsoft Technology Center. Robert has co-authored the book Architecting the Industrial Internet with Shyam Nath who is the director of technology integrations for Industrial IoT at GE Digital. You may also check out our interview with Shyam for expert insights into the world of IIoT, Big Data, Artificial Intelligence and more.[/box] Just about every day, one can pick up a technology journal or view an on-line technology article about what is new in the Industrial Internet of Things (IIoT). These articles usually provide insight into IIoT solutions to business problems or how a specific technology component is evolving to provide a function that is needed. Various industry consortia, such as the Industrial Internet Consortium (IIC), provides extremely useful documentation in defining key aspects of the IIoT architecture that the architect must consider. These broad reference architecture patterns have also begun to consistently include specific technologies and common components. The authors of the title Architecting the Industrial Internet felt the time was right for a practical guide for architects.The book provides guidance on how to define and apply an IIoT architecture in a typical project today by describing architecture patterns. In this article, we explore ten to-dos for Industrial Internet Architects designing these solutions. Just as technology components are showing up in common architecture patterns, their justification and use cases are also being discovered through repeatable processes. The sponsorship and requirements for these projects are almost always driven by leaders in the line of business in a company. Techniques for uncovering these projects can be replicated as architects gain needed discovery skills. Industrial Internet Architects To-dos: Understand IIoT: Architects first will seek to gain an understanding of what is different about the Industrial Internet, the evolution to specific IIoT solutions, and how legacy technology footprints might fit into that architecture. Understand IIoT project scope and requirements: They next research guidance from industry consortia and gather functional viewpoints. This helps to better understand the requirements their architecture must deliver solutions to, and the scope of effort they will face. Act as a bridge between business and technical requirements: They quickly come to realize that since successful projects are driven by responding to business requirements, the architect must bridge the line of business and IT divide present in many companies. They are always on the lookout for requirements and means to justify these projects. Narrow down viable IIoT solutions: Once the requirements are gathered and a potential project appears to be justifiable, requirements and functional viewpoints are aligned in preparation for defining a solution. Evaluate IIoT architectures and solution delivery models: Time to market of a proposed Industrial Internet solution is often critical to business sponsors. Most architecture evaluations include consideration of applications or pseudo-applications that can be modified to deliver the needed solution in a timely manner. Have a good grasp of IIoT analytics: Intelligence delivered by these solutions is usually linked to the timely analysis of data streams and care is taken in defining Lambda architectures (or Lambda variations) including machine learning and data management components and where analysis and response must occur. Evaluate deployment options: Technology deployment options are explored including the capabilities of proposed devices, networks, and cloud or on-premises backend infrastructures. Assess IIoT Security considerations: Security is top of mind today and proper design includes not only securing the backend infrastructure, but also extends to securing networks and the edge devices themselves. Conform to Governance and compliance policies: The viability of the Industrial Internet solution can be determined by whether proper governance is put into place and whether compliance standards can be met. Keep up with the IIoT landscape: While relying on current best practices, the architect must keep an eye on the future evaluating emerging architecture patterns and solutions. [author title="Author’s Bio" image="http://"]Robert Stackowiak is a technology business strategist at the Microsoft Technology Center in Chicago where he gathers business and technical requirements during client briefings and defines Internet of Things and analytics architecture solutions, including those that reside in the Microsoft Azure cloud. He joined Microsoft in 2016 after a 20-year stint at Oracle where he was Executive Director of Big Data in North America. Robert has spoken at industry conferences around the world and co-authored many books on analytics and data management including Big Data and the Internet of Things: Enterprise Architecture for A New Age, published by Apress, five editions of Oracle Essentials, published by O'Reilly Media, Oracle Big Data Handbook, published by Oracle Press, Achieving Extreme Performance with Oracle Exadata, published by Oracle Press, and Oracle Data Warehousing and Business Intelligence Solutions, published by Wiley. You can follow him on Twitter at @rstackow. [/author]

0
0
2528

article-image-15-ways-to-make-blockchains-scalable-secure-and-safe

Aaron Lazar

27 Dec 2017

14 min read

15 ways to make Blockchains scalable, secure and safe!

Aaron Lazar

27 Dec 2017

14 min read

[box type="note" align="" class="" width=""]This article is a book extract from Mastering Blockchain, written by Imran Bashir. In the book he has discussed distributed ledgers, decentralization and smart contracts for implementation in the real world.[/box] Today we will look at various challenges that need to be addressed before blockchain becomes a mainstream technology. Even though various use cases and proof of concept systems have been developed and the technology works well for most scenarios, some fundamental limitations continue to act as barriers to mainstream adoption of blockchains in key industries. We’ll start by listing down the main challenges that we face while working on Blockchain and propose ways to over each of those challenges: Scalability This problem has been a focus of intense debate, rigorous research, and media attention for the last few years. This is the single most important problem that could mean the difference between wider adaptability of blockchains or limited private use only by consortiums. As a result of substantial research in this area, many solutions have been proposed, which are discussed here: Block size increase: This is the most debated proposal for increasing blockchain performance (transaction processing throughput). Currently, bitcoin can process only about three to seven transactions per second, which is a major inhibiting factor in adapting the bitcoin blockchain for processing micro-transactions. Block size in bitcoin is hardcoded to be 1 MB, but if block size is increased, it can hold more transactions and can result in faster confirmation time. There are several Bitcoin Improvement Proposals (BIPs) made in favor of block size increase. These include BIP 100, BIP 101, BIP 102, BIP 103, and BIP 109. In Ethereum, the block size is not limited by hardcoding; instead, it is controlled by gas limit. In theory, there is no limit on the size of a block in Ethereum because it's dependent on the amount of gas, which can increase over time. This is possible because miners are allowed to increase the gas limit for subsequent blocks if the limit has been reached in the previous block. Block interval reduction: Another proposal is to reduce the time between each block generation. The time between blocks can be decreased to achieve faster finalization of blocks but may result in less security due to the increased number of forks. Ethereum has achieved a block time of approximately 14 seconds and, at times, it can increase. This is a significant improvement from the bitcoin blockchain, which takes 10 minutes to generate a new block. In Ethereum, the issue of high orphaned blocks resulting from smaller times between blocks is mitigated by using the Greedy Heaviest Observed Subtree (GHOST) protocol whereby orphaned blocks (uncles) are also included in determining the valid chain. Once Ethereum moves to Proof of Stake, this will become irrelevant as no mining will be required and almost immediate finality of transactions can be achieved. Invertible Bloom lookup tables: This is another approach that has been proposed to reduce the amount of data required to be transferred between bitcoin nodes. Invertible Bloom lookup tables (IBLTs) were originally proposed by Gavin Andresen, and the key attraction in this approach is that it does not result in a hard fork of bitcoin if implemented. The key idea is based on the fact that there is no need to transfer all transactions between nodes; instead, only those that are not already available in the transaction pool of the synching node are transferred. This allows quicker transaction pool synchronization between nodes, thus increasing the overall scalability and speed of the bitcoin network. Sharding: Sharding is not a new technique and has been used in distributed databases for scalability such as MongoDB and MySQL. The key idea behind sharding is to split up the tasks into multiple chunks that are then processed by multiple nodes. This results in improved throughput and reduced storage requirements. In blockchains, a similar scheme is employed whereby the state of the network is partitioned into multiple shards. The state usually includes balances, code, nonce, and storage. Shards are loosely coupled partitions of a blockchain that run on the same network. There are a few challenges related to inter-shard communication and consensus on the history of each shard. This is an open area for research. State channels: This is another approach proposed for speeding up the transaction on a blockchain network. The basic idea is to use side channels for state updating and processing transactions off the main chain; once the state is finalized, it is written back to the main chain, thus offloading the time-consuming operations from the main blockchain. State channels work by performing the following three steps: First, a part of the blockchain state is locked under a smart contract, ensuring the agreement and business logic between participants. Now off-chain transaction processing and interaction is started between the participants that update the state only between themselves for now. In this step, almost any number of transactions can be performed without requiring the blockchain and this is what makes the process fast and a best candidate for solving blockchain scalability issues. However, it could be argued that this is not a real on-blockchain solution such as, for example, sharding, but the end result is a faster, lighter, and robust network which can prove very useful in micropayment networks, IoT networks, and many other applications. Once the final state is achieved, the state channel is closed and the final state is written back to the main blockchain. At this stage, the locked part of the blockchain is also unlocked. This technique has been used in the bitcoin lightning network and Ethereum's Raiden. 6. Subchains: This is a relatively new technique recently proposed by Peter R. Rizun which is based on the idea of weak blocks that are created in layers until a strong block is found. Weak blocks can be defined as those blocks that have not been able to be mined by meeting the standard network difficulty criteria but have done enough work to meet another weaker difficulty target. Miners can build sub-chains by layering weak blocks on top of each other, unless a block is found that meets the standard difficulty target. At this point, the subchain is closed and becomes the strong block. Advantages of this approach include reduced waiting time for the first verification of a transaction. This technique also results in a reduced chance of orphaning blocks and speeds up transaction processing. This is also an indirect way of addressing the scalability issue. Subchains do not require any soft fork or hard fork to implement but need acceptance by the community. 7. Tree chains: There are also other proposals to increase bitcoin scalability, such as tree chains that change the blockchain layout from a linearly sequential model to a tree. This tree is basically a binary tree which descends from the main bitcoin chain. This approach is similar to sidechain implementation, eliminating the need for major protocol change or block size increase. It allows improved transaction throughput. In this scheme, the blockchains themselves are fragmented and distributed across the network in order to achieve scalability. Moreover, mining is not required to validate the blocks on the tree chains; instead, users can independently verify the block header. However, this idea is not ready for production yet and further research is required in order to make it practical. Privacy Privacy of transactions is a much desired property of blockchains. However, due to its very nature, especially in public blockchains, everything is transparent, thus inhibiting its usage in various industries where privacy is of paramount importance, such as finance, health, and many others. There are different proposals made to address the privacy issue and some progress has already been made. Several techniques, such as indistinguishability obfuscation, usage of homomorphic encryption, zero knowledge proofs, and ring signatures. All these techniques have their merits and demerits and are discussed in the following sections. Indistinguishability obfuscation: This cryptographic technique may serve as a silver bullet to all privacy and confidentiality issues in blockchains but the technology is not yet ready for production deployments. Indistinguishability obfuscation (IO) allows for code obfuscation, which is a very ripe research topic in cryptography and, if applied to blockchains, can serve as an unbreakable obfuscation mechanism that will turn smart contracts into a black box. The key idea behind IO is what's called by researchers a multilinear jigsaw puzzle, which basically obfuscates program code by mixing it with random elements, and if the program is run as intended, it will produce expected output but any other way of executing would render the program look random and garbage. This idea was first proposed by Sahai and others in their research paper Candidate Indistinguishability Obfuscation and Functional Encryption for All Circuits. Homomorphic encryption: This type of encryption allows operations to be performed on encrypted data. Imagine a scenario where the data is sent to a cloud server for processing. The server processes it and returns the output without knowing anything about the data that it has processed. This is also an area ripe for research and fully homomorphic encryption that allows all operations on encrypted data is still not fully deployable in production; however, major progress in this field has already been made. Once implemented on blockchains, it can allow processing on cipher text which will allow privacy and confidentiality of transactions inherently. For example, the data stored on the blockchain can be encrypted using homomorphic encryption and computations can be performed on that data without the need for decryption, thus providing privacy service on blockchains. This concept has also been implemented in a project named Enigma by MIT's Media Lab. Enigma is a peer-to-peer network which allows multiple parties to perform computations on encrypted data without revealing anything about the data. Zero knowledge proofs: Zero knowledge proofs have recently been implemented in Zcash successfully, as seen in previous chapters. More specifically, SNARKs have been implemented in order to ensure privacy on the blockchain. The same idea can be implemented in Ethereum and other blockchains also. Integrating Zcash on Ethereum is already a very active research project being run by the Ethereum R&D team and the Zcash Company. State channels: Privacy using state channels is also possible, simply due to the fact that all transactions are run off-chain and the main blockchain does not see the transaction at all except the final state output, thus ensuring privacy and confidentiality. Secure multiparty computation: The concept of secure multiparty computation is not new and is based on the notion that data is split into multiple partitions between participating parties under a secret sharing mechanism which then does the actual processing on the data without the need of the reconstructing data on single machine. The output produced after processing is also shared between the parties. MimbleWimble: The MimbleWimble scheme was proposed somewhat mysteriously on the bitcoin IRC channel and since then has gained a lot of popularity. MimbleWimble extends the idea of confidential transactions and Coinjoin, which allows aggregation of transactions without requiring any interactivity. However, it does not support the use of bitcoin scripting language along with various other features of standard Bitcoin protocol. This makes it incompatible with existing Bitcoin protocol. Therefore, it can either be implemented as a sidechain to bitcoin or on its own as an alternative cryptocurrency. This scheme can address privacy and scalability issues both at once. The blocks created using the MimbleWimble technique do not contain transactions as in traditional bitcoin blockchains; instead, these blocks are composed of three lists: an input list, output list, and something called excesses which are lists of signatures and differences between outputs and inputs. The input list is basically references to the old outputs, and the output list contains confidential transactions outputs. These blocks are verifiable by nodes by using signatures, inputs, and outputs to ensure the legitimacy of the block. In contrast to bitcoin, MimbleWimble transaction outputs only contain pubkeys, and the difference between old and new outputs is signed by all participants involved in the transactions. Coinjoin: Coinjoin is a technique which is used to anonymize the bitcoin transactions by mixing them interactively. The idea is based on forming a single transaction from multiple entities without causing any change in inputs and outputs. It removes the direct link between senders and receivers, which means that a single address can no longer be associated with transactions, which could lead to identification of the users. Coinjoin needs cooperation between multiple parties that are willing to create a single transaction by mixing payments. Therefore, it should be noted that, if any single participant in the Coinjoin scheme does not keep up with the commitment made to cooperate for creating a single transaction by not signing the transactions as required, then it can result in a denial of service attack. In this protocol, there is no need for a single trusted third party. This concept is different from mixing a service which acts as a trusted third party or intermediary between the bitcoin users and allows shuffling of transactions. This shuffling of transactions results in the prevention of tracing and the linking of payments to a particular user. Security Even though blockchains are generally secure and make use of asymmetric and symmetric cryptography as required throughout the blockchain network, there still are few caveats that can result in compromising the security of the blockchain. There are two types of Contract Verification that can solve the issue of Security. Why3 formal verification: Formal verification of solidity code is now available as a feature in the solidity browser. First the code is converted into Why3 language that the verifier can understand. In the example below, a simple solidity code that defines the variable z as maximum limit of uint is shown. When this code runs, it will result in returning 0, because uint z will overrun and start again from 0. This can also be verified using Why3, which is shown below: Once the solidity is compiled and available in the formal verification tab, it can be copied into the Why3 online IDE available at h t t p ://w h y 3. l r i . f r /t r y /. The example below shows that it successfully checks and reports integer overflow errors. This tool is under heavy development but is still quite useful. Also, this tool or any other similar tool is not a silver bullet. Even formal verification generally should not be considered a panacea because specifications in the first place should be defined appropriately: Oyente tool: Currently, Oyente is available as a Docker image for easy testing and installation. It is available at https://github.com/ethereum/oyente, and can be quickly downloaded and tested. In the example below, a simple contract taken from solidity documentation that contains a re-entrancy bug has been tested and it is shown that Oyente successfully analyzes the code and finds the bug: This sample code contains a re-entrancy bug which basically means that if a contract is interacting with another contract or transferring ether, it is effectively handing over the control to that other contract. This allows the called contract to call back into the function of the contract from which it has been called without waiting for completion. For example, this bug can allow calling back into the withdraw function shown in the preceding example again and again, resulting in getting Ethers multiple times. This is possible because the share value is not set to 0 until the end of the function, which means that any later invocations will be successful, resulting in withdrawing again and again. An example is shown of Oyente running to analyze the contract shown below and as can be seen in the following output, the analysis has successfully found the re-entrancy bug. The bug is proposed to be handled by a combination of the Checks-Effects-Interactions pattern described in the solidity documentation: We discussed the scalability, security, confidentiality, and privacy aspects of blockchain technology in this article. If you found it useful, go ahead and buy the book, Mastering Blockchain by Imran Bashir to become a professional Blockchain developer.

0
0
5259

article-image-how-do-data-structures-and-data-models-differ

Amey Varangaonkar

21 Dec 2017

7 min read

How do Data Structures and Data Models differ?

Amey Varangaonkar

21 Dec 2017

7 min read

0
0
8596

article-image-6-key-skills-data-scientist-role

Amey Varangaonkar

21 Dec 2017

6 min read

6 Key Areas to focus on while transitioning to a Data Scientist role

Amey Varangaonkar

21 Dec 2017

6 min read

[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Statistics for Data Science, authored by James D. Miller. The book dives into the different statistical approaches to discover hidden insights and patterns from different kinds of data.[/box] Being a data scientist is undoubtedly a very lucrative career prospect. In fact, it is one of the highest paying jobs in the world right now. That said, transitioning from a data developer role to a data scientist role needs careful planning and a clear understanding of what the role is all about. In this interesting article, the author highlights six key skills must give special attention to, during this transition. Let's start by taking a moment to state what I consider to be a few generally accepted facts about transitioning to a data scientist. We'll reaffirm these beliefs as we continue through this book: Academia: Data scientists are not all from one academic background. They are not all computer science or statistics/mathematics majors. They do not all possess an advanced degree (in fact, you can use statistics and data science with a bachelor's degree or even less). It's not magic-based: Data scientists can use machine learning and other accepted statistical methods to identify insights from data, not magic. They are not all tech or computer geeks: You don't need years of programming experience or expensive statistical software to be effective. You don't need to be experienced to get started. You can start today, right now. (Well, you already did when you bought this book!) Okay, having made the previous declarations, let's also be realistic. As always, there is an entry-point for everything in life, and, to give credit where it is due, the more credentials you can acquire to begin out with, the better off you will most likely be. Nonetheless, (as we'll see later in this chapter), there is absolutely no valid reason why you cannot begin understanding, using, and being productive with data science and statistics immediately. [box type="info" align="" class="" width=""]As with any profession, certifications and degrees carry the weight that may open the doors, while experience, as always, might be considered the best teacher. There are, however, no fake data scientists but only those with currently more desire than practical experience.[/box] If you are seriously interested in not only understanding statistics and data science but eventually working as a full-time data scientist, you should consider the following common themes (you're likely to find in job postings for data scientists) as areas to focus on: Education Common fields of study here are Mathematics and Statistics, followed by Computer Science and Engineering (also Economics and Operations research). Once more, there is no strict requirement to have an advanced or even related degree. In addition, typically, the idea of a degree or an equivalent experience will also apply here. Technology You will hear SAS and R (actually, you will hear quite a lot about R) as well as Python, Hadoop, and SQL mentioned as key or preferable for a data scientist to be comfortable with, but tools and technologies change all the time so, as mentioned several times throughout this chapter, data developers can begin to be productive as soon as they understand the objectives of data science and various statistical mythologies without having to learn a new tool or language. [box type="info" align="" class="" width=""]Basic business skills such as Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool are assumed pretty much everywhere and don't really count as an advantage, but experience with programming languages (such as Java, PERL, or C++) or databases (such as MySQL, NoSQL, Oracle, and so on.) does help![/box] Data The ability to understand data and deal with the challenges specific to the various types of data, such as unstructured, machine-generated, and big data (including organizing and structuring large datasets). [box type="info" align="" class="" width=""]Unstructured data is a key area of interest in statistics and for a data scientist. It is usually described as data having no redefined model defined for it or is not organized in a predefined manner. Unstructured information is characteristically text-heavy but may also contain dates, numbers, and various other facts as well.[/box] Intellectual Curiosity I love this. This is perhaps well defined as a character trait that comes in handy (if not required) if you want to be a data scientist. This means that you have a continuing need to know more than the basics or want to go beyond the common knowledge about a topic (you don't need a degree on the wall for this!) Business Acumen To be a data developer or a data scientist you need a deep understanding of the industry you're working in, and you also need to know what business problems your organization needs to unravel. In terms of data science, being able to discern which problems are the most important to solve is critical in addition to identifying new ways the business should be leveraging its data. Communication Skills All companies look for individuals who can clearly and fluently translate their findings to a non-technical team, such as the marketing or sales departments. As a data scientist, one must be able to enable the business to make decisions by arming them with quantified insights in addition to understanding the needs of their non-technical colleagues to add value and be successful. This article paints a much clearer picture on why soft skills play an important role in becoming a better data scientist. So why should you, a data developer, endeavor to think like (or more like) a data scientist? Specifically, what might be the advantages of thinking like a data scientist? The following are just a few notions supporting the effort: Developing a better approach to understanding data Using statistical thinking during the process of program or database designing Adding to your personal toolbox Increased marketability If you found this article to be useful, make sure you check out the book Statistics for Data Science, which includes a comprehensive list of tips and tricks to becoming a successful data scientist by mastering the basic and not-so-basic concepts of statistics.

0
0
4245

article-image-two-popular-data-analytics-methodologies-every-data-professional-should-know-tdsp-crisp-dm

Amarabha Banerjee

21 Dec 2017

7 min read

Two popular Data Analytics methodologies every data professional should know: TDSP & CRISP-DM

Amarabha Banerjee

21 Dec 2017

7 min read

0
0
16748

article-image-when-why-and-how-to-use-graph-analytics-for-your-big-data

Sunith Shetty

20 Dec 2017

10 min read

When, why and how to use Graph analytics for your big data

Sunith Shetty

20 Dec 2017

10 min read

0
0
12479

Tech Guides - Big Data

What does a data science team look like?

How to secure your crypto currency

Budget and Demand Forecasting using Markov model in SAS [Tutorial]

6 reasons to choose MySQL 8 for designing database solutions

Why is Hadoop dying?

GDPR is pushing the Chief Data Officer role center stage

Top 5 programming languages for crunching Big Data effectively

How to create a strong data science project portfolio that lands you a job

15 Useful Python Libraries to make your Data Science tasks Easier

10 To-dos for Industrial Internet Architects

Trending Topics

15 ways to make Blockchains scalable, secure and safe!

How do Data Structures and Data Models differ?

6 Key Areas to focus on while transitioning to a Data Scientist role

Two popular Data Analytics methodologies every data professional should know: TDSP & CRISP-DM

When, why and how to use Graph analytics for your big data