Tech Guides


04 Apr 2018

8 min read

Top 5 programming languages for crunching Big Data effectively


One of the most important decisions that Big Data professionals have to make, especially those who are new to the scene or just starting out, is choosing the best programming language for big data manipulation and analysis. Understanding the Big Data problem and framing the architecture to solve it is not quite enough these days - the execution needs to be perfect as well, and choosing the right language goes a long way.

The best languages for big data

In this article, we look at 5 of the most popular - not to mention highly effective - programming languages for developing Big Data solutions.

Scala

A beautiful crossover of the object-oriented and functional programming paradigms, Scala is fast and robust, and a popular choice of language for many Big Data professionals. The fact that two of the most popular Big Data processing frameworks, Apache Spark and Apache Kafka, have been built on top of Scala tells you everything you need to know about the power of Scala.

Scala runs on the JVM, which means code written in Scala can be easily used within a Java-based Big Data ecosystem. One significant factor that differentiates Scala from Java, though, is that Scala is a lot less verbose. You can often rewrite hundreds of lines of confusing-looking Java code in fewer than 15 lines of Scala. One negative aspect of Scala, though, is its steep learning curve when compared to languages like Go and Python, and this may put off beginners looking to use it.

Why use Scala for big data?

Fast and robust
Suitable for working with Big Data tools like Apache Spark for distributed Big Data processing
JVM compliant, can be used in a Java-based ecosystem

Python

Python has been declared one of the fastest growing programming languages in 2018 as per the recently held Stack Overflow Developer Survey. Its general-purpose nature means it can be used across a broad spectrum of use-cases, and Big Data programming is one major area of application.
Many libraries for data analysis and manipulation that are increasingly being used in a Big Data framework to clean and manipulate large chunks of data, such as pandas, NumPy and SciPy, are all Python-based. Not just that, most popular machine learning and deep learning frameworks, such as scikit-learn, TensorFlow and many more, also offer Python as their primary interface and are finding increasing application within the Big Data ecosystem.

One drawback of using Python, and a reason why it is not a first-class citizen when it comes to Big Data programming yet, is that it’s slow. Although Python is very easy to use, Big Data professionals have found systems built with languages such as Java or Scala faster and more robust than those built with Python.

However, Python makes up for this limitation with other qualities. As Python is primarily a scripting language, interactive coding and development of analytical solutions for Big Data becomes very easy. Python can integrate with existing Big Data frameworks such as Apache Hadoop and Apache Spark, allowing you to perform predictive analytics at scale.

Why use Python for big data?

General-purpose
Rich libraries for data analysis and machine learning
Easy to use
Supports iterative development
Rich integration with Big Data tools
Interactive computing through Jupyter notebooks

R

It won’t come as a surprise to many that those who love statistics love R. The ‘language of statistics’, as it is popularly called, R is used to build data models which can be used for effective and accurate data analysis.

Powered by a large repository of R packages (CRAN, the Comprehensive R Archive Network), R gives you just about every type of tool to accomplish any task in Big Data processing - right from analysis to data visualization. R can be integrated with Apache Hadoop and Apache Spark, among other popular frameworks, for Big Data processing and analytics.
One issue with using R as a programming language for Big Data is that it is not very general-purpose. This means code written in R is often not production-deployable and generally has to be translated to some other programming language such as Python or Java. That said, if your goal is only to build statistical models for Big Data analytics, R is an option you should definitely consider.

Why use R for big data?

Built for data science
Support for Hadoop and Spark
Strong statistical modeling and visualization capabilities
Support for Jupyter notebooks

Java

Then there’s always good old Java. Some of the traditional Big Data frameworks, such as Apache Hadoop and all the tools within its ecosystem, are Java-based and still in use today in many enterprises. Not to mention the fact that Java is the most stable and production-ready language among all the languages we have discussed so far!

Using Java to develop your Big Data applications gives you the ability to use a large ecosystem of tools and libraries for interoperability, monitoring and much more, most of which have already been tried and tested.

One major drawback of Java is its verbosity. The fact that you have to write hundreds of lines of code in Java for a task which can be written in barely 15-20 lines of code in Python or Scala can turn off many budding programmers. However, the introduction of lambda functions in Java 8 does make life quite a bit easier. Java also does not support iterative development, unlike newer languages like Python, and this is an area of focus for future Java releases.

Despite its flaws, Java remains a strong contender when it comes to the preferred language for Big Data programming because of its history and the continued reliance on the traditional Big Data tools and frameworks.

Why use Java for big data?
Traditional Big Data tools and frameworks are written in Java
Stable and production-ready
Large ecosystem of tried and tested tools and libraries

Go

Last but not least, there’s Go - one of the fastest rising programming languages in recent times. Go was designed by a group of Google engineers who were frustrated with C++, and we think it is a good shout in this list simply because it powers so many tools used in the Big Data infrastructure, including Kubernetes, Docker and many more.

Go is fast, easy to learn, and fairly easy to develop applications with, not to mention deploy them. More importantly, as businesses look at building data analysis systems that can operate at scale, Go-based systems are being used to integrate machine learning and parallel processing of data. It is also possible to interface other languages with Go-based systems with relative ease.

Why use Go for big data?

Fast, easy to use
Many tools used in the Big Data infrastructure are Go-based
Efficient distributed computing

There are a few other languages you might want to consider - Julia, SAS and MATLAB being some major ones which are useful in their own right. However, when compared to the languages we talked about above, we thought they fell a bit short in some aspects - be it speed, efficiency, ease of use, documentation, or community support, among other things.

Let’s take a quick look at the comparison table of all the languages we discussed above. Note that we have used the ✓ symbol for the best possible language/s to help you make an informed decision. This is just our view, and that’s not to say that the other languages are any worse!

Scala Python R Java Go Speed ✓ ✓ ✓ Ease of use ✓ ✓ ✓ Quick Learning curve ✓ ✓ Data Analysis capability ✓ ✓ ✓ General-purpose ✓ ✓ ✓ ✓ Big Data support ✓ ✓ ✓ ✓ ✓ Interfacing with other languages ✓ ✓ ✓ Production-ready ✓ ✓ ✓

So...which language should you choose?
To answer the question in short - it all depends on the use-case you want to develop. If your focus is hardcore data analysis which involves a lot of statistical computing, R would be your go-to language. On the other hand, if you want to develop streaming applications for your Big Data, Scala can be a preferable choice. If you wish to use Machine Learning to leverage your Big Data and build predictive models, Python will come to your rescue. Lastly, if you plan to build Big Data solutions using just the traditionally-available tools, Java is the language for you. You also have the option of combining the power of two languages to get a more efficient and powerful solution. For example, you can train your machine learning model in Python and deploy it on Spark in a distributed mode. Ultimately, it all depends on how efficiently your solution can function, and more importantly, how fast and accurate it is. Which language do you prefer for crunching your Big Data? Do let us know!


Richard Gall

28 Jul 2015

5 min read

DevOps engineering and full-stack development – 2 sides of the same agile coin


Two of the most talked-about and on-trend roles in tech dominated our Skill Up survey – DevOps engineers and Full-Stack developers. Even before we started exploring our data, we knew that both would feature heavily. Given the amount of time spent online arguing about DevOps and the merits and drawbacks of full-stack development, it’s interesting to see exactly what it means to be a DevOps engineer or full-stack developer. From salary to tool use, both our Web Development and SysAdmin and Security Salary and Skills Reports offer an insight into the professional lives of people actually performing these roles every day. The similarities between DevOps engineering and full-stack development The similarities between the two roles are striking. Both DevOps engineering and full-stack development are having a considerable impact on the way in which technology is used and understood within organizations and businesses – which makes them particularly valuable. In SMEs, for example, DevOps engineers command almost the same amount of money as in Enterprise. Considering the current economic climate, it’s a clear signal of the value of DevOps practices in environments where flexibility and the ability to adapt to changing demands and expectations are crucial to growth. Full-stack developers also command the highest salaries in many industries. In consultancy, for example, full-stack developers earn significantly more than any other web development role. While this could suggest that organizations aren’t yet willing to invest in (or simply don’t need) in-house full-stack developers, it highlights that they are nevertheless willing to spend money on individuals with full-stack knowledge, who are capable of delivering cutting-edge insight. However, just as we saw Cloud consultancies dominate the tech consultancy market a few years ago, over time it’s likely that full-stack development will become more and more established as a standard. 
DevOps engineers and full-stack developers share the same philosophical germ. They are symptoms of a growing business demand for greater agility and flexibility, and hint at a trend towards greater generalization in the skillset of technical professionals. part of the thrill of #devops to me is how there's no true agreement about what it is. it's like watching LOST all over again — jon devops hendren (@devops) May 18, 2015 Full-stack developers are using DevOps tools I’ve always seen them as manifestations of similar ideas in different technical areas. However, when you look at the data we’ve collected in our survey, alongside some wider research, the relationship between the DevOps engineer and the Full-Stack developer might possibly be more than purely conceptual. ‘Full-Stack’ and ‘DevOps’ are both terms that blur the lines between developer and engineer, and both are two sides of an intriguing form of cross-pollination; technologies more commonly used for deployment and automation. Docker and Vagrant were the most notable, highlighting the impact of containerization and virtualization on web development, but we also found a number of references to the Microsoft automation tool PowerShell – a distinctly DevOps-esque tool if ever there was one. Perhaps there’s a danger of overstating my point – surely we shouldn’t be surprised if web developers are using these tools – it’s not that strange, right? Maybe, but the fact that tools such as these are being used by web developers in their day-to-day work suggests that they are no longer simply expected to develop: they also need to deploy and configure their projects. Indeed, it’s worth noting that across all our web development respondents, a large number plan on learning Docker over the next 12 months. DevOps engineers use a huge range of tools DevOps Engineers were even more eclectic in their tool-usage than full-stack developers. 
Python is the language of-choice and Puppet the go-to configuration management tool, but web tools such as JavaScript and PHP are also being used. References to Flask, for example, the Python microframework, emphasise the way in which DevOps Engineers have an eye on web development while they’re automating your infrastructure. Taken alone, these responses might not fully evidence the relationship between DevOps engineers and Full-Stack developers. However, there are jobs out there asking for a combination of both skillsets. One, posted by a recruiter working for a nameless ‘creative media house’ in London, was looking for someone to become ‘a key member of multi-party cloud research projects, helping to bring a microservices-based video automation system to life, integrate development and developed systems into onside and global infrastructure’. The tools being asked for were very varied indeed. From a high-level language, such as JavaScript, to scripting languages such as Bash, Python and Perl, to continuous integration tools, configuration management tools and containerization technologies, whoever eventually gets the job certainly deserves to be called a polyglot. Blurring the line between full-stack and DevOps A further indication of the blurred line between engineers and developers can be found in this article from computing.co.uk. It’s an interesting tale of how working practices develop according to necessity and how methodologies and ideas interact with the practical details of a given situation. It tells the story of how the Washington Post went about building its submission platform, and how the way in which the project was resourced and managed changed according to certain pressures – internal and external. The title might actually be misleading – if you read it, it’s not so much that DevOps necessitates full-stack development, more that each thing grows out of the next. 
It might even be said that the reverse is true – that full-stack development necessitates DevOps thinking. The relationship between DevOps and full-stack development gives a real indication of the state of the tech world in 2015. Within a tech landscape of increasing complexity and cross-pollination there are going to be opportunities for developers and engineers to significantly drive their value as technical professionals. It’s simply a question of learning more, and of being open to new challenges and ideas about how to work effectively. It probably won’t be easy, but it might just be a fun journey.


Bhagyashree R

25 Nov 2018

5 min read

5 types of deep transfer learning


Transfer learning is a method of reusing a model or knowledge for another related task. Transfer learning is sometimes also considered as an extension of existing ML algorithms. Extensive research and work is being done in the context of transfer learning and on understanding how knowledge can be transferred among tasks. However, the Neural Information Processing Systems (NIPS) 1995 workshop Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems is believed to have provided the initial motivations for research in this field. The literature on transfer learning has gone through a lot of iterations, and the terms associated with it have been used loosely and often interchangeably. Hence, it is sometimes confusing to differentiate between transfer learning, domain adaptation, and multitask learning. Rest assured, these are all related and try to solve similar problems. In this article, we will look into the five types of deep transfer learning to get more clarity on how these differ from each other. [box type="shadow" align="" class="" width=""]This article is an excerpt from a book written by Dipanjan Sarkar, Raghav Bali, and Tamoghna Ghosh titled Hands-On Transfer Learning with Python. This book covers deep learning and transfer learning in detail. It also focuses on real-world examples and research problems using TensorFlow, Keras, and the Python ecosystem with hands-on examples.[/box] #1 Domain adaptation Domain adaptation is usually referred to in scenarios where the marginal probabilities between the source and target domains are different, such as P (Xs) ≠ P (Xt). There is an inherent shift or drift in the data distribution of the source and target domains that requires tweaks to transfer the learning. For instance, a corpus of movie reviews labeled as positive or negative would be different from a corpus of product-review sentiments. 
A classifier trained on movie-review sentiment would see a different distribution if utilized to classify product reviews. Thus, domain adaptation techniques are utilized in transfer learning in these scenarios. #2 Domain confusion Different layers in a deep learning network capture different sets of features. We can utilize this fact to learn domain-invariant features and improve their transferability across domains. Instead of allowing the model to learn any representation, we nudge the representations of both domains to be as similar as possible. This can be achieved by applying certain preprocessing steps directly to the representations themselves. Some of these have been discussed by Baochen Sun, Jiashi Feng, and Kate Saenko in their paper Return of Frustratingly Easy Domain Adaptation. This nudge toward the similarity of representation has also been presented by Ganin et. al. in their paper, Domain-Adversarial Training of Neural Networks. The basic idea behind this technique is to add another objective to the source model to encourage similarity by confusing the domain itself, hence domain confusion. #3 Multitask learning Multitask learning is a slightly different flavor of the transfer learning world. In the case of multitask learning, several tasks are learned simultaneously without distinction between the source and targets. In this case, the learner receives information about multiple tasks at once, as compared to transfer learning, where the learner initially has no idea about the target task. This is depicted in the following diagram: Multitask learning: Learner receives information from all tasks simultaneously #4 One-shot learning Deep learning systems are data hungry by nature, such that they need many training examples to learn the weights. This is one of the limiting aspects of deep neural networks, though such is not the case with human learning. 
For instance, once a child is shown what an apple looks like, they can easily identify a different variety of apple (with one or a few training examples); this is not the case with ML and deep learning algorithms. One-shot learning is a variant of transfer learning where we try to infer the required output based on just one or a few training examples. This is especially helpful in real-world scenarios where it is not possible to have labeled data for every possible class (if it is a classification task) and in scenarios where new classes can be added often.

The landmark paper by Fei-Fei and their co-authors, One Shot Learning of Object Categories, is believed to have coined the term one-shot learning and started research in this subfield. This paper presented a variation on a Bayesian framework for representation learning for object categorization. This approach has since been improved upon, and applied using deep learning systems.

#5 Zero-shot learning

Zero-shot learning is another extreme variant of transfer learning, which relies on no labeled examples to learn a task. This might sound unbelievable, especially when learning from examples is what most supervised learning algorithms are about. Zero-data learning, or zero-shot learning, methods make clever adjustments during the training stage itself to exploit additional information to understand unseen data.

In their book on Deep Learning, Goodfellow and their co-authors present zero-shot learning as a scenario where three variables are learned: the traditional input variable, x, the traditional output variable, y, and the additional random variable that describes the task, T. The model is thus trained to learn the conditional probability distribution P(y | x, T). Zero-shot learning comes in handy in scenarios such as machine translation, where we may not even have labels in the target language.
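To make the one-shot setting from #4 concrete, here is a minimal sketch in plain Python. It is not the Bayesian method from the Fei-Fei paper; it swaps in a simple nearest-neighbour rule over feature vectors, which are assumed to come from a network pretrained on a related task (the "transfer" part). The feature values, the class names, and the `one_shot_classify` helper are all hypothetical and exist only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_shot_classify(query, support):
    """Return the label whose single support example is closest to the query.

    `support` maps each class label to exactly one feature vector,
    i.e. one labeled example per class.
    """
    return min(support, key=lambda label: euclidean(query, support[label]))


# One (made-up) feature vector per class, standing in for embeddings
# produced by a pretrained network.
support_set = {
    "apple": [0.9, 0.1, 0.8],
    "banana": [0.2, 0.9, 0.3],
}

prediction = one_shot_classify([0.8, 0.2, 0.7], support_set)
print(prediction)

# A new class can be added with a single example and no retraining.
support_set["cherry"] = [0.9, 0.1, 0.1]
new_prediction = one_shot_classify([0.85, 0.15, 0.2], support_set)
print(new_prediction)
```

The point of the sketch is the shape of the problem: each class is represented by a single labeled example, and adding a class means adding one entry to the support set. How good the predictions are depends almost entirely on the quality of the transferred features.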
In this article we learned about the five types of deep transfer learning: domain adaptation, domain confusion, multitask learning, one-shot learning, and zero-shot learning. If you found this post useful, do check out the book, Hands-On Transfer Learning with Python, which covers deep learning and transfer learning in detail. It also focuses on real-world examples and research problems using TensorFlow, Keras, and the Python ecosystem with hands-on examples.


Prasad Ramesh

29 Aug 2018

8 min read

NVIDIA leads the AI hardware race. But which of its GPUs should you use for deep learning?


For readers who are new to deep learning and who might be wondering what a GPU is, let’s start there. To make it simple, consider deep learning as nothing more than a set of calculations - complex calculations, yes, but calculations nonetheless. To run these calculations, you need hardware. Ordinarily, you might just use a normal processor like the CPU inside your laptop. However, this isn’t powerful enough to process at the speed at which deep learning computations need to happen. GPUs, however, can. This is because while a conventional CPU has only a few complex cores, a GPU can have thousands of simple cores. With a GPU, training a deep learning data set can take just hours instead of days. However, although it’s clear that GPUs have significant advantages over CPUs, there is nevertheless a range of GPUs available, each having their own individual differences. Selecting one is ultimately a matter of knowing what your needs are. Let’s dig deeper and find out how to go about shopping for GPUs… What to look for before choosing a GPU? There are a few specifications to consider before picking a GPU. Memory bandwidth: This determines the capacity of a GPU to handle large amounts of data. It is the most important performance metric, as with faster memory bandwidth more data can be processed at higher speeds. Number of cores: This indicates how fast a GPU can process data. A large number of CUDA cores can handle large datasets well. CUDA cores are parallel processors similar to cores in a CPU but their number is in thousands and are not suited for complex calculations that a CPU core can perform. Memory size: For computer vision projects, it is crucial for memory size to be as much as you can afford. But with natural language processing, memory size does not play such an important role. Our pick of GPU devices to choose from The go to choice here is NVIDIA; they have standard libraries that make it simple to set things up. 
Other graphics cards are not very friendly in terms of the libraries supported for deep learning. The NVIDIA CUDA Deep Neural Network library (cuDNN) also has a good development community.

“Is NVIDIA Unstoppable In AI?” -Forbes
“Nvidia beats forecasts as sales of graphics chips for AI keep booming” -SiliconANGLE

AMD GPUs are powerful too but lack library support to get things running smoothly. It would be really nice to see some AMD libraries being developed to break the monopoly and give more options to consumers.

NVIDIA RTX 2080 Ti: The RTX line of GPUs is due to be released in September 2018. The RTX 2080 Ti will be twice as fast as the 1080 Ti. The price listed on the NVIDIA website for the founder’s edition is $1,199.
VRAM: 11 GB
Memory bandwidth: 616 GB/s
Cores: 4352 cores @ 1545 MHz

NVIDIA RTX 2080: This is more cost efficient than the 2080 Ti at a listed price of $799 on the NVIDIA website for the founder’s edition.
VRAM: 8 GB
Memory bandwidth: 448 GB/s
Cores: 2944 cores @ 1710 MHz

NVIDIA RTX 2070: This is more cost efficient than the 2080 Ti at a listed price of $599 on the NVIDIA website. Note that the other versions of the RTX cards will likely be cheaper than the founder’s edition by around $100.
VRAM: 8 GB
Memory bandwidth: 448 GB/s
Cores: 2304 cores @ 1620 MHz

NVIDIA GTX 1080 Ti: Priced at $650 on Amazon. This is a higher end option but offers great value for money, and can also do well in Kaggle competitions. If you need more memory but cannot afford the RTX 2080 Ti, go for this.
VRAM: 11 GB
Memory bandwidth: 484 GB/s
Cores: 3584 cores @ 1582 MHz

NVIDIA GTX 1080: Priced at $584 on Amazon. This is a mid-high end option only slightly behind the 1080 Ti.
VRAM: 8 GB
Memory bandwidth: 320 GB/s
Cores: 2560 cores @ 1733 MHz

NVIDIA GTX 1070 Ti: Priced at around $450 on Amazon. This is slightly less performant than the GTX 1080 but $100 cheaper.
VRAM: 8 GB
Memory bandwidth: 256 GB/s
Cores: 2432 cores @ 1683 MHz

NVIDIA GTX 1070: Priced at $380 on Amazon, this is currently the bestseller because of crypto miners. Somewhat slower than the 1080 GPUs but cheaper.
VRAM: 8 GB
Memory bandwidth: 256 GB/s
Cores: 1920 cores @ 1683 MHz

NVIDIA GTX 1060 6GB: Priced at around $290 on Amazon. Pretty cheap but the 6 GB VRAM limits you. Should be good for NLP but you’ll find the performance lacking in computer vision.
VRAM: 6 GB
Memory bandwidth: 216 GB/s
Cores: 1280 cores @ 1708 MHz

NVIDIA GTX 1050 Ti: Priced at around $200 on Amazon. This is the cheapest workable option. Good to get started with deep learning and explore if you’re new.
VRAM: 4 GB
Memory bandwidth: 112 GB/s
Cores: 768 cores @ 1392 MHz

NVIDIA Titan XP: The Titan XP is also an option but gives only marginally better performance while being almost twice as expensive as the GTX 1080 Ti. It has 12 GB memory, 547.7 GB/s bandwidth and 3840 cores @ 1582 MHz.

On a side note, NVIDIA Quadro GPUs are pretty expensive and don’t really help in deep learning; they are of more use in CAD and heavy graphics production tasks.

The graph below does a pretty good job of visualizing how all the GPUs above compare:

Source: Slav Ivanov Blog, processing power is calculated as CUDA cores times the clock frequency

Does the number of GPUs matter?

Yes, it does. But how many do you really need? What’s going to suit the scale of your project without breaking your budget? Two GPUs will generally yield better results than just one - but it’s only really worth it if you need the extra power.

There are two options you can take with multi-GPU deep learning. On the one hand, you can train several different models at once across your GPUs, or, alternatively, distribute the training of a single model across multiple GPUs, known as “multi-GPU training”. The latter approach is compatible with TensorFlow, CNTK, and PyTorch.
Both of these approaches have advantages. Ultimately, it depends on how many projects you’re working on and, again, what your needs are. Another important point to bear in mind is that if you’re using multiple GPUs, the processor and hard disk need to be fast enough to feed data continuously - otherwise the multi-GPU approach is pointless. Source: NVIDIA website It boils down to your needs and budget, GPUs aren’t exactly cheap. Other heavy devices There are also other large machines apart from GPUs. These include the specialized supercomputer from NVIDIA, the DGX-2, and Tensor processing units (TPUs) from Google. The NVIDIA DGX-2 If you thought GPUs were expensive, let me introduce you to NVIDIA DGX-2, the successor to the NVIDIA DGX-1. It’s a highly specialized workstation; consider it a supercomputer that has been specially designed to tackle deep learning. The price of the DGX-2 is (*gasp*) $399,000. Wait, what? I could buy some new hot wheels for that, or Dual Intel Xeon Platinum 8168, 2.7 GHz, 24-cores, 16 NVIDIA GPUs, 1.5 terabytes of RAM, and nearly 32 terabytes of SSD storage! The performance here is 2 petaFLOPS. Let’s be real: many of us probably won’t be able to afford it. However, NVIDIA does have leasing options, should you choose to try it. Practically speaking, this kind of beast finds its use in research work. In fact, the first DGX-1 was gifted to OpenAI by NVIDIA to promote AI research. Visit the NVIDIA website for more on these monster machines. There are also personal solutions available like the NVIDIA DGX Workstation. TPUs Now that you’ve caught your breath after reading about AI dream machines, let’s look at TPUs. Unlike the DGX machines, TPUs run on the cloud. A TPU is what’s referred to as an application-specific integrated circuit (ASIC) that has been designed specifically for machine learning and deep learning by Google. Here’s the key stats: Cloud TPUs can provide up to 11.5 petaflops of performance in a single pod. 
If you want to learn more, go to Google’s website.

When choosing GPUs you need to weigh up your options

The GTX 1080 Ti is the card most commonly used by researchers and in Kaggle competitions, as it gives good value for money. Go for this if you are sure about what you want to do with deep learning.

The GTX 1080 and GTX 1070 Ti are cheaper with less computing power, a more budget friendly option if you cannot afford the 1080 Ti. The GTX 1070 saves you some more money but is slower.

The GTX 1060 6GB and GTX 1050 Ti are good if you’re just starting off in the world of deep learning and don’t want to burn a hole in your pocket.

If you must have the absolute best GPU irrespective of the cost then the RTX 2080 Ti is your choice. It offers twice the performance for almost twice the cost of a 1080 Ti.
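The comparison graph credited to Slav Ivanov defines processing power as CUDA cores times clock frequency. As a rough sketch of that arithmetic, the snippet below applies it to some of the cards listed in this article, using the core counts, clock speeds and prices quoted above. The "power per dollar" ratio and the function names are assumptions added for illustration; the metric is crude and ignores memory size and bandwidth.

```python
# (CUDA cores, clock in MHz, listed price in USD), as quoted in this article.
GPUS = {
    "RTX 2080 Ti": (4352, 1545, 1199),
    "GTX 1080 Ti": (3584, 1582, 650),
    "GTX 1080": (2560, 1733, 584),
    "GTX 1070": (1920, 1683, 380),
    "GTX 1060 6GB": (1280, 1708, 290),
    "GTX 1050 Ti": (768, 1392, 200),
}


def processing_power(cores, clock_mhz):
    """Crude throughput proxy: CUDA cores times clock frequency."""
    return cores * clock_mhz


def power_per_dollar(cores, clock_mhz, price_usd):
    """Hypothetical value metric: processing power per dollar spent."""
    return processing_power(cores, clock_mhz) / price_usd


# Rank the cards from best to worst value under this metric.
ranking = sorted(GPUS, key=lambda name: power_per_dollar(*GPUS[name]), reverse=True)
best_value = ranking[0]

for name in ranking:
    cores, clock, price = GPUS[name]
    power = processing_power(cores, clock)
    value = power_per_dollar(cores, clock, price)
    print(f"{name:13s} power={power:>9,d}  per dollar={value:8.1f}")
```

With the figures quoted in this article, the GTX 1080 Ti comes out on top of this ranking, which lines up with the value-for-money recommendation above. Prices change quickly, so treat the ranking as a snapshot and not as buying advice.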


Raka Mahesa

01 May 2017

5 min read

WebGL 2.0: What you need to know


Earlier this year, Google and Mozilla released a version of Chrome and Firefox that has full support for WebGL 2.0. While some of the previous versions of their browsers also have support for WebGL 2.0, those versions by default disable the WebGL 2.0 feature. By enabling WebGL 2.0 in their latest browser version, it seems both Google and Mozilla are confident that this bleeding edge web technology can finally be used by most users without any problems. So, what is WebGL 2.0? How does it differ from the previous version of WebGL? What, in fact, is WebGL? To answer those questions, let's go back in time a little bit. In the early 1990s, graphics intensive applications were expensive to create because the software had to be customized for each type of graphic processing hardware. Imagine having to write an app for each smartphone vendor separately; it would cost many man hours. So, to mitigate this problem, a standard for graphics computing was introduced. This standard is called OpenGL (which stands for Open Graphics Library). When mobile phones with display screens were introduced, people realized that mobile technology also needed a standard for graphics computing. However, OpenGL is a standard primarily for desktop-class hardware, so they realized that they would need a different standard that could work with the limited capability of mobile hardware. And thus OpenGL ES (Embedded System) was branched out from OpenGL, and the initial version was released in the early 2000s. The same progression happened to web technology. In 2009, web applications became increasingly graphic-intensive, so a graphical standard called WebGL was introduced to help software developers. One thing noted, however, was that users could access web applications from both desktop and mobile devices, so WebGL needed to work on both platforms. To accommodate that, WebGL was created based on the OpenGL ES specification instead of the desktop-focused OpenGL. Technology keeps advancing. 
As graphics hardware becomes more capable, additional features get added to the graphics standards. OpenGL ES 3.0 was released in 2012 to keep up with advances in mobile GPUs. WebGL 1.0, however, was still based on OpenGL ES 2.0. So in 2017, the specification for WebGL 2.0, which is based on OpenGL ES 3.0, was finally released. As we can see from the timeline, WebGL 2.0 is really fresh out of the oven. In fact, it's so new that, at the time of writing this article, the only browsers that support the standard are Google Chrome, Mozilla Firefox, and Opera. WebGL 2.0 support in Safari is still under development. Also, it's worth noting that no mobile browser supports WebGL 2.0 by default (WebGL 2.0 support in Chrome for Android can be enabled via a hidden menu). Considering the limited number of compatible platforms, as developers we really can't rely on the user having the necessary browser for our apps. So, with that limitation in mind, we always have to check the browser's capabilities and prepare a fallback for when the browser does not support WebGL 2.0.

So, how does WebGL version 2.0 differ from version 1.0? Fortunately, nothing major has changed in the way the library is used. This latest version of WebGL simply adds features and makes some previously optional extensions part of the core. One of the WebGL 1.0 extensions that has become a standard part of WebGL 2.0 is the Instancing extension, which enables developers to render multiple copies of the same mesh efficiently. This feature is very useful for drawing decorative objects, like grass. Another extension that has been included in WebGL 2.0 is Depth Texture, which is used a lot for computing lighting and creating shadow maps. Another major addition to WebGL 2.0 is support for GLSL 3.0 ES, the shading language that accompanies OpenGL ES 3.0.
With this version of GLSL, loop bounds in a shader are no longer restricted to constant integers. Not just that, GLSL 3.0 ES also brings additional matrix operations (like an inverse function) that make coding a shader much easier. WebGL 2.0 also offers much better support for textures. With version 2.0, non-power-of-two textures are finally supported, which means the size of your texture image is no longer limited to 32, 64, 128, 256, and so on. 3D textures are also supported now, which is pretty useful for volumetric effects such as light rays and smoke, as well as for storing medical scans. WebGL 2.0 also adds support for more texture formats such as RGBA32, RGBA16, R11F_G11F_B10F, SRGB8, and others. More compressed texture formats that are not platform-specific are also supported, including COMPRESSED_RGB8_ETC2, COMPRESSED_RGBA8_ETC2_EAC, and more. There are other additions to WebGL 2.0, such as Multiple Draw Buffer, Transform Feedback, Uniform Buffer Object, and more. To learn about these additions in detail, see the official WebGL 2.0 specification.

About this author

Raka Mahesa is a game developer at Chocoarts (http://chocoarts.com/) who is interested in digital technology in general. Outside of work hours, he likes to work on his own projects, with Corridoom VR being his latest released game. Raka also regularly tweets as @legacy99.


Natasha Mathur

19 Sep 2018

13 min read

Best Machine Learning Datasets for beginners


“It’s not who has the best algorithm that wins. It’s who has the most data” ~ Andrew Ng

If you look at the way algorithms were trained in Machine Learning five or ten years ago, you will notice one huge difference. Training in Machine Learning is much better and more efficient today than it was a few years ago. Much of the credit goes to the hefty amount of data that is available to us today. But how does Machine Learning make use of this data? Let’s have a look at the definition of Machine Learning: “Machine Learning provides computers or machines the ability to automatically learn from experience without being explicitly programmed”. Machines “learn from experience” when they’re trained, and this is where data comes into the picture. How are they trained? Datasets! This is why it is so crucial that you feed these machines the right data for whatever problem you want them to solve.

Why do datasets matter in Machine Learning?

The simple answer is that machines, like humans, are capable of learning once they see relevant data. Where they differ from humans is in the amount of data they need to learn from. You need to feed your machines enough data for them to do anything useful for you. This is why machines are trained using massive datasets. We can think of machine learning data like survey data: the larger and more complete your sample is, the more reliable your conclusions will be. If the data sample isn’t large enough, it won’t capture all the variations, and your machine may reach inaccurate conclusions, learn patterns that don’t really exist, or fail to recognize patterns that do. Datasets help bring the data to you. Datasets train the model to perform various actions. They let the algorithms uncover relationships, detect patterns, understand complex problems, and make decisions.
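The survey analogy can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the function name estimate_spread, the 30% "true" answer rate, and the sample sizes are all made up for this example. It repeats a simulated yes/no survey many times at each sample size and measures how much the estimated rate varies from one survey to the next.

```python
import random
import statistics


def estimate_spread(sample_size, true_rate=0.3, repeats=200, seed=0):
    """Return how much the estimated rate varies across repeated surveys.

    true_rate, repeats, and seed are illustrative assumptions, not values
    taken from any real dataset.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        # One simulated survey: ask sample_size respondents a yes/no question.
        yes_votes = sum(rng.random() < true_rate for _ in range(sample_size))
        estimates.append(yes_votes / sample_size)
    # Population standard deviation of the estimates across all surveys.
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"sample size {n:>4}: spread of estimates = {estimate_spread(n):.3f}")
```

Running it should show the spread shrinking as the sample grows, which is the same reason a model trained on too little data can latch onto patterns that are not really there.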
Apart from using datasets, it is equally important to make sure that you are using the right dataset: one that is in a useful format and comprises all the meaningful features and variations. After all, the system will ultimately do what it learns from the data. Feeding the right data into your machines also helps ensure that they work effectively and produce accurate results without human intervention. For instance, training a speech recognition system with a textbook English dataset will result in your machine struggling to understand anything but textbook English. So, any loose grammar, foreign accents, or speech disorders would be missed. For such a system, a dataset covering the many variations in a spoken language among speakers of different genders, ages, and dialects would be the right option. So keep in mind that the quality, variety, and quantity of your training data must not be compromised, as all these factors help determine the success of your machine learning models.

Top Machine Learning Datasets for Beginners

Now, there are a lot of datasets available today for use in your ML applications. It can be confusing, especially for a beginner, to determine which dataset is the right one for your project. It is better to use a dataset that can be downloaded quickly and doesn’t take much effort to adapt to your models. Further, always use standard datasets that are well understood and widely used. This lets you compare your results with others who have used the same dataset to see if you are making progress. You can pick the dataset you want to use depending on the type of your Machine Learning application. Here’s a rundown of easy and commonly used datasets for training Machine Learning applications across popular problem areas, from image processing to video analysis to text recognition to autonomous systems.
Image Processing

There are many image datasets to choose from depending on what you want your application to do. Image processing in Machine Learning is used to train the machine to process images and extract useful information from them. For instance, if you’re working on a basic facial recognition application, then you can train it using a dataset that has thousands of images of human faces. This is how Facebook knows people in group pictures. This is also how image search works in Google and in other visual search based product sites.

Dataset Name | Brief Description
10k US Adult Faces Database | This database consists of 10,168 natural face photographs and several measures for 2,222 of the faces, including memorability scores, computer vision, and psychological attributes. The face images are JPEGs with 72 pixels/in resolution and 256-pixel height.
Google's Open Images | Open Images is a dataset of 9 million URLs to images which have been annotated with labels spanning over 6000 categories. These labels cover more real-life entities and the images are listed as having a Creative Commons Attribution license.
Visual Genome | This is a dataset of over 100k images densely annotated with numerous region descriptions (girl feeding elephant), objects (elephants), attributes (large), and relationships (feeding).
Labeled Faces in the Wild | This database comprises more than 13,000 images of faces collected from the web. Each face is labeled with the name of the person pictured.

Fun and easy ML application ideas for beginners using image datasets:
Cat vs Dogs: Using Cat and Stanford Dogs dataset to classify whether an image contains a dog or a cat.
Iris Flower classification: You can build an ML project using Iris flower dataset where you classify the flowers in any of the three species.
What you learn from this toy project will help you classify content based on physical attributes, and build some fun real-world projects like fraud detection, criminal identification, pain management (e.g. ePAT, which detects facial hints of pain using facial recognition technology), and so on.
Hot dog - Not hot dog: Use the Food 101 dataset to classify different food types as a hot dog or not. Who knows, you could end up becoming the next Emmy award nominee!

Sentiment Analysis

As a beginner, you can create some really fun applications using a Sentiment Analysis dataset. Sentiment Analysis in Machine Learning applications is used to train machines to analyze and predict the emotion or sentiment associated with a sentence, word, or piece of text. This is often used in movie or product reviews. If you are creative enough, you could even identify topics that will generate the most discussion using sentiment analysis as a key tool.

Dataset Name | Brief Description
Sentiment140 | A popular dataset, which uses 1.6 million tweets with emoticons pre-removed.
Yelp Reviews | An open dataset released by Yelp, containing more than 5 million reviews on Restaurants, Shopping, Nightlife, Food, Entertainment, etc.
Twitter US Airline Sentiment | Twitter data on US airlines starting from February 2015, labeled as positive, negative, and neutral tweets.
Amazon reviews | This dataset contains over 35 million reviews from Amazon spanning 18 years. Data include information on products, user ratings, and the plaintext review.

Easy and Fun Application ideas using Sentiment Analysis Dataset:
Positive or Negative: Using Sentiment140 dataset in a model to classify whether given tweets are negative or positive.
Happy or unhappy: Using Yelp Reviews dataset in your project to help the machine figure out whether the person posting the review is happy or unhappy.
Good or Bad: Using Amazon Reviews dataset, you can train a machine to figure out whether a given review is good or bad.
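To give a feel for what a "positive or negative" project looks like in code, here is a minimal sketch of a Naive Bayes text classifier written with only the Python standard library. It is an illustration under stated assumptions, not the method any of the datasets above prescribes: the six training sentences are hypothetical stand-ins for a real labeled dataset such as Sentiment140, and the function names train and classify are made up for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy examples standing in for a real labeled dataset.
TRAINING_DATA = [
    ("i love this phone it is great", "positive"),
    ("what a great and happy day", "positive"),
    ("love the friendly service", "positive"),
    ("i hate this terrible phone", "negative"),
    ("what a bad and sad day", "negative"),
    ("terrible service and rude staff", "negative"),
]


def train(examples):
    """Count words per label; these counts are the whole 'model'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest Naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Start from the prior: how common this label is in the training data.
        score = math.log(count / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # Ignore words never seen during training.
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAINING_DATA)
    for sentence in ("i love this great day", "terrible bad phone"):
        print(sentence, "->", classify(sentence, model))
```

With a real dataset you would replace TRAINING_DATA with the loaded tweets or reviews and hold some of them back to measure accuracy; a library implementation would normally be preferred over hand-rolled counting once the idea is clear.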
Natural Language Processing

Natural language processing deals with training machines to process and analyze large amounts of natural language data. This is how search engines like Google know what you are looking for when you type in your search query. Use these datasets to make a basic and fun NLP application in Machine Learning:

Dataset Name | Brief Description
Speech Accent Archive | This dataset comprises 2140 speech samples from different talkers reading the same passage. The talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English.
Wikipedia Links data | This dataset consists of almost 1.9 billion words from more than 4 million articles. Search is possible by word, phrase or part of a paragraph itself.
Blogger Corpus | A dataset comprising 681,288 blog posts gathered from blogger.com. Each blog contains a minimum of 200 occurrences of commonly used English words.

Fun Application ideas using NLP datasets:
Spam or not: Using Spambase dataset, you can enable your application to figure out whether a given email is spam or not.

Video Processing

Video Processing datasets are used to teach machines to analyze and detect different settings, objects, emotions, or actions and interactions in videos. You’ll have to feed your machine a lot of data on different actions, objects, and activities.

Dataset Name | Brief Description
UCF101 - Action Recognition Data Set | This dataset comes with 13,320 videos from 101 action categories.
YouTube 8M | YouTube-8M is a large-scale labeled video dataset. It contains millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities.

Fun Application ideas using video processing dataset:
Action detection: Using UCF101 - Action Recognition Data Set, or YouTube 8M, you can train your application to detect actions such as walking, running, etc. in a video.
Speech Recognition

Speech recognition is the ability of a machine to analyze or identify words and phrases in a spoken language. Feed your machine the right data, and enough of it, and it will get better at recognizing speech. Combine speech recognition with natural language processing, and you get Alexa, who knows what you need.

Dataset Name | Brief Description
Gender Recognition by Voice and speech analysis | This database identifies a voice as male or female, depending on the acoustic properties of voice and speech. The dataset contains 3,168 recorded voice samples, collected from male and female speakers.
Human Activity Recognition w/Smartphone | Human Activity Recognition database consists of recordings of 30 subjects performing activities of daily living (ADL) while carrying a smartphone (Samsung Galaxy S2) on the waist.
TIMIT | TIMIT provides speech data for acoustic-phonetic studies and for the development of automatic speech recognition systems. It comprises broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, along with phonetic and word transcriptions.
Speech Accent Archive | This dataset contains 2140 speech samples, each from a different talker reading the same passage. Talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English.

Fun Application ideas using Speech Recognition dataset:
Accent detection: Use Speech Accent Archive dataset to make your application identify different accents from a given sample of accents.
Identify the activity: Use Human Activity Recognition w/Smartphone dataset to help your application detect the human activity.

Natural Language Generation

Natural Language generation refers to the ability of machines to simulate human speech. It can be used to translate written information into aural information or assist the vision-impaired by reading aloud the contents of a display screen.
This is how Alexa or Siri respond to you.

Dataset Name | Brief Description
Common Voice by Mozilla | Common Voice dataset contains speech read by users on the Common Voice website, with text drawn from a number of public sources like user-submitted blog posts, old books, movies, etc.
LibriSpeech | This dataset consists of nearly 500 hours of clean speech of various audiobooks read by multiple speakers, organized by chapters of the book with both the text and the speech.

Fun Application ideas using Natural Language Generation dataset:
Converting text into Audio: Using Blogger Corpus dataset, you can train your application to read out loud the posts on blogger.

Autonomous Driving

Build some basic self-driving Machine Learning applications. These self-driving datasets will help you train your machine to sense its environment and navigate accordingly without any human intervention. Autonomous cars, drones, warehouse robots, and others use these algorithms to navigate correctly and safely in the real world. Datasets are even more important here, as the stakes are higher and the cost of a mistake could be a human life.

Dataset Name | Brief Description
Berkeley DeepDrive BDD100k | This is currently one of the largest datasets for self-driving AI. It comprises over 100,000 videos covering more than 1,100 hours of driving across different times of the day and weather conditions.
Baidu Apolloscapes | Large dataset consisting of 26 different semantic items such as cars, bicycles, pedestrians, buildings, street lights, etc.
Comma.ai | This dataset consists of more than 7 hours of highway driving. It includes details on the car’s speed, acceleration, steering angle, and GPS coordinates.
Cityscape Dataset | This is a large dataset that contains recordings of urban street scenes in 50 different cities.
nuScenes | This dataset consists of more than 1000 scenes with around 1.4 million images, 400,000 sweeps of lidars (laser-based systems that detect the distance between objects), and 1.1 million 3D bounding boxes (objects detected with a combination of RGB cameras, radar, and lidar).

Fun Application ideas using Autonomous Driving dataset:
A basic self-driving application: Use any of the self-driving datasets mentioned above to train your application with different driving experiences for different times and weather conditions.

IoT

Machine Learning in building IoT applications is on the rise these days. Now, as a beginner in Machine Learning, you may not have advanced knowledge of how to build these high-performance IoT applications using Machine Learning, but you certainly can start off with some basic datasets to explore this exciting space.

Dataset Name | Brief Description
Wayfinding, Path Planning, and Navigation Dataset | This dataset consists of samples of trajectories in an indoor building (Waldo Library at Western Michigan University) for navigation and wayfinding applications.
ARAS Human Activity Dataset | This dataset is a human activity recognition dataset collected from two real houses. It involves over 26 million sensor readings and over 3000 activity occurrences.

Fun Application ideas using IoT dataset:
Wearable device to track human activity: Use the ARAS Human Activity Dataset to train a wearable device to identify human activity.

Read Also: 25 Datasets for Deep Learning in IoT

Once you’re done going through this list, it’s important not to feel restricted. These are not the only datasets which you can use in your Machine Learning applications. You can find many more online which might work best for the type of Machine Learning project that you’re working on. Some popular sources of a wide range of datasets are Kaggle, UCI Machine Learning Repository, KDnuggets, Awesome Public Datasets, and Reddit Datasets Subreddit.
With all this information, it is now time to use these datasets in your project. In case you’re completely new to Machine Learning, you will find reading ‘A nonprogrammer’s guide to learning Machine learning’ quite helpful. Regardless of whether you’re a beginner or not, always remember to pick a dataset which is widely used and can be downloaded quickly from a reliable source.

How to create and prepare your first dataset in Salesforce Einstein
Google launches a Dataset Search Engine for finding Datasets on the Internet
Why learn machine learning as a non-techie?


Aaron Lazar

28 May 2018

7 min read

Python web development: Django vs Flask in 2018


A colleague of mine wrote an article over two years ago about the two top Python web frameworks, Django and Flask. It’s 2018 now, and a lot has changed in the IT world. A couple of frameworks have emerged or gained popularity in the last 3 years, like Bottle or CherryPy, for example. However, Django and Flask have managed to stand their ground and remain the top 2 Python frameworks. Moreover, there have been some major breakthroughs in web application architecture, like the rise of Microservices, which has in turn pushed the growth of newer architectures like Serverless and Cloud-Native. I thought it would be a good idea to present a more modern comparison of these two frameworks, to help you make an informed decision on which one you should choose for your application development. So before we dive into ripping these frameworks apart, let’s briefly go over a few factors we’ll be considering while evaluating them. Here’s what I have in mind, in no particular order:

Ease of use
Popularity
Community support
Job market
Performance
Modern Architecture support

Ease of use

This is something I like to cover first, because I know it’s really important for developers who are just starting out to assess the learning curve before they attempt to scale it. When I’m talking about ease of use, I’m talking about how easy it is to get started with using the tool in your day to day projects. Flask, like its webpage, is a very simple tool to learn, simply because it’s built to be simple. Moreover, the framework is un-opinionated, meaning that it will allow you to implement things the way you choose to, without kicking up a fuss. This is really important when you’re starting out. You don’t want to run into too many issues that will break your confidence as a developer. On the other hand, Django is a great framework to learn too.
While several Python developers will disagree with me, I would say Django is a pretty complex framework, especially for a newbie. Now this is not all that bad, right? I mean, especially when you’re building a large project, you want to be the one holding the reins. If you’re starting out with some basic projects, then it may be wise not to choose Django. The way I see it, learning Flask first will allow you to learn Django much faster.

Popularity

Both frameworks are quite popular, with Django starring at 34k on GitHub, and Flask having a slight edge at 36k. If you take a look at Google Trends, both tools follow a pretty similar trend, with Django’s volume much higher, owing to its longer existence.

Source: SEM Rush

As mentioned before, Flask is more popular among beginners and those who want to build basic websites easily. On the other hand, Django is more popular among professionals who have years of experience building robust websites.

Community Support and Documentation

In terms of community support, we’re looking at how involved the community is in developing the tool and providing support to those who need it. This is quite important for someone who’s starting out with a tool, or for that matter, when a new version is released and you need to keep yourself up to date. Django features 170k tags on Stack Overflow, which is over 7 times that of Flask, which stands at 21k. Although Django is a clear winner in terms of numbers, both mailing lists are quite active and you can receive all the help you need quite easily. When it comes to documentation, Django has some solid documentation that can help you get up and running in no time. On the other hand, Flask has good documentation too, but you usually have to do some digging to find what you’re looking for.
Job Scenes

Jobs are really important, especially if you’re looking for a corporate one. It’s quite natural that the organization that’s hiring you will already be working with a particular stack, and they will expect you to have those skills before you step in. Django records around 2k jobs on Indeed in the USA, while Flask records exactly half that amount. A couple of years ago, the situation was pretty much the same; Django was a prime requirement, while Flask had just started gaining popularity. You’ll find a comment stating that “Picking up Flask might be a tad easier then Django, but for Django you will have more job openings”. Itjobswatch.uk lists Django as the 2nd most needed skill for a Python Developer, whereas Flask is way down at 20.

Source: itjobswatch.uk

Clearly Django is in more demand than Flask. However, if you are an independent developer, you’re still free to choose the framework you wish to work with.

Performance

Honestly speaking, Flask is a microframework, which means it delivers much better performance in terms of speed. This is also because in Flask, you could write 10k lines of code for something that would take 24k lines in Django.

Response time comparison for data from remote server: Django vs Flask

In the above image we see how both tools perform in terms of loading a response from the server and then returning it. Both tools are pretty much the same, but Flask has a slight edge over Django.

Load time comparison from database with ORM: Django vs Flask

In this image, we see that the gap between the tools is quite large, with Flask being much more efficient in loading data from the database. When we talk about performance, we also need to consider the power each framework provides you when you want to build large apps. Django is a clear winner here, as it allows you to build massive, enterprise grade applications. Django serves as a full-stack framework, which can easily be integrated with JavaScript to build great applications.
On the other hand, Flask is not suitable for large applications. The JetBrains Python Developer Survey revealed that Django was the more preferred option among the respondents.

JetBrains Python Developer Survey 2017

Modern Architecture Support

The monolith has been broken and microservices have risen. What’s interesting is that although applications are huge, they’re now composed of smaller services working together to make up the actual application. While you would think Django would be a great framework to build microservices, it turns out that Flask serves much better, thanks to its lightweight architecture and simplicity. While you work on a huge enterprise application, you might find Flask being interwoven wherever a light framework works best. Here’s the story of one company that ditched Django for microservices.

I’m not going to score these tools because they’re both awesome in their own right. The difference arises when you need to choose one for your projects, and it’s quite evident that Flask should be your choice when you’re working on a small project or maybe a smaller application built into a larger one, such as a blog, a small website, or a web service. However, if you’re on the A team, making a super awesome website for, say, Facebook or a billion dollar enterprise, instead of going the Django unchained route, choose Django with a hint of Flask added in, for all the right reasons. :) Django hit version 2.0 last year, while Flask hit version 1.0 last month. Here are some great resources to get you up and running with Django and Flask. So what are you waiting for? Go build that website!

Why functional programming in Python matters
Should you move to Python 3.7
Why is Python so good for AI and Machine Learning?


Aaron Lazar

30 Oct 2017

6 min read

The Deep Learning Framework Showdown: TensorFlow vs CNTK


The question several Deep Learning engineers may ask themselves is: Which is better, TensorFlow or CNTK? Well, we're going to answer that question for you, taking you through a closely fought match between the two most exciting frameworks. So, here we are, ladies and gentlemen, it's fight night and it's a full house. In the Red corner, weighing in at two hundred and seventy pounds of Python and topping out at over ten thousand frames per second; managed by the American tech giant, Google; we have the mighty, the beefy, TensorFlow! In the Blue corner, weighing in at two hundred and thirty pounds of C++ muscle, we have one of the top toolkits that can comfortably scale beyond a single machine. Managed by none other than Microsoft, it's fast, it's furious, it's CNTK aka the Microsoft Cognitive Toolkit! And we're into Round One… TensorFlow and CNTK are looking quite menacingly at each other and are raring to take down their opponents. TensorFlow seems pleased that its compile times are considerably faster than those of its older rival, Theano. Although, it looks like happiness came a tad too soon. CNTK, light and bouncy on its feet, comes straight out of nowhere with a whopping seventy thousand frames/second uppercut, knocking TensorFlow to the floor. TensorFlow looks like it's in no mood to give up anytime soon. It makes itself so simple to use and understand that even students can pick it up and start training their own models. This isn't the case with CNTK, which still needs to shed its complexity. On the other hand, CNTK seems to be thrashing TensorFlow in terms of 3D convolution, where CNTK can clearly recognize images from streaming content. TensorFlow also tries its best to run LSTM RNNs, but in vain. The crowd keeps cheering on… Wait a minute...are they calling out for TensorFlow? Yes they are! There's hardly any cheering for CNTK. This is embarrassing! Looks like its community support can't match up to TensorFlow's.
And ladies and gentlemen, that does make a difference - we can see TensorFlow improving on several fronts and gradually getting back in the game! TensorFlow huffs and puffs as it tries to prove that it's not just about deep learning and that it has tools in its pocket that can support other algorithms such as reinforcement learning. It conveniently whips out TensorBoard, and drops CNTK to the floor with its beautiful visualizations. TensorFlow now has the upper hand and is trying hard to pin CNTK to the floor, using its R support to finish it off. But CNTK tactfully breaks loose and leaves TensorFlow on the floor - still not ready to be used in production. And there goes the bell for Round One! Both fighters look exhausted but you can see a faint twinkle in TensorFlow's eye, primarily because it survived Round One. Google seems to be working hard to prep it for Round Two and is making several improvements in terms of speed and flexibility and, above all, making it ready for production. Meanwhile, Microsoft boosts CNTK's spirits with a shot of Python APIs in its blood. As it moves towards version 2.0, there are a lot of improvements to CNTK with which Microsoft has ensured that it's not left behind, like having a backend for Keras, which puts it on par with TensorFlow. Moreover, there seem to be quite a few experimental features that it looks ready to enter the ring with, like the Java API for example. It's the final round and boy, are these two into a serious stare-down! The referee waves them in and off they go. CNTK needs to get back at TensorFlow. Comfortably supporting multiple GPUs and CPUs out of the box, across both the Windows and Linux platforms, it has an advantage over TensorFlow. Is it going to use that trump card? Yes it is! A thousand GPUs and a hundred machines in, and CNTK is raining blows on TensorFlow. TensorFlow clearly drops the ball when it comes to multiple machines, and it rather complicates things.
It's high time that TensorFlow turned the tables. Lo and behold! It shows off its mobile deep learning capabilities with TensorFlow Lite, clearly flipping CNTK flat on its back. This is revolutionary and a tremendous breakthrough for TensorFlow! CNTK, however, is clearly the people's choice when it comes to language compatibility. With support for C++, Python, C#/.NET and now Java, it's clearly winning in this area. Round Two is coming to an end, ladies and gentlemen, and it's a neck and neck battle out there. We're not sure the judges are going to be able to choose a clear winner, from the looks of it. And…. there goes the bell! While the scores are being tallied, we go over to the teams and some spectators for some gossip on the what's what of deep learning. Did you know having multiple machine support is a huge advantage? It increases speed and efficiency by almost 10 times! That's something! We also got to know that TensorFlow is training hard and is picking up positives from its rival, CNTK. There are also rumors about a new kid called MXNet (read about it here), that has APIs in R, Python and even in Julia! This makes it one helluva framework in terms of flexibility and speed. In fact, AWS is already implementing it while Apple also is rumored to be using it. Clearly, something to watch out for. And finally, the judges have made their decision. Ladies and gentlemen, after two rounds of sheer entertainment, we have the results...

Criterion | TensorFlow | CNTK
Processing speed | 0 | 1
Learning curve | 1 | 0
Production readiness | 0 | 1
Community support | 1 | 0
CPU, GPU computation support | 0 | 1
Mobile deep learning | 1 | 0
Multiple language compatibility | 0 | 1

It's a unanimous decision and just as we thought, CNTK is the heavyweight champion! CNTK clearly beat TensorFlow in terms of performance, because of its flexibility, speed and readiness for use in production!
As a Deep Learning engineer, should you want to use one of these frameworks in your tasks, you should check out their features thoroughly, try them out on a test dataset and only then apply them to your actual data. After all, it's the choices we make that define a win or a loss - simplicity over resource utilisation, or speed over platform, we must choose our tools wisely. For more information on the kind of tests that both the tools have been put through, read the research paper presented by Shaohuai Shi, Qiang Wang, Pengfei Xu and Xiaowen Chu from the Department of Computer Science, Hong Kong Baptist University and these benchmarks.


Prasad Ramesh

14 Sep 2018

7 min read

What makes functional programming a viable choice for artificial intelligence projects?


Sugandha Lahoti

16 Jan 2018

9 min read

AI to the rescue: 5 ways machine learning can assist during emergency situations


In the wee hours of the night, on January 4 this year, over 9.8 million people experienced a magnitude 4.4 earthquake that rumbled across the San Francisco Bay Area. This was followed by a magnitude 7.6 earthquake in the Caribbean Sea on January 9, following which a tsunami advisory was in effect for Puerto Rico and the U.S. and British Virgin Islands. In the past 6 months, the United States alone has witnessed four back-to-back storms from one brutal hurricane season, and a massive wildfire with almost 2 million acres of land ablaze. Natural disasters across the globe are becoming more damaging and more frequent. Since 1970, the number of disasters worldwide has more than quadrupled to around 400 a year. This series of natural disasters has strained emergency services and disaster relief operations beyond capacity. We now need to look for newer ways to assist the affected people and automate the recovery process.

Artificial intelligence and machine learning have advanced to the state where they are highly proficient in making predictions, and in identification and classification tasks. These use cases of AI can also be applied to prevent disasters or respond quickly in case of an emergency. Here are 5 ways AI can lend a helping hand during emergency situations.

1. Machine Learning for targeted disaster relief management

In case of any disaster, the first step is to form a critical response team to help those in distress. Before the team goes into action, it is important to analyze and assess the extent of damage and to ensure that the right aid goes first to those who need it the most. AI techniques such as image recognition and classification can be quite helpful in assessing the damage as they can analyze and observe images from satellites. They can immediately and efficiently filter these images, which would have required months to be sorted manually.
AI can identify objects and features such as damaged buildings, flooding, and blocked roads in these images. It can also identify temporary settlements, which may indicate that people are homeless, so that the first care can be directed towards them. Artificial intelligence and machine learning tools can also aggregate and crunch data from multiple sources such as crowd-sourced mapping materials or Google Maps. Machine learning approaches then combine all this data, remove unreliable data, and identify informative sources to generate heat maps. These heat maps can identify areas in need of urgent assistance and direct relief efforts to those areas. Heat maps are also helpful for government and other humanitarian agencies in deciding where to conduct aerial assessments.

DigitalGlobe provides space imagery and geospatial content. Their Open Data Program is a special program for disaster response. The software learns how to recognize buildings on satellite photos by learning from the crowd. DigitalGlobe releases pre- and post-event imagery for select natural disasters each year, and their crowdsourcing platform, Tomnod, will prioritize micro-tasking to accelerate damage assessments.

Following the Nepal earthquakes in 2015, Rescue Global and academics from the Orchid Project used machine learning to carry out rescue activities. They took pre- and post-disaster imagery and utilized crowd-sourced data analysis and machine learning to identify locations affected by the quakes that had not yet been assessed or received aid. This information was then shared with relief workforces to facilitate their activities.

2. Next Generation 911

911 is the first point of contact during any emergency situation. 911 dispatch centers are already overloaded with calls on a regular day. In case of a disaster or calamity, the number of calls can quadruple or more. This calls for augmenting traditional 911 emergency centers with newer technologies for better management.
Traditional 911 centers rely on voice-based calls alone. Next-gen dispatch services are upgrading their emergency dispatch technology with machine learning to receive more types of data. They can now ingest data not just from calls but also from text, video, audio, and pictures, and analyze it to make quick assessments. The insights gained from all this information can be passed on to the emergency response teams out in the field so that they can carry out critical tasks efficiently.

The Association of Public-Safety Communications Officials (APCO) has employed IBM's Watson to listen to 911 calls. This initiative is to help emergency call centers improve operations and public safety by using Watson's speech-to-text and analytics programs. Using Watson's speech-to-text function, the context of each call is fed into the AI's analytics program, allowing improvements in how call centers respond to emergencies. It also helps reduce call times, provide accurate information, and accelerate time-sensitive emergency services.

3. Sentiment analysis on social media data for disaster management and recovery

Social media channels are a major source of news in present times. Some of the most actionable information during a disaster comes from social media users. Real-time images and comments from Facebook, Twitter, Instagram, and YouTube can be analyzed and validated by AI to separate genuine information from fake reports. These vital stats can help on-the-ground aid workers reach the point of crisis sooner and direct their efforts to the needy. This data can also help rescue workers reduce the time needed to find victims. In addition, AI and predictive analytics software can analyze digital content from Twitter, Facebook, and YouTube to provide early warnings, ground-level location data, and real-time report verification.
In fact, AI could also be used to view the unstructured data and background of pictures and videos posted to social channels and compare them to find missing people. AI-powered chatbots can help residents affected by a calamity. The chatbot can interact with the victim, or other citizens in the vicinity, via popular social media channels and ask them to upload information such as a location, a photo, and a short description. The AI can then validate this information against other sources and pass on the relevant details to the disaster relief committee. This type of information can assist them with assessing damage in real time and help prioritize response efforts.

AI for Digital Response (AIDR) is a free and open platform which uses machine intelligence to automatically filter and classify social media messages related to emergencies, disasters, and humanitarian crises. For this, it uses a Collector and a Tagger. The Collector works as a word-filter: it collects and filters tweets using keywords and hashtags such as "cyclone" and "#Irma". The Tagger is a topic-filter which classifies tweets by topics of interest, such as "Infrastructure Damage" and "Donations". The Tagger automatically applies the classifier to incoming tweets collected in real time by the Collector.

4. AI answers distress and help-calls

Emergency relief services are flooded with distress and help calls in the event of any emergency situation. Managing such a huge volume of calls is time-consuming and expensive when done manually. There is also a chance that critical information is lost or overlooked. In such cases, AI can work as a 24/7 dispatcher. AI systems and voice assistants can analyze massive numbers of calls, determine what type of incident occurred and verify the location. They can not only interact with callers naturally and process those calls, but can also instantly transcribe and translate languages.
AI systems can analyze the tone of voice for urgency, filtering out redundant or less urgent calls and prioritizing the rest based on the emergency. Blueworx is a powerful IVR platform which uses AI to replace call center officials. Using AI technology is especially useful when unexpected events such as natural disasters drive up call volume. Their AI engine is well suited to responding to emergency calls because, unlike a call center agent, it can know who a customer is even before they call. It also provides intelligent call routing, proactive outbound notifications, unified messaging, and Interactive Voice Response.

5. Predictive analytics for proactive disaster management

Machine learning and other data science approaches are not limited to assisting the on-ground relief teams or assisting only after the actual emergency. Machine learning approaches such as predictive analytics can also analyze past events to identify and extract patterns and populations vulnerable to natural calamities. A large number of supervised and unsupervised learning approaches are used to identify at-risk areas and improve predictions of future events. For instance, clustering algorithms can group disaster data on the basis of severity. They can distinguish climatic patterns that may cause local storms from cloud conditions that may lead to a widespread cyclone. Predictive machine learning models can also help officials distribute supplies to where people are going, rather than where they were, by analyzing the real-time behavior and movement of people. In addition, predictive analytics techniques can provide insight into the economic and human impact of natural calamities. Artificial neural networks take in information such as region, country, and natural disaster type to predict the potential monetary impact of natural disasters.
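To make the clustering idea above concrete, here is a minimal sketch of k-means grouping a handful of storm readings by severity. The readings (wind speed in km/h, rainfall in mm), the `kmeans` helper and its deterministic initialisation are all illustrative assumptions, not something taken from the tools named in this article; a real system would use far richer features and a tested library.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means for k >= 2: assign points to the nearest centroid, then re-average."""
    # Deterministic start (an assumption for reproducibility): spread the
    # initial centroids across the sorted data instead of picking at random.
    ordered = sorted(points)
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep the old
        # centroid if a cluster happens to be empty.
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical readings: (wind speed in km/h, rainfall in mm).
events = [(35, 20), (40, 25), (38, 18), (150, 220), (160, 240), (155, 210)]
centroids, clusters = kmeans(events, k=2)

for label, members in zip(("lower severity", "higher severity"), clusters):
    print(label, members)
```

With this toy data the mild readings end up in one group and the cyclone-like readings in the other; which real-world features actually separate a local storm from a widespread cyclone is a modelling question the sketch does not answer.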
Recent advances in cloud technologies and numerous open source tools have enabled predictive analytics with almost no initial infrastructure investment. So agencies with limited resources can also build systems based on data science and develop more sophisticated models to analyze disasters. Optima Predict, a suite of software by Intermedix, collects and reads information about disasters such as viral outbreaks or criminal activity in real time. The software spots geographical clusters of reported incidents before humans notice the trend and then alerts key officials about it. The data can also be synced with FirstWatch, which is an online dashboard for EMS (Emergency Medical Services) personnel.

Thanks to the multiple benefits of AI, government agencies and NGOs can start utilizing machine learning to deal with disasters. As AI and allied fields like robotics further develop and expand, we may see fleets of drones equipped with sophisticated machine learning. These advanced drones could expedite access to real-time information at disaster sites using video capturing capabilities and also deliver lightweight physical goods to hard-to-reach areas. As with every progressing technology, AI will also build on its existing capabilities. It has the potential to eliminate outages before they are detected and give disaster response leaders an informed, clearer picture of the disaster area, ultimately saving lives.
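The Collector and Tagger stages described for AIDR earlier in this article can be pictured as a two-step pipeline: a word-filter followed by a topic-filter. The sketch below is only a rule-based stand-in under stated assumptions. It is not AIDR's code or API, the real Tagger applies a trained classifier rather than keyword rules, and the keyword sets, topic terms and sample messages are all invented for illustration.

```python
# Hypothetical word-filter keywords and topic vocabularies.
KEYWORDS = {"cyclone", "#irma"}
TOPIC_TERMS = {
    "Infrastructure Damage": {"collapsed", "bridge", "flooded", "damaged"},
    "Donations": {"donate", "donation", "relief fund"},
}

def collect(messages, keywords=KEYWORDS):
    """Collector stand-in: keep messages that mention any tracked keyword or hashtag."""
    return [m for m in messages if any(k in m.lower() for k in keywords)]

def tag(message, topic_terms=TOPIC_TERMS):
    """Tagger stand-in: label a message with every topic whose terms it mentions."""
    text = message.lower()
    return [topic for topic, terms in topic_terms.items()
            if any(term in text for term in terms)]

messages = [
    "Cyclone update: the main bridge has collapsed near the harbour",
    "Please donate to the #Irma relief fund",
    "Great weather for a picnic today",
]

collected = collect(messages)
tagged = {message: tag(message) for message in collected}
for message, topics in tagged.items():
    print(topics, "<-", message)
```

Keeping the two stages separate mirrors the description above: the cheap keyword filter cuts the stream down first, and the more expensive topic classification only runs on what survives.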


Amarabha Banerjee

01 Dec 2017

10 min read

Iterative Machine Learning: A step towards Model Accuracy


Learning something by rote, i.e., repeating it many times, perfecting a skill by practising it over and over again, or building something by making minor adjustments progressively to a prototype are things that come to us naturally as human beings. Machines can also learn this way and this is called 'Iterative machine learning'. In most cases, iteration is an efficient learning approach that helps reach the desired end results faster and more accurately without becoming a resource crunch nightmare.

Now, you might wonder, isn't iteration inherently part of any kind of machine learning? After all, modern day machine learning techniques across the spectrum, from basic regression analysis, decision trees and Bayesian networks to advanced neural nets and deep learning algorithms, have some inherent iterative component built into them. What is the need, then, for discussing iterative learning as a standalone topic? This is simply because introducing iteration externally to an algorithm can minimize the error margin and therefore help in accurate modelling.

How Iterative Learning works

Let's understand how iteration works by looking closely at what happens during a single iteration flow within a machine learning algorithm. A pre-processed training dataset is first introduced into the model. After processing and model building with the given data, the model is tested, and then the results are matched with the desired result/expected output. The feedback is then returned to the system for the algorithm to further learn and fine tune its results. This clearly shows that two iteration processes take place here:

Data Iteration - Inherent to the algorithm
Model Training Iteration - Introduced externally

Now, what if we did not feed the results back into the system, i.e. did not allow the algorithm to learn iteratively but instead adopted a sequential approach? Would the algorithm work and would it provide the right results? Yes, the algorithm would definitely work.
However, the quality of the results it produces is going to vary vastly based on a number of factors: the quality and quantity of the training dataset, the feature definition and extraction techniques employed, and the robustness of the algorithm itself, among many others. Even if all of the above were done perfectly, there is still no guarantee that the results produced by a sequential approach will be highly accurate. In short, the results may be neither accurate nor reproducible. Iterative learning thus allows algorithms to improve model accuracy. Certain algorithms have iteration central to their design and can be scaled as per the data size. These algorithms are at the forefront of machine learning implementations because of their ability to perform faster and better. In the following sections we will discuss iteration in a different set of algorithms from each of the three main machine learning approaches - supervised ML, unsupervised ML and reinforcement learning.

The Boosting algorithms: Iteration in supervised ML

The boosting algorithms, inherently iterative in nature, are a brilliant way to improve results by minimizing errors. They are primarily designed to reduce bias in results and to transform a particular set of weak learning classifier algorithms into strong learners that make fewer errors. Some examples are AdaBoost (Adaptive Boosting), Gradient Tree Boosting and XGBoost.

How they work

All boosting algorithms have a common set of classifiers which are iteratively modified to reach the desired result. Let's take the example of finding cases of plagiarism in a certain article. The first classifier here would be a group of words which, if it appears somewhere else or in another article, results in a red flag. If we create 10 separate groups of words and term them classifiers 1 to 10, then our article will be checked against these classifiers and any possible matches will be red flagged.
But the absence of red flags from these 10 classifiers does not mean the article is definitely 100% original. Thus, we would need to update the classifiers, perhaps creating shorter groups based on the first pass, and improve the accuracy with which the classifiers can find similarity with other articles. This iteration process in Boosting algorithms eventually leads us to a fairly high rate of accuracy, because after each iteration the classifiers are updated based on their performance. The ones which have close similarity with other content are updated and tweaked so that we can get a better match. This process of iteratively improving the algorithm is termed boosting and is currently one of the most popular methods in Supervised Machine Learning.

Strengths & weaknesses

The obvious advantage of this approach is that it allows minimal errors in the final model, as the iteration enables the model to correct itself every time there is an error. The downside is the higher processing time and the overall memory requirement for a large number of iterations. Another important aspect is that the error used to train the model is fed back externally, which means the supervisor has control over the model and how it is modified. This in turn has a downside: the model doesn't learn to eliminate error on its own. Hence, the model cannot be ported to another dataset, as it would need to start the learning process from scratch.

Artificial Neural Networks: Iteration in unsupervised ML

Neural Networks have become the poster child for unsupervised machine learning because of their accuracy in predicting data models.
Some well known neural networks are Convolutional Neural Networks, Boltzmann Machines, Recurrent Neural Networks, Deep Neural Networks and Memory Networks.

How they work

Artificial neural networks are highly accurate in simulating data models mainly because of their iterative process of learning. But this process is different from the one we explored earlier for Boosting algorithms. Here the process is seamless and natural and in a way it paves the way for reinforcement learning in AI systems. Neural Networks consist of electronic networks simulating the way the human brain works. Every network has an input node and an output node, with hidden layers in between that consist of algorithms. The input node is given the initial data set to perform a set of actions, and each iteration creates a result that is output as a string of data. This output is then matched with the actual result dataset and the error is fed back to the input node. This error then enables the algorithms to correct themselves and get closer and closer to the actual dataset. This process is called training the Neural Network and each iteration improves the accuracy. The key difference between the iteration performed here and the iteration performed by Boosting algorithms is that here we don't have to update the classifiers manually; the algorithms change themselves based on the error feedback.

Strengths & weaknesses

The main advantage of this process is obviously the level of accuracy that it can achieve on its own. The model is also reusable because it learns the means to achieve accuracy rather than just giving you a direct result. The flip side of this approach is that the models can go badly wrong and deviate in a completely different direction. This is because the induced iteration takes its own course and doesn't need human supervision. The Facebook chat-bots deviating from their original goal and communicating among themselves in a language of their own is a case in point.
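The error-feedback loop described above can be sketched with the smallest possible "network": a single linear neuron whose weight and bias are nudged after every sample in proportion to its error. This is a simplified illustration under assumptions of our own (the toy data following y = 2x + 1, the learning rate and the `train` helper are all hypothetical), not a description of any particular framework.

```python
def train(samples, epochs=200, learning_rate=0.05):
    """Train one linear neuron by feeding each prediction error back into its parameters."""
    weight, bias = 0.0, 0.0
    history = []  # mean squared error per epoch, to watch the iteration converge
    for _ in range(epochs):
        total_error = 0.0
        for x, target in samples:
            output = weight * x + bias      # forward pass
            error = output - target         # compare with the expected result
            # Feedback step: move each parameter against its share of the error.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
            total_error += error ** 2
        history.append(total_error / len(samples))
    return weight, bias, history

# Toy data generated from y = 2x + 1 (an assumption for the example).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias, history = train(samples)

print(f"first epoch error: {history[0]:.4f}, last epoch error: {history[-1]:.8f}")
print(f"learned weight={weight:.3f}, bias={bias:.3f}")
```

Nobody adjusts the parameters by hand here; the error signal alone drives each correction, which is the contrast with manually updated classifiers drawn in the text. Real networks stack many such units and propagate the error back through every layer.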
But as the saying goes, smart things come with their own baggage. It's a risk we would have to be ready to tackle if we want to create more accurate models and smarter systems.

Reinforcement Learning

Reinforcement learning is an interesting case of machine learning where simple neural networks are connected and together they interact with the environment to learn from their mistakes and rewards. The iteration introduced here happens in a complex way, in the form of a reward or punishment for arriving at the correct or wrong result respectively. After each interaction of this kind, the multilayered neural networks incorporate the feedback, and then recreate the models for better accuracy. The reward and punishment method puts it in a space where it is neither supervised nor unsupervised, but exhibits traits of both, and it has the added advantage of producing more accurate results. The con here is that the models are complex by design. Multilayered neural networks are difficult to handle over multiple iterations because each layer might respond differently to a certain reward or punishment. As such, it may create an inner conflict that might lead to a stalled system - one that can't decide which direction to move next.

Some Practical Implementations of Iteration

Many modern day machine learning platforms and frameworks have implemented the iteration process on their own to create better data models; Apache Spark and Hadoop MapReduce are two such examples. The way the two implement iteration is technically different and each has its merits and limitations. Let's look at MapReduce. It reads and writes data directly onto the HDFS filesystem present on the disk. Note that reading from and writing to the disk on every iteration takes significant time. This in a way creates a more robust and fault tolerant system but compromises on speed.
On the other hand, Apache Spark stores the data in memory, i.e. in RAM, as Resilient Distributed Datasets. As a result, each iteration takes much less time, which enables Spark to perform lightning fast data processing. But the primary problem with the Spark way of doing iteration is that dynamic memory or RAM is much less reliable than disk for storing iteration data and performing complex operations. Hence it's much less fault tolerant than MapReduce.

Bringing it together

To sum up the discussion, we can look at the process of iteration and its stages in implementing machine learning models roughly as follows:

Parameter Iteration: This is the first and inherent stage of iteration for any algorithm. The parameters involved in a certain algorithm are run multiple times and the best fitting parameters for the model are finalized in this process.

Data Iteration: Once the model parameters are finalized, the data is put into the system and the model is simulated. Multiple sets of data are put into the system to check the parameters' effectiveness in bringing out the desired result. If the data iteration stage suggests that some of the parameters are not well suited for the model, then they are taken back to the parameter iteration stage and parameters are added or modified.

Model Iteration: After the initial parameters and data sets are finalized, the model testing/training happens. The iteration in the model testing phase is all about running the same model simulation multiple times with the same parameters and data set, and then checking the amount of error. If the error varies significantly in every iteration, then there is something wrong with either the data or the parameters or both. Iterations are done on the data and parameters until the model achieves accuracy.

Human Iteration: This step involves the human induced iteration where different models are put together to create a fully functional smart system.
Here, multiple levels of fitting and refitting happen to achieve a coherent overall goal such as creating a driverless car system or a fully functional AI.

Iteration is pivotal to creating smarter AI systems in the near future. The enormous memory requirements for performing multiple iterations on complex data sets continue to pose major challenges. But with increasingly better AI chips, storage options and data transfer techniques, these challenges are getting easier to handle. We believe iterative machine learning techniques will continue to lead the transformation of the AI landscape in the near future.
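The Model Iteration stage described above, running the same model several times and checking whether the error swings from run to run, can be sketched as follows. Everything in this example is an assumption made for illustration: the toy dataset, the one-parameter model (a slope fitted through the origin), the 75/25 split and the 0.5 tolerance are hypothetical choices, not values from the article.

```python
import random
from statistics import mean, pstdev

def run_once(data, seed):
    """One model iteration: shuffle, split 75/25, fit a slope through the origin, return test error."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    train, test = shuffled[:cut], shuffled[cut:]
    # Least-squares slope for the model y = slope * x.
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return mean((slope * x - y) ** 2 for x, y in test)

def is_stable(errors, tolerance=0.5):
    """Treat the model as stable when the spread of errors is small relative to their mean."""
    return pstdev(errors) <= tolerance * mean(errors)

# Toy data: y is roughly 3x with a small alternating offset.
data = [(x, 3.0 * x + (0.5 if x % 2 else -0.5)) for x in range(1, 21)]
errors = [run_once(data, seed) for seed in range(5)]

print("errors per run:", [round(e, 4) for e in errors])
print("stable across runs:", is_stable(errors))
```

If `is_stable` reports False, the text's advice applies: go back to the data or the parameters before trusting the model. The threshold itself is a judgment call that depends on the problem.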


Aaron Lazar

23 Apr 2018

5 min read

Why is Hadoop dying?


Hadoop has been the definitive big data platform for some time. The name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why is it? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful?

Hadoop was essential to the growth of big data

When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to data, rather than bringing data to compute. Organisations had the opportunity to scale their data without having to worry too much about the cost. It obviously had initial hiccups with security, the complexity of querying and querying speeds, but all that was taken care of in the long run. Querying speeds remained quite a pain, but that wasn't the real reason behind Hadoop dying (slowly).

As cloud grew, Hadoop started falling

One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. The cloud vendor market was pretty crowded, and each vendor provided its own big data processing services. These services all basically did what Hadoop was doing. But they also did it in an even more efficient and hassle-free way. Customers didn't have to think about administration, security or maintenance in the way they had to with Hadoop.

One person's big data is another person's small data

Well, this is clearly a fact. Several organisations that adopted big data technologies without really gauging the amount of data they would actually need to process have suffered. Imagine sitting with 10TB Hadoop clusters when you don't have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their heavy reliance on Hadoop.
Customers weren't pleased with the nature of Hadoop's limitations.

Apache Hadoop v Apache Spark

Hadoop is way behind in terms of processing speed. In 2014 Spark took the world by storm. I'm going to let you guess which line in the graph above might be Hadoop, and which might be Spark. Spark was a general purpose, easy to use platform that was built after studying the pitfalls of Hadoop. Spark was not bound to just HDFS (Hadoop Distributed File System), which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes, a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas where Spark wins are iterative algorithms in machine learning, interactive data mining and data processing, stream processing, and sensor data processing.

Machine Learning in Hadoop is not straightforward

Unlike Spark, which ships with MLlib, Hadoop cannot do machine learning unless it is tied to a 3rd party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of 3 R packages, have grown for ML, but they are still nowhere near as powerful as the modern day MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe.

Hadoop is not only Hadoop

The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, HBase, Spark, Kudu, Impala, and 20 other products. While folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other.
As for Cloudera and Hortonworks, though the market has seen a downward trend, they're in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark.

Is Hadoop dying? Perhaps not...

In the long run, it's not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much improved version of the framework. Some of the most noteworthy features are its improved shell script, more powerful YARN, improved fault tolerance with erasure coding, and many more. Although that hasn't caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use an alternative like Spark along with another framework from the Hadoop family. So, Hadoop's not going away anytime soon.


Packt Editorial Staff

29 Mar 2018

4 min read

The evolution of cybercrime


A history of cybercrime

As computer systems have now become integral to the daily functioning of businesses, organizations, governments, and individuals, we have learned to put a tremendous amount of trust in these systems. As a result, we have placed incredibly important and valuable information on them. History has shown that things of value will always be a target for a criminal. Cybercrime is no different. As people flood their personal computers, phones, and so on with valuable data, they put a target on that information for the criminal to aim for, in order to gain some form of profit from the activity. In the past, in order for a criminal to gain access to an individual's valuables, they would have to conduct a robbery in some shape or form. In the case of data theft, the criminal would need to break into a building and sift through files looking for the information of greatest value and profit. In our modern world, the criminal can attack their victims from a distance, and due to the nature of the internet, these acts would most likely never meet retribution.

Cybercrime in the 70s and 80s

In the 70s, we saw criminals taking advantage of the tone system used on phone networks. The attack was called phreaking, where the attacker reverse-engineered the tones used by the telephone companies to make long distance calls. In 1988, the first computer worm made its debut on the internet and caused a great deal of destruction to organizations. This first worm was called the Morris worm, after its creator Robert Morris. While this worm was not originally intended to be malicious, it still caused a great deal of damage. The U.S. Government Accountability Office later estimated that the damage could have been as high as $10 million. 1989 brought us the first known ransomware attack, which targeted the healthcare industry.
Ransomware is a type of malicious software that locks a user's data until a ransom is paid, which results in the issuance of a cryptographic unlock key. In this attack, an evolutionary biologist named Joseph Popp distributed 20,000 floppy disks across 90 countries, and claimed the disk contained software that could be used to analyze an individual's risk factors for contracting the AIDS virus. The disk, however, contained a malware program that, when executed, displayed a message requiring the user to pay for a software license. Ransomware attacks have evolved greatly over the years, with the healthcare field still being a very large target.

The birth of the web and a new dawn for cybercrime

The 90s brought the web browser and email to the masses, which meant new tools for cybercriminals to exploit. This allowed the cybercriminal to greatly expand their reach. Until this time, the cybercriminal needed to initiate a physical transaction, such as providing a floppy disk. Now cybercriminals could transmit virus code over the internet to these new, highly vulnerable web browsers. Cybercriminals took what they had learned previously and modified it to operate over the internet, with devastating results. Cybercriminals were also able to reach out and con people from a distance with phishing attacks. No longer was it necessary to engage with individuals directly. You could attempt to trick millions of users simultaneously. Even if only a small percentage of people took the bait, you stood to make a lot of money as a cybercriminal.

The 2000s brought us social media and saw the rise of identity theft. A bullseye was painted for cybercriminals with the creation of databases containing millions of users' personally identifiable information (PII), making identity theft the new financial piggy bank for criminal organizations around the world.
This information, coupled with a lack of cybersecurity awareness among the general public, allowed cybercriminals to commit all types of financial fraud, such as opening bank accounts and credit cards in the names of others.

Cybercrime in a fast-paced technology landscape

Today we see that cybercriminal activity has only gotten worse. As computer systems have become faster and more complex, the cybercriminal has become more sophisticated and harder to catch. Today we have botnets, which are networks of private computers that are infected with malicious software and allow the criminal element to control millions of infected computer systems across the globe. These botnets allow the criminal element to overload organizational networks and hide the origin of the criminals:

We see constant ransomware attacks across all sectors of the economy
People are constantly on the lookout for identity theft and financial fraud
There are continuous news reports about the latest point of sale attack against major retailers and hospitality organizations

This is an extract from Information Security Handbook by Darren Death. Follow Darren on Twitter: @DarrenDeath.


Melisha Dsouza

07 Feb 2019

15 min read

Vulnerabilities in the Application and Transport Layer of the TCP/IP stack


The Transport layer is responsible for end-to-end data communication and acts as an interface for network applications to access the network. This layer also takes care of error checking, flow control, and verification in the TCP/IP protocol suite. The Application Layer handles the details of a particular application and performs three main tasks: formatting data, presenting data, and transporting data. In this tutorial, we will explore the different types of vulnerabilities in the Application and Transport Layer.

This article is an excerpt from a book written by Glen D. Singh and Rishi Latchmepersad titled CompTIA Network+ Certification Guide. This book covers all CompTIA certification exam topics in an easy-to-understand manner, along with plenty of self-assessment scenarios for better preparation. It will not only prepare you conceptually but will also help you pass the N10-007 exam.

Vulnerabilities in the Application Layer

The following are some of the application layer protocols which we should pay close attention to in our network:

File Transfer Protocol (FTP)
Telnet
Secure Shell (SSH)
Simple Mail Transfer Protocol (SMTP)
Domain Name System (DNS)
Dynamic Host Configuration Protocol (DHCP)
Hypertext Transfer Protocol (HTTP)

Each of these protocols was designed to provide the function it was built for, with a lesser focus on security. Malicious users and hackers are able to compromise both the applications that utilize these protocols and the network protocols themselves.

Cross Site Scripting (XSS)

XSS focuses on exploiting a weakness in websites. In an XSS attack, the malicious user or hacker injects client-side scripts into a web page/site that a potential victim would trust. The scripts can be JavaScript, VBScript, ActiveX, HTML, or even Flash, and they are executed on the victim's system. These scripts are masked as legitimate requests between the web server and the client's browser.
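To make the injection step concrete, here is a minimal sketch in Python of how a page that echoes user input verbatim ends up carrying an attacker's script, and how escaping the same input leaves it as harmless text. The function names render_comment_unsafe and render_comment_escaped and the payload are hypothetical, chosen only for illustration. Output escaping is one common defence and is an addition here, not something the book excerpt describes.

```python
import html

def render_comment_unsafe(comment: str) -> str:
    # Hypothetical page fragment that trusts user input as-is,
    # so any markup in the comment becomes part of the page.
    return "<p>" + comment + "</p>"

def render_comment_escaped(comment: str) -> str:
    # html.escape converts <, >, & and quotes into HTML entities,
    # so the browser displays the text instead of running it.
    return "<p>" + html.escape(comment) + "</p>"

# Illustrative payload an attacker might submit as a "comment".
payload = "<script>alert('xss')</script>"

print(render_comment_unsafe(payload))   # script tag survives intact
print(render_comment_escaped(payload))  # script tag is neutralised
```

In the first output the script tag reaches the browser untouched, which is the situation the XSS description above relies on. In the second, the angle brackets arrive as entities and the browser shows them as text.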
XSS focuses on the following:

Redirecting a victim to a malicious website/server
Using hidden iframes and pop-up messages on the victim's browser
Data manipulation
Data theft
Session hijacking

Let's take a deeper look at what happens in an XSS attack:

1. An attacker injects malicious code into a web page/site that a potential victim trusts. A trusted site can be a favorite shopping website, social media platform, or school or university web portal.
2. A potential victim visits the trusted site. The malicious code interacts with the victim's web browser and executes. The web browser is usually unable to determine whether the scripts are malicious or not and therefore still executes the commands.
3. The malicious scripts can be used to obtain cookie information, tokens, session information, and so on about other websites that the browser has stored information about.
4. The acquired details (cookies, tokens, session IDs, and so on) are sent back to the hacker, who in turn uses them to log in to the sites that the victim's browser has visited.

There are two types of XSS attacks:

Stored XSS (persistent)
Reflected XSS (non-persistent)

Stored XSS (persistent): In this attack, the attacker injects a malicious script directly into the web application or website. The script is stored permanently on the page, so when a potential victim visits the compromised page, the victim's web browser parses all the code of the web page/application without any visible problem. Afterward, the script is executed in the background without the victim's knowledge. At this point, the script is able to retrieve session cookies, passwords, and any other sensitive information stored in the user's web browser, and it sends the loot back to the attacker in the background.

Reflected XSS (non-persistent): In this attack, the attacker usually sends an email with the malicious link to the victim.
When the victim clicks the link, it is opened in the victim's web browser (reflected), and at this point, the malicious script is invoked and begins to retrieve the loot (passwords, credit card numbers, and so on) stored in the victim's web browser.

SQL injection (SQLi)

SQLi attacks focus on passing SQL commands into an SQL database that does not validate the user input. The attacker attempts to gain unauthorized access to a database, either by creating or retrieving information stored in the database application. Nowadays, attackers are not only interested in gaining access, but also in retrieving (stealing) information and selling it to others for financial gain.

SQLi can be used to perform:

Authentication bypass: Allows the attacker to log in to a system without a valid user credential
Information disclosure: Retrieves confidential information from the database
Compromise of data integrity: The attacker is able to manipulate information stored in the database

Lightweight Directory Access Protocol (LDAP) injection

LDAP is designed to query and update directory services, such as Microsoft Active Directory. LDAP uses both TCP and UDP port 389, and LDAP over SSL/TLS (LDAPS) uses port 636. In an LDAP injection attack, the attacker exploits vulnerabilities within a web application that constructs LDAP messages or statements based on user input. If the receiving application does not validate or sanitize the user input, this increases the possibility of manipulating LDAP messages.

Cross-Site Request Forgery (CSRF)

This attack is somewhat similar to the previously mentioned XSS attack. In a CSRF attack, the victim's machine/browser is forced to execute malicious actions against a website with which the victim has been authenticated (a website that trusts the actions of the user).

To have a better understanding of how this attack works, let's visualize a potential victim, Bob.
On a regular day, Bob visits some of his favorite websites, such as various blogs, social media platforms, and so on, where he usually logs in automatically to view the content. Once Bob logs in to a particular website, the website automatically trusts the transactions between itself and the authenticated user, Bob. One day, he receives an email from the attacker, but unfortunately Bob does not realize the email is a phishing/spam message and clicks on the link within the body of the message. His web browser opens the malicious URL in a new tab.

The attack would cause Bob's machine/web browser to invoke malicious actions on the trusted website; the website would see all the requests as originating from Bob. The return traffic, such as the loot (passwords, credit card details, user account, and so on), would be returned to the attacker.

Session hijacking

When a user visits a website, a cookie is stored in the user's web browser. Cookies are used to track the user's preferences and manage the session while the user is on the site. While the user is on the website, a session ID is also set within the cookie, and this information may be persistent, which allows a user to close the web browser, revisit the same website later, and be logged in automatically. However, the web developer can set how long the information persists, whether it expires after an hour or a week, depending on the developer's preference.

In a session hijacking attack, the attacker attempts to obtain the session ID while it is being exchanged between the potential victim and the website. The attacker can then use the victim's session ID on the website to gain access to the victim's session, and through it to the victim's user account and so on.

Cookie poisoning

A cookie stores information about a user's preferences while he/she is visiting a website.
Cookie poisoning is when an attacker modifies a victim's cookie, which is then used to gain confidential information about the victim, such as his/her identity.

DNS Distributed Denial-of-Service (DDoS)

A DDoS attack can occur against a DNS server. Attackers sometimes target Internet Service Provider (ISP) networks, public and private Domain Name System (DNS) servers, and so on to prevent legitimate users from accessing the service. If a DNS server is unable to handle the number of requests coming in, its performance will gradually degrade until it either stops responding or crashes. The result is a denial of service for the server's legitimate users.

Registrar hijacking

Whenever a person wants to purchase a domain, the person has to complete the registration process at a domain registrar. Attackers try to compromise users' accounts on various domain registrar websites in the hope of taking control of the victims' domain names. With a domain name, multiple DNS records can be created or modified to direct incoming requests to a specific device. If a hacker modifies the A record on a domain to redirect all traffic to a compromised or malicious server, anyone who visits the compromised domain will be redirected to the malicious website.

Cache poisoning

Whenever a user visits a website, the process of resolving a host name to an IP address occurs in the background. The resolved data is stored within the local system in a cache area. The attacker can compromise this temporary storage area and manipulate any further resolution done by the local system.

Typosquatting

McAfee outlined typosquatting, also known as URL hijacking, as a type of cyber-attack in which an attacker creates a domain name very close to a company's legitimate domain name, in the hope of tricking victims into visiting the fake website, either to steal their personal information or to distribute a malicious payload to the victim's system.
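One way to picture a domain name that is "very close to" a legitimate one is edit distance: the number of single-character insertions, deletions, or substitutions separating two strings. The sketch below is only an illustration under stated assumptions: the helper names, the threshold of two edits, and the example.com domains are all hypothetical, and real lookalike detection would also need to consider homoglyphs and alternative top-level domains.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance, computed one row at a time.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]

def looks_like_typosquat(candidate: str, legitimate: str,
                         max_distance: int = 2) -> bool:
    # A lookalike is different from the real name, but only by a few edits.
    # max_distance = 2 is an assumed threshold, not an established rule.
    distance = edit_distance(candidate.lower(), legitimate.lower())
    return 0 < distance <= max_distance

# Hypothetical domains used purely for illustration.
print(looks_like_typosquat("exampel.com", "example.com"))
print(looks_like_typosquat("example.com", "example.com"))
```

The first call flags the swapped letters as a near match, while the second returns False because an identical name is the legitimate domain itself.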
Let's take a look at a simple example of this type of attack. In this scenario, we have a user, Bob, who frequently uses the Google search engine to find his way around the internet. Since Bob uses the www.google.com website often, he sets it as the homepage in his web browser, so each time he opens the application or clicks the Home icon, www.google.com is loaded onto the screen. One day Bob decides to use another computer, and the first thing he does is set his favorite search engine URL as his home page. However, he types www.gooogle.com and doesn't realize it. Whenever Bob visits this website, it looks like the real website. Because the mistyped domain resolves to a working website, Bob has no reason to suspect anything, and this is how typosquatting works.

It's always recommended to use a trusted search engine to find the URL of the website you want to visit. Trusted internet search engine companies focus on blacklisting malicious and fake URLs in their search results to help protect internet users such as yourself.

Vulnerabilities at the Transport Layer

In this section, we are going to discuss various weaknesses that exist within the underlying protocols of the Transport Layer.

Fingerprinting

In the cybersecurity world, fingerprinting is used to discover open ports and the services that are running on the target system. From a hacker's point of view, fingerprinting is done before the exploitation phase: the more information a hacker can obtain about a target, the more the hacker can narrow the attack scope and use specific tools to increase the chances of successfully compromising the target machine.

This technique is also used by system/network administrators, network security engineers, and cybersecurity professionals alike. Imagine you're a network administrator assigned to secure a server; apart from applying system hardening techniques such as patching and configuring access controls, you would also need to check for any open ports that are not being used.
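The open-port check described above can be sketched with Python's standard socket module. This is a minimal illustration, not a replacement for a real scanner: the helper name is_port_open is hypothetical, the example probes only the loopback address, and it opens its own throwaway listener so that there is a known-open port to find. Only probe hosts you are authorized to test.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    # connect_ex returns 0 when the TCP connection is established
    # and an error code otherwise (refused, timed out, and so on).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        return probe.connect_ex((host, port)) == 0

# Throwaway listener on loopback, so the check has something to find.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

# While the listener exists, the port should be reported as open.
print(open_port, is_port_open("127.0.0.1", open_port))

# After the listener is closed, the same port should be reported as closed.
listener.close()
print(open_port, is_port_open("127.0.0.1", open_port))
```

A full TCP connection like this is the simplest possible probe and is easy for the target to log. Dedicated tools use many additional probe types to identify the service behind each port.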
Let's take a look at a more practical approach to fingerprinting in the computing world. We have a target machine, 10.10.10.100, on our network. As a hacker or a network security professional, we would like to know which TCP and UDP ports are open, the services that use the open ports, and the service daemons running on the target system. In the following screenshot, we've used nmap to help us discover the information we are seeking. The Nmap tool delivers specially crafted probes to a target machine.

[Screenshot: nmap scan of the target machine 10.10.10.100]

Enumeration

In a cyber attack, the hacker uses enumeration techniques to extract information about the target system or network. This information will aid the attacker in identifying system attack points. The following are the various network services and ports that stand out for a hacker:

Port 53: DNS zone transfer and DNS enumeration
Port 135: Microsoft RPC Endpoint Mapper
Port 25: Simple Mail Transfer Protocol (SMTP)

DNS enumeration

DNS enumeration is where an attacker attempts to determine whether there are other servers or devices that carry the domain name of an organization. Let's take a look at how DNS enumeration works.

Imagine we are trying to find all the publicly available servers Google has on the internet. Using the host utility in Linux and specifying a hostname, host www.google.com, we can see the IP address 172.217.6.196 has been resolved successfully. This means there's an active device with a host name of www.google.com. Furthermore, if we attempt to resolve the host name gmail.google.com, another IP address is presented, but when we attempt to resolve mx.google.com, no IP address is given. This is an indication that there isn't an active device with the mx.google.com host name.

[Screenshot: output of the host utility for www.google.com, gmail.google.com, and mx.google.com]

DNS zone transfer

DNS zone transfer allows the copying of the master file from a DNS server to another DNS server.
There are times when administrators do not configure the security settings on their DNS server properly, which allows an attacker to retrieve the master file containing a list of the names and addresses of a corporate network.

Microsoft RPC Endpoint Mapper

Not too long ago, CVE-2015-2370 was recorded on the CVE database. This vulnerability took advantage of the authentication implementation of the Remote Procedure Call (RPC) protocol in various versions of the Microsoft Windows platform, both desktop and server operating systems. A successful exploit would allow an attacker to gain local privileges on a vulnerable system.

SMTP

SMTP is used in mail servers, as are the Post Office Protocol (POP) and the Internet Message Access Protocol (IMAP). SMTP is used for sending mail, while POP and IMAP are used to retrieve mail from an email server. SMTP supports various commands, such as EXPN and VRFY. The EXPN command can be used to expand a mailing list and reveal the mailboxes behind it, while the VRFY command can be used to validate a username on a mail server.

An attacker can establish a connection between the attacker's machine and the mail server on port 25. Once a successful connection has been established, the server sends a banner back to the attacker's machine displaying the server name and the status of the port (open). The attacker can then use the VRFY command followed by a user name, for example VRFY bob, to check for a valid user on the mail system.

SYN flooding

One of the protocols that exist at the Transport Layer is TCP. TCP is used to establish a connection-oriented session between two devices that want to communicate or exchange data. Let's recall how TCP works. There are two devices that want to exchange some messages, Bob and Alice. Bob sends a TCP Synchronization (SYN) packet to Alice, and Alice responds to Bob with a TCP Synchronization/Acknowledgment (SYN/ACK) packet. Finally, Bob replies with a TCP Acknowledgement (ACK) packet.
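The exchange between Bob and Alice described above can be modelled in a few lines of Python. This is a toy model only: no packets are sent, the Segment class and three_way_handshake function are hypothetical names, the initial sequence numbers are arbitrary, and real TCP stacks choose them unpredictably and wrap them at 32 bits.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    flags: str  # "SYN", "SYN/ACK" or "ACK"
    seq: int    # sender's sequence number
    ack: int    # next sequence number expected from the peer (0 if unused)

def three_way_handshake(client_isn: int, server_isn: int) -> list[Segment]:
    # Step 1: Bob (the client) proposes his initial sequence number.
    syn = Segment("SYN", seq=client_isn, ack=0)
    # Step 2: Alice (the server) acknowledges it and proposes her own.
    syn_ack = Segment("SYN/ACK", seq=server_isn, ack=syn.seq + 1)
    # Step 3: Bob acknowledges Alice's number; the session is established.
    ack = Segment("ACK", seq=syn.seq + 1, ack=syn_ack.seq + 1)
    return [syn, syn_ack, ack]

# Arbitrary starting numbers, chosen only to keep the output readable.
for segment in three_way_handshake(client_isn=1000, server_isn=5000):
    print(segment)
```

Each side acknowledges the other's sequence number plus one. After step 2 the server is holding state for a connection that is not yet complete, which is the window a SYN flood takes advantage of.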
[Diagram: the TCP 3-Way Handshake between Bob and Alice]

For every TCP SYN packet received on a device, a TCP SYN/ACK packet is sent back in response. One type of attack that takes advantage of this design flaw in TCP is known as a SYN Flood attack. In a SYN Flood attack, the attacker sends a continuous stream of TCP SYN packets to a target system. This causes the target machine to process each individual packet and respond accordingly; eventually, with the high influx of TCP SYN packets, the target system becomes overwhelmed and stops responding to any requests.

TCP reassembly and sequencing

During a TCP transmission of data between two devices, each packet is tagged with a sequence number by the sender. This sequence number is used to reassemble the packets back into data. During transmission, each packet may take a different path to the destination, so the packets may be received out of order rather than in the order in which the sender put them on the wire.

An attacker can attempt to guess the sequence numbers of packets and inject malicious packets into the network destined for the target. When the target receives the packets, the receiver would assume they came from the real sender, as they would contain the appropriate sequence numbers and a spoofed IP address.

Summary

In this article, we have explored the different types of vulnerabilities that exist at the Application and Transport Layer of the TCP/IP protocol suite. To understand other networking concepts such as network architecture, security, network monitoring, and troubleshooting, and to ace the CompTIA certification exam, check out our book CompTIA Network+ Certification Guide.


Sunith Shetty

23 Aug 2018

9 min read

Why TensorFlow always tops machine learning and artificial intelligence tool surveys


TensorFlow is an open source machine learning framework for carrying out high-performance numerical computations. It provides excellent architecture support, which allows easy deployment of computations across a variety of platforms ranging from desktops to clusters of servers, mobiles, and edge devices. Have you ever wondered why TensorFlow has become so popular in such a short span of time? What makes TensorFlow so special that we are seeing a huge surge of developers and researchers opting for it?

Interestingly, when it comes to artificial intelligence framework showdowns, you will find TensorFlow emerging as a clear winner most of the time. The major credit goes to its soaring popularity and the contributions across various forums such as GitHub, Stack Overflow, and Quora. TensorFlow is used in over 6,000 open source repositories, showing its roots in many real-world research projects and applications.

How TensorFlow came to be

The library was developed by a group of researchers and engineers from the Google Brain team within Google's AI organization. They wanted a library that provides strong support for machine learning, deep learning, and advanced numerical computations across different scientific domains. Since Google open sourced its machine learning framework in 2015, TensorFlow has grown in popularity, with more than 1,500 project mentions on GitHub.

The constant updates made to the TensorFlow ecosystem are the real cherry on the cake. They have ensured that the new challenges developers and researchers face are addressed, easing complex computations and providing newer features, promises, and performance improvements with the support of high-level APIs.

By open sourcing the library, the Google research team has received all the benefits of a huge set of contributors outside its existing core team.
Their idea was to make TensorFlow popular by open sourcing it, so that new research ideas would be implemented in TensorFlow first, allowing Google to productize those ideas.

What makes TensorFlow different from the rest?

With more and more research and real-life use cases going mainstream, we can see a big trend of programmers and developers flocking towards TensorFlow. Its popularity is quite evident, with big names adopting TensorFlow for carrying out artificial intelligence tasks. Many popular companies such as NVIDIA, Twitter, Snapchat, Uber, and more are using TensorFlow for their major operations and research areas.

On the one hand, someone can make a case that TensorFlow's popularity is based on its origins/legacy. Having been developed under the house of Google, TensorFlow enjoys the reputation of a household name. There's no doubt that TensorFlow has been better marketed than some of its competitors.

[Figure source: The Data Incubator]

However, that's not the full story. There are many other compelling reasons why companies from small scale to large scale prefer TensorFlow over other machine learning tools.

TensorFlow key functionalities

TensorFlow provides an accessible and readable syntax, which is essential for making these programming resources easier to use. Complex syntax is the last thing developers need to deal with, given machine learning's advanced nature.

TensorFlow provides excellent functionalities and services when compared to other popular deep learning frameworks. These high-level operations are essential for carrying out complex parallel computations and for building advanced neural network models.

TensorFlow is a low-level library, which provides more flexibility. Thus you can define your own functionalities or services for your models.
This is a very important parameter for researchers because it allows them to change the model based on changing user requirements.

TensorFlow provides more network control, allowing developers and researchers to understand how operations are implemented across the network. They can always keep track of new changes made over time.

Distributed training

The trend of distributed deep learning began in 2017, when Facebook released a paper showing a set of methods to reduce the training time of a convolutional neural network model. The test was done with the ResNet-50 model on the ImageNet dataset, which took one hour to train instead of two weeks, using 256 GPUs spread over 32 servers. This revolutionary test opened the gates for much research work that has massively reduced experimentation time by running many tasks in parallel on multiple GPUs.

Google's distributed TensorFlow has allowed researchers and developers to scale out complex distributed training using built-in methods and operations that optimize distributed deep learning among servers. The distributed TensorFlow engine, which is part of the regular TensorFlow repo, works exceptionally well with TensorFlow's existing operations and functionalities. It has allowed exploring two of the most important distributed methods:

Distributing the training of a neural network model over many servers to reduce the training time
Searching for good hyperparameters by running parallel experiments over multiple servers

Google has given the distributed TensorFlow engine the power required to steal market share from other distributed projects such as Microsoft's CNTK, AMPLab's SparkNet, and CaffeOnSpark. Even though the competition is tough, Google has still managed to become more popular than the other alternatives in the market.
From research to production

Google has, in some ways, democratized deep learning. The key reason is TensorFlow's high-level APIs, which make deep learning accessible to everyone. TensorFlow provides pre-built functions and advanced operations to ease the task of building different neural network models. It provides the required infrastructure and hardware support, which makes it one of the leading libraries used extensively by researchers and students in the deep learning domain.

In addition to research tools, TensorFlow extends its services by bringing models into production using TensorFlow Serving. It is specifically designed for production environments and provides a flexible, high-performance serving system for machine learning models. It provides the functionalities and operations that make it easy to deploy new algorithms and experiments as requirements and preferences change. It offers out-of-the-box integration with TensorFlow models and can be easily extended to serve other types of models and data.

TensorFlow's API is a complete package that is easy to use and read, and it provides helpful operators, debugging and monitoring tools, and deployment features. This has led to growing use of the TensorFlow library as a complete package within the ecosystem by the emerging body of students, researchers, developers, and production engineers from various fields who are gravitating towards artificial intelligence.

There is a TensorFlow for web, mobile, edge, embedded and more

TensorFlow provides a range of services and modules within its existing ecosystem, making it one of the ground-breaking end-to-end tools for state-of-the-art deep learning.

TensorFlow.js for machine learning on the web

A JavaScript library for training and deploying machine learning models in the browser. This library provides flexible and intuitive APIs to build models from scratch, or to train new and pre-existing models, right in the browser or under Node.js.
TensorFlow Lite for mobile and embedded ML

TensorFlow Lite is a lightweight TensorFlow solution for mobile and embedded devices. It is fast, since it enables on-device machine learning inference with low latency. It supports hardware acceleration with the Android Neural Networks API. Future releases of TensorFlow Lite will bring more built-in operators and performance improvements, and will support more models to simplify the developer's experience of bringing machine learning services to mobile devices.

TensorFlow Hub for reusable machine learning

A library which is used extensively to reuse machine learning models. It lets you apply transfer learning by reusing parts of existing machine learning models.

TensorBoard for visual debugging

While training a complex neural network model, the computations you use in TensorFlow can be very confusing. TensorBoard makes it easier to understand and debug your TensorFlow programs through visualizations. It allows you to inspect and understand your TensorFlow runs and graphs.

Sonnet

Sonnet is a DeepMind library built on top of TensorFlow and used extensively to build complex neural network models.

All of these factors have made the TensorFlow library immensely appealing for building a wide spectrum of machine learning and deep learning projects. This tool has become a preferred choice for everyone from space research giant NASA and other confidential government agencies to an impressive roster of private sector giants.

Road Ahead for TensorFlow

TensorFlow is, no doubt, better marketed than the other deep learning frameworks. The community appears to be moving very fast. In any given hour, there are approximately 10 people around the world contributing to or improving the TensorFlow project on GitHub. TensorFlow dominates the field with the largest active community. It will be interesting to see what new advances TensorFlow and other utilities make possible for the future of our digital world.
Continuing the recent trend of rapid updates, the TensorFlow team is making sure it addresses the current and active challenges faced by contributors and developers while building machine learning and deep learning models.

TensorFlow 2.0 will be a major update, and we can expect the release candidate by early March next year. The preview version of this major milestone is expected to arrive later this year. The major focus will be on ease of use and additional support for more platforms and languages, with eager execution as the central feature of TensorFlow 2.0. This breakthrough version will add more functionalities and operations to handle current research areas such as reinforcement learning and GANs, and to build advanced neural network models more efficiently.

Google will continue to invest in and upgrade its existing TensorFlow ecosystem. According to Google's CEO, Sundar Pichai, "artificial intelligence is more important than electricity or fire." TensorFlow is the solution Google has come up with to bring artificial intelligence into reality and provide a stepping stone to revolutionize humankind.
