
How-To Tutorials - High Performance

39 Articles

Quantum Computing is poised to take a quantum leap with industries and governments on its side

Natasha Mathur
26 Jul 2018
9 min read
"We're really just, I would say, six months out from having quantum computers that will outperform classical computers in some capacity." -- Joseph Emerson, CEO, Quantum Benchmark

Very few people would say that their lives haven't been changed by computers. But there are certain tasks that even a computer can't solve efficiently, and that's where quantum computers come into the picture. They are incredibly powerful machines that process information at a subatomic level, leveraging the logic and principles of quantum mechanics. Quantum computers can operate a million times faster than the devices you currently use. They are a whole different breed of machine that promises to revolutionize computing.

How do quantum computers differ from traditional computers?

To start with, let's look at what these machines look like. Unlike your mobile or desktop devices, quantum computers won't fit in your pocket or sit on a desk, at least not for the next few years. They are fragile machines that look like vacuum cells or tubes with a bunch of lasers shining into them, and the whole apparatus must be kept at temperatures near absolute zero at all times.

(Image: IBM Research)

Secondly, the underlying mechanics of how they compute differ. Classical digital computing requires data to be encoded into binary digits (bits), each of which is always in one of two definite states (0 or 1). In quantum computing, the data units, called quantum bits or qubits, can exist in more than one state at a time, in "superpositions" of states. Two particles can also exhibit "entanglement", where changing the state of one may instantaneously affect the other.

Thirdly, consider what makes up these machines, i.e., their hardware and software components. A regular computer's hardware includes components such as a CPU, a monitor, and routers for communicating across the network. The software includes system programs and protocols that run on top of the seven networking layers: the physical, data-link, network, transport, session, presentation, and application layers. A quantum computer also consists of hardware and software components. On the hardware side there are ion traps, semiconductors, vacuum cells, Josephson junctions, and wire loops. The software and protocols within a quantum computer are divided into five layers: the physical layer, the virtual layer, the Quantum Error Correction (QEC) layer, the logical layer, and the application layer.

(Figure: layered quantum computer architecture)

Quantum Computing: a rapidly growing field

Quantum computing is advancing rapidly and on a large scale. According to a market research report by MarketsandMarkets, the quantum computing market will be worth 495.3 million USD by 2023. As per ABI Research, total revenue generated from quantum computing services will exceed $15 billion by 2028. Matthew Brisse, research VP at Gartner, states: "Quantum computing is heavily hyped and evolving at different rates, but it should not be ignored."
There is no certainty yet about when quantum computing will make the shift from scientific discovery to applications for the average user, but several areas, such as pharmaceuticals, machine learning, and security, are already leveraging its potential. Beyond these industries, countries have lately been getting their feet wet in quantum computing by joining the "quantum arms race", the race to excel in quantum technology.

How is the tech industry driving quantum computing forward?

Both big tech giants like IBM, Google, and Microsoft, and startups like Rigetti, D-Wave, and Quantum Benchmark are slowly but steadily bringing the quantum computing era a step closer to us.

Big Tech's big bets on quantum computing

IBM established a landmark in the quantum computing world back in November 2017 when it announced its first powerful quantum computer that can handle 50 qubits. The company is also working on making its 20-qubit system available through its cloud computing platform, and it is collaborating with startups such as Quantum Benchmark, Zapata Computing, and 1QBit to further accelerate quantum computing.

Google announced Bristlecone, a 72-qubit quantum computer and the largest so far, earlier this year; it promises to provide a testbed for research into the scalability and error rates of qubit technology. Google, along with IBM, plans to commercialize its quantum computing technologies in the next few years.

Microsoft is also working on quantum computing, with plans to build a topological quantum computer, an effort to bring a practical quantum computer to commercial use sooner. Last year it also introduced a language called Q# (Q Sharp), along with tools to help coders develop software for quantum computers. "We want to solve today's unsolvable problems and we have an opportunity with a unique, differentiated technology to do that," said Todd Holmdahl, Corporate VP of Microsoft Quantum.

Startups leading the quantum computing revolution - Quantum Valley, anyone?

Apart from these major tech giants, startups are also stepping up their quantum computing game. D-Wave Systems Inc., a Canadian company, was the first to sell a quantum computer, named D-Wave One, back in 2011, even though its usefulness was limited. These quantum computers are based on adiabatic quantum computation and have been sold to government agencies, aerospace companies, and cybersecurity vendors. Earlier this year, D-Wave Systems announced that it had completed testing a working prototype of its next-generation processor, as well as the installation of a D-Wave 2000Q system for a customer.

Another startup worth mentioning is San Francisco-based Rigetti Computing, which is racing against Google, Microsoft, IBM, and Intel on quantum computing projects and has raised nearly $70 million for the development of quantum computers. According to Chad Rigetti, CEO at Rigetti, "This is going to be a very large industry; every major organization in the world will have to have a strategy for how to use this technology."

Lastly, Quantum Benchmark, another Canadian startup, is collaborating with Google to support the development of quantum computing. Quantum Benchmark's True-Q technology has been integrated with Cirq, Google's latest open-source quantum computing framework. Looking at the efforts these companies are putting into the growth of quantum computing, it's hard to say who will win the quantum race.
But one thing is clear: just as Silicon Valley drove product innovation and software adoption during the early days of computing, having many businesses work together and compete with each other may not be a bad idea for quantum computing.

How are countries preparing for the quantum arms race?

Though quantum computing hasn't yet made its way to the consumer market, its remarkable research progress and immense potential have caught the attention of nations worldwide. China, the U.S., and other great powers are now joining the quantum arms race. Why are these governments so serious about investing in the research and development of quantum computing? The answer is simple: quantum computing will be a huge competitive advantage to the government that achieves breakthroughs with its quantum technology, as it could cripple militaries, topple the global economy, and yield spectacular, rapid solutions to complex problems. It also has the power to revolutionize everything in today's world, from cellular communications and navigation to sensors and imaging.

China continues to leap ahead in the quantum race

China has been consistently investing billions of dollars in quantum computing. From building its first rudimentary quantum computer in 2017 to launching the world's first "unhackable" quantum satellite in 2016, China is clearly excelling in the field. Last year, the Chinese government also committed $10 billion for the construction of the world's biggest quantum research facility, planned to open in 2020. The country intends to develop quantum computers and perform quantum metrology to support its military and national defense efforts.

The U.S. is pushing hard to win the race

The Americans are also upping their game to win the quantum arms race. For instance, the Pentagon is working on applying quantum computing to the future U.S. military, with plans to establish highly secure, encrypted communications for satellites and accurate navigation that does not need GPS signals. In fact, Congress has proposed $800 million in funding for the Pentagon's quantum projects over the next five years. Similarly, the Defense Advanced Research Projects Agency (DARPA) is interested in exploring quantum computing to improve general computing performance as well as artificial intelligence and machine learning systems. Beyond that, the U.S. government spends about $200 million per year on quantum research.

The UK is stepping up its game

The United Kingdom is not far behind in the quantum arms race: it has a $400 million program underway for quantum-based sensing and timing. The European Union is also planning to invest $1 billion, spread over 10 years, in projects involving scientific research and the development of devices for sensing, communication, simulation, and computing. Other powers such as Canada, Australia, and Israel are also acquainting themselves with the exciting world of quantum computing. These governments are confident that whoever gets there first gains an upper hand for life.

The quantum computing revolution has begun

Quantum computing has the power to enable revolutionary breakthroughs, even though quantum computers that can outperform regular computers are not here yet. It is also a complex technology that many people find hard to understand, as the rules of the quantum world differ drastically from those of the everyday physical world.
While all the progress made so far is confined to research labs, the field has a lot of potential, given the intense development and investment going into quantum computing from governments and large corporations across the globe. It's likely that we'll see working prototypes of quantum computers emerge soon. With the quantum revolution here, the possibilities that lie ahead are limitless. Perhaps we could finally crack the mysteries of the Big Bang, or maybe quantum computing powered by AI will just speed up the arrival of the singularity.

Related reading:
Google AI releases Cirq and Open Fermion-Cirq to boost Quantum computation
PyCon US 2018 Highlights: Quantum computing, blockchains, and serverless rule!


Seven wrongs don’t make the one right: Solving a problem with Functional Javascript

Guest Contributor
01 May 2018
14 min read
Functional programming has several advantages for writing better, more understandable, and more maintainable code. It is currently popping up in many languages and frameworks, but let's not delve into the theory right now; instead, let's consider a simple problem and how to solve it in a functional way.

[box type="shadow" align="" class="" width=""]This is a guest post written by Federico Kereki, the author of the book Mastering Functional JavaScript Programming. Federico Kereki is a Uruguayan Systems Engineer with a Master's degree in Education and more than thirty years' experience as a consultant, system developer, university professor, and writer.[/box]

Assume the following situation. You have developed an e-commerce site: the user can fill his shopping cart and, at the end, he must click a BILL ME button so his credit card will be charged. However, the user shouldn't click twice (or more), or he would be billed several times. The HTML part of your application might have something like this, somewhere:

    <button id="billButton" onclick="billTheUser(some, sales, data)">Bill me</button>

And, among the scripts, you'd have something similar to this:

    function billTheUser(some, sales, data) {
      window.alert("Billing the user...");
      // actually bill the user
    }

(Assigning event handlers directly in HTML, the way I did it, isn't recommended. Rather, in unobtrusive fashion, you should assign the handler through code. So... do as I say, not as I do!)

This is a very bare-bones explanation of the problem and your web page, but it's enough for our purposes. Let's now think about ways of avoiding repeated clicks on that button. How can you manage to avoid the user clicking more than once? We'll consider several common solutions for this, and then analyze a functional way of solving the problem. OK, how many ways can you think of to solve our problem? Let's take several solutions one by one and analyze their quality.

Solution #1: hope for the best!

The first so-called solution may seem like a joke: do nothing, tell the user not to click twice, and hope for the best! This is a weasel way of avoiding the problem, but I've seen several websites that just warn the user about the risks of clicking more than once and actually do nothing to prevent the situation. The user got billed twice? We warned him... it's his fault!

    <button id="billButton" onclick="billTheUser(some, sales, data)">Bill me</button>
    <b>WARNING: PRESS ONLY ONCE, DO NOT PRESS AGAIN!!</b>

OK, so this isn't actually a solution; let's move on to more serious proposals.

Solution #2: use a global flag

The solution most people would probably think of first is using a global variable to record whether the user has already clicked the button. You'd define a flag named something like clicked, initialized to false. When the user clicks the button, you check the flag: if clicked is false, you change it to true and execute the function; otherwise, you don't do anything at all.

    let clicked = false;
    // ...
    function billTheUser(some, sales, data) {
      if (!clicked) {
        clicked = true;
        window.alert("Billing the user...");
        // actually bill the user
      }
    }

This obviously works, but it has several problems that must be addressed. You are using a global variable, and you could change its value by accident. Global variables aren't a good idea, neither in JS nor in other languages.
(For more good reasons NOT to use global variables, read http://wiki.c2.com/?GlobalVariablesAreBad.)

You must also remember to re-initialize the flag to false when the user starts buying again. If you don't, the user won't be able to make a second purchase, because paying will have become impossible. You will also have difficulties testing this code, because it depends on something external: the clicked variable. So, this isn't a very good solution either; let's keep thinking!

Solution #3: remove the handler

We may go for a lateral kind of solution: instead of having the function avoid repeated clicks, we might just remove the possibility of clicking altogether.

    function billTheUser(some, sales, data) {
      document.getElementById("billButton").onclick = null;
      window.alert("Billing the user...");
      // actually bill the user
    }

This solution also has some problems. The code is tightly coupled to the button, so you won't be able to reuse it elsewhere. You must remember to reset the handler; otherwise, the user won't be able to make a second purchase. Testing will also be harder, because you'll have to provide some DOM elements.

We can enhance this solution a bit and avoid coupling the function to the button by providing the latter's id as an extra argument in the call. (This idea can also be applied to some of the solutions below.) The HTML part would be:

    <button id="billButton" onclick="billTheUser('billButton', some, sales, data)">
      Bill me
    </button>

(note the extra argument) and the called function would be:

    function billTheUser(buttonId, some, sales, data) {
      document.getElementById(buttonId).onclick = null;
      window.alert("Billing the user...");
      // actually bill the user
    }

This solution is somewhat better. But, in essence, we are still using a global element: not a variable, but the onclick value. So, despite the enhancement, this isn't a very good solution either. Let's move on.

Solution #4: change the handler

A variant of the previous solution would be not to remove the click handler, but to assign a new one instead. We are using functions as first-class objects here when we assign the alreadyBilled() function to the click event:

    function alreadyBilled() {
      window.alert("Your billing process is running; don't click, please.");
    }

    function billTheUser(some, sales, data) {
      document.getElementById("billButton").onclick = alreadyBilled;
      window.alert("Billing the user...");
      // actually bill the user
    }

There's a good point to this solution: if the user clicks a second time, he'll get a pop-up warning not to do that, but he won't be billed again. (From the point of view of the user experience, it's better.) However, this solution still has the very same objections as the previous one (code coupled to the button, the need to reset the handler, harder testing), so we won't consider it to be quite good anyway.

Solution #5: disable the button

A similar idea: instead of removing the event handler, disable the button, so the user won't be able to click it:

    function billTheUser(some, sales, data) {
      document.getElementById("billButton").setAttribute("disabled", "true");
      window.alert("Billing the user...");
      // actually bill the user
    }

This also works, but we still have the same objections as for the previous solutions (coupling the code to the button, the need to re-enable the button, harder testing), so we don't like this solution either.

Solution #6: redefine the handler

Another idea: instead of changing anything in the button, let's have the event handler change itself.
The trick is in the second line: by assigning a new value to the billTheUser variable, we are dynamically changing what the function does! The first time you call the function, it will do its thing, but it will also change itself out of existence by giving its name to a new function:

    function billTheUser(some, sales, data) {
      billTheUser = function() {};
      window.alert("Billing the user...");
      // actually bill the user
    }

There's a special trick in this solution. Functions are global, so the line billTheUser = ... actually changes the function's inner workings; from that point on, billTheUser will be a new (null) function. This solution is still hard to test. Even worse, how would you restore the functionality of billTheUser, setting it back to its original objective?

Solution #7: use a local flag

We can go back to the idea of using a flag, but instead of making it global (which was our main objection), we can use an Immediately Invoked Function Expression (IIFE) that, via a closure, makes clicked local to the function and not visible anywhere else:

    var billTheUser = (clicked => {
      return (some, sales, data) => {
        if (!clicked) {
          clicked = true;
          window.alert("Billing the user...");
          // actually bill the user
        }
      };
    })(false);

In particular, see how clicked gets its initial false value from the call at the end. This solution is along the lines of the global-variable solution, but using a private, local variable is an improvement. The only objection we could find is that you'll have to rework every function that needs to be called only once to work in this fashion. (And, as we'll see below, our FP solution is similar to it in some ways.) OK, it's not too hard to do, but don't forget the Don't Repeat Yourself (D.R.Y.) advice!

A functional, higher-order solution

Now that we've gone through several "normal" solutions, let's try to be more general: after all, requiring that some function or other be executed only once isn't that outlandish, and may be required elsewhere! Let's lay down some principles:

The original function (the one that may be called only once) should do that thing, and no other.
We don't want to modify the original function in any way; so,
We need a new function that will call the original one only once, and
We want a general solution that we can apply to any number of original functions.

(The first principle listed above is the single responsibility principle (the S in S.O.L.I.D.), which states that every function should be responsible for a single functionality. For more on S.O.L.I.D., check the article by "Uncle Bob" (Robert C. Martin, who wrote the five principles) at http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod.)

Can we do it? Yes, and we'll write a higher-order function, which we'll be able to apply to any function, to produce a new function that will work only once. Let's see how! If we don't want to modify the original function, we'll create a higher-order function, which we'll, inspiredly, name once(). This function will receive a function as a parameter and will return a new function, which will work only a single time. (By the way, libraries such as Underscore and Lodash already have a similar function, invoked as _.once(). Ramda also provides R.once(), and most FP libraries include similar functionality, so you wouldn't actually have to program it on your own.)
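For instance, with Lodash already loaded on the page, the library version could simply wrap our handler. This is a minimal sketch, not from the original article, reusing the billTheUser() function and billing data from the earlier examples:

    // _.once(fn) returns a wrapper that invokes fn on the first call only;
    // later calls return the first call's result without calling fn again.
    const billOnce = _.once(billTheUser);
    document
      .getElementById("billButton")
      .addEventListener("click", () => billOnce(some, sales, data));

This sketch also happens to use the "unobtrusive" handler assignment (addEventListener instead of an inline onclick attribute) mentioned near the start of the article.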
Our once() function may seem imposing at first, but as you get accustomed to working in FP fashion, you'll get used to this sort of code and will find it quite understandable:

    const once = fn => {
      let done = false;
      return (...args) => {
        if (!done) {
          done = true;
          fn(...args);
        }
      };
    };

Let's go over some of the finer points of this function:

The first line shows that once() receives a function (fn) as its parameter.
We are defining an internal, private done variable by taking advantage of a closure, as in Solution #7 above. We opted not to call it clicked, as earlier, because you don't necessarily need to click on a button to call the function; we went for a more general term.
The line return (...args) => ... says that once() will return a function, with some (0, 1, or more) parameters. We are using the modern spread syntax; with older versions of JS you'd have to work with the arguments object; see https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Functions/arguments for more on that. The modern way is simpler and shorter!
We assign done = true before calling fn(), just in case that function throws an exception. Of course, if you don't want to disable the function unless it has successfully ended, you could move the assignment to just after the fn() call.
After that setting is done, we finally call the original function. Note the use of the spread operator to pass along whatever parameters the original fn() had.

So, how would we use it? We don't even need to store the newly generated function anywhere; we can simply write the onclick handler as follows:

    <button id="billButton" onclick="once(billTheUser)(some, sales, data)">
      Bill me
    </button>

Pay close attention to the syntax! When the user clicks the button, the function that gets called with the (some, sales, data) arguments isn't billTheUser(), but rather the result of having called once() with billTheUser as a parameter. That result is the one that can be called only a single time. We can run a simple test:

    const squeak = a => console.log(a + " squeak!!");
    squeak("original");        // "original squeak!!"
    squeak("original");        // "original squeak!!"
    squeak("original");        // "original squeak!!"

    const squeakOnce = once(squeak);
    squeakOnce("only once");   // "only once squeak!!"
    squeakOnce("only once");   // no output
    squeakOnce("only once");   // no output

Some varieties

In one of the solutions above, we mentioned that it would be a good idea to do something every time after the first, instead of silently ignoring the user's clicks. We'll write a new higher-order function that takes a second parameter: a function to be called every time from the second call onward:

    const onceAndAfter = (f, g) => {
      let done = false;
      return (...args) => {
        if (!done) {
          done = true;
          f(...args);
        } else {
          g(...args);
        }
      };
    };

We have ventured further into higher-order functions; onceAndAfter() takes two functions as parameters and produces yet a third one that includes the other two within. We could make onceAndAfter() more flexible by giving a default value for g, along the lines of const onceAndAfter = (f, g = () => {}) => ..., so if you didn't want to specify the second function, it would still work fine, because it would then call a do-nothing function. We can do a quick-and-dirty test, along the same lines as we did earlier:

    const squeak = () => console.log("squeak!!");
    const creak = () => console.log("creak!!");
    const makeSound = onceAndAfter(squeak, creak);

    makeSound(); // "squeak!!"
    makeSound(); // "creak!!"
    makeSound(); // "creak!!"
    makeSound(); // "creak!!"

So, we got an even more enhanced function! Just to practice with some more varieties and possibilities, can you solve these extra questions?

Everything has a limit! As an extension of once(), could you write a higher-order function thisManyTimes(fn, n) that would let you call fn() up to n times, but would afterward do nothing? To give an example, once(fn) and thisManyTimes(fn, 1) would produce functions that behave in exactly the same way.

Alternating functions. In the spirit of our onceAndAfter() function, could you write an alternator() higher-order function that gets two functions as arguments and, on each call, alternately calls one and the other? The expected behavior should be as in the following example:

    let sayA = () => console.log("A");
    let sayB = () => console.log("B");

    let alt = alternator(sayA, sayB);
    alt(); // A
    alt(); // B
    alt(); // A
    alt(); // B
    alt(); // A
    alt(); // B

To summarize, we've seen a simple problem based on a real-life situation and, after analyzing several common ways of solving it, we chose a functional-thinking solution. We saw how to apply FP to our problem, and we also found a more general higher-order way that we could apply to similar problems with no further code changes. Finally, we produced an even better solution from the user experience point of view. Functional programming can be applied all throughout your code to enhance and simplify your development. This has been just a simple example using higher-order functions, but FP goes beyond that and provides many more tools for better coding.

[author title="Author Bio"]The author, Federico Kereki, is currently a Subject Matter Expert at Globant, where he gets to use a good mixture of development frameworks, programming tools, and operating systems, such as JavaScript, Node.js, React and Redux, SOA, containers, and PHP, with both Windows and Linux. He has taught several computer science courses at Universidad de la República, Universidad ORT Uruguay, and Universidad de la Empresa, and he has also written texts for these courses. Federico has written several articles on JavaScript, web development, and open source topics for magazines such as Linux Journal and LinuxPro Magazine in the United States, Linux+ and Mundo Linux in Europe, and for websites such as Linux and IBM Developer Works. Kereki has given talks on functional programming with JavaScript at public conferences (such as JSCONF 2016) and has used these techniques to develop Internet systems for businesses in Uruguay and abroad.[/author]

Read More:
Introduction to the Functional Programming
What is the difference between functional and object oriented programming?
Manipulating functions in functional programming


ASP.NET Core High Performance

Packt
07 Mar 2018
20 min read
In this article by James Singleton, the author of the book ASP.NET Core High Performance, we will see that many things have changed for version 2 of the ASP.NET Core framework, and there have also been a lot of improvements to the various supporting technologies. Now is a great time to give it a try, as the code has stabilized and the pace of change has settled down a bit. There were significant differences between the original release candidate and version 1 of ASP.NET Core, and yet more alterations between version 1 and version 2. Some of these changes have been controversial, particularly around tooling, but the scope of .NET Core has grown massively, and ultimately this is a good thing.

One of the highest-profile differences between 1 and 2 is the change (and some would say regression) away from the new JavaScript Object Notation (JSON) based project format and back towards the Extensible Markup Language (XML) based .csproj format. However, it is a simplified and stripped-down version compared to the format used in the original .NET Framework. There has been a move towards standardization between the different .NET frameworks, and .NET Core 2 has a much larger API surface as a result. The interface specification known as .NET Standard 2 covers the intersection between .NET Core, the .NET Framework, and Xamarin. There is also an effort to standardize Extensible Application Markup Language (XAML) into the XAML Standard that will work across Universal Windows Platform (UWP) and Xamarin.Forms apps. C# and .NET can be used on a huge number of platforms and in a large number of use cases, from server-side web applications to mobile apps and even games using engines like Unity 3D.

In this article we will go over the changes between version 1 and version 2 of the new Core releases. We will also look at some new features of the C# language. There are many useful additions and a plethora of performance improvements too. In this article we will cover:

.NET Core 2 scope increases
ASP.NET Core 2 additions
Performance improvements
.NET Standard 2
New C# 6 features
New C# 7 features
JavaScript considerations

New in Core 2

There are two different products in the Core family. The first is .NET Core, which is the low-level framework providing basic libraries. This can be used to write console applications, and it is also the foundation for higher-level application frameworks. The second is ASP.NET Core, which is a framework for building web applications that run on a server and service clients (usually web browsers). This was originally the only workload for .NET Core, until it grew in scope to handle a more diverse range of scenarios. We'll cover the differences in the new versions separately for each of these frameworks. The changes in .NET Core will also apply to ASP.NET Core, unless you are running it on top of the .NET Framework version 4.

New in .NET Core 2

The main focus of .NET Core 2 is the huge increase in scope. There are more than double the number of APIs included, and it supports .NET Standard 2 (covered later in this article). You can also reference .NET Framework assemblies with no recompile required. This should just work as long as the assemblies only use APIs that have been implemented in .NET Core. This means that more NuGet packages will work with .NET Core. Finding out whether your favorite library was supported was always a challenge with the previous version. The author set up a repository listing package compatibility to help with this.
You can find the ASP.NET Core Library and Framework Support (ANCLAFS) list at github.com/jpsingleton/ANCLAFS and also via anclafs.com. If you want to make a change then please send a pull request. Hopefully in the future all packages will support Core and this list will no longer be required. There is now support in Core for Visual Basic and for more Linux distributions. You can also perform live unit testing with Visual Studio 2017, much like the old NCrunch extension.

Performance improvements

Some of the more interesting changes for 2 are the performance improvements over the original .NET Framework. There have been tweaks to the implementations of many of the framework data structures. Some of the classes and methods that have seen speed improvements or memory reduction include:

List<T>, Queue<T>, SortedSet<T>, ConcurrentQueue<T>, Lazy<T>
Enumerable.Concat(), Enumerable.OrderBy(), Enumerable.ToList(), Enumerable.ToArray()
DeflateStream, SHA256, BigInteger, BinaryFormatter, Regex
WebUtility.UrlDecode(), Encoding.UTF8.GetBytes(), Enum.Parse(), DateTime.ToString()
String.IndexOf(), String.StartsWith()
FileStream, Socket, NetworkStream, SslStream
ThreadPool, SpinLock

We won't go into specific benchmarks here, because benchmarking is hard and the improvements you see will clearly depend on your usage. The thing to take away is that lots of work has been done to increase the performance of .NET Core, both over the previous version 1 and over .NET Framework 4.7. Many of these changes have come from the community, which shows one of the benefits of open source development. Some of these advances will probably work their way back into a future version of the regular .NET Framework too.

There have also been improvements to the RyuJIT compiler for .NET Core 2. As just one example, finally blocks are now almost as efficient as not using exception handling at all, in the normal situation where no exceptions are thrown. You now have no excuses not to liberally use try and using blocks, for example by having checked arithmetic to avoid integer overflows.

New in ASP.NET Core 2

ASP.NET Core 2 takes advantage of all the improvements to .NET Core 2, if that is what you choose to run it on. It will also run on .NET Framework 4.7, but it's best to run it on .NET Core if you can. With the increase in scope and support of .NET Core 2, this should be less of a problem than it was previously. It includes a new meta package, so you only need to reference one NuGet item to get all the things! However, it is still composed of individual packages if you want to pick and choose. They haven't reverted back to the bad old days of one huge System.Web assembly. A new package-trimming feature will ensure that if you don't use a package then its binaries won't be included in your deployment, even if you use the meta package to reference it.

There is also a sensible default for setting up a web host configuration. You don't need to add logging, Kestrel, and IIS individually anymore. Logging has also got simpler and, as it is built in, you have no excuses not to use it from the start. A new feature is support for controller-less Razor Pages. These are exactly what they sound like, and they allow you to write pages with just a Razor template. This is similar to the Web Pages product, not to be confused with Web Forms. There is talk of Web Forms making a comeback, but if so then hopefully the abstraction will be thought out more and it won't carry so much state around with it. There is a new authentication model that makes better use of Dependency Injection.
ASP.NET Core Identity allows you to use OpenID and OAuth 2 and to get access tokens for your APIs. A nice time saver is that you no longer need to emit anti-forgery tokens in forms (to prevent Cross-Site Request Forgery) with attributes to validate them on POST methods. This is all done automatically for you, which should prevent you from forgetting to do this and leaving a security vulnerability.

Performance improvements

There have been additional increases to performance in ASP.NET Core that are not related to the improvements in .NET Core, which also help. Startup time has been reduced by shipping binaries that have already been through the Just-In-Time compilation process. Although not a new feature in ASP.NET Core 2, output caching is now available. In 1.0, only response caching was included, which simply set the correct HTTP headers. In 1.1, an in-memory cache was added, and today you can use local memory or a distributed cache kept in SQL Server or Redis.

Standards

Standards are important; that's why we have so many of them. The latest version of the .NET Standard is 2, and .NET Core 2 implements it. A good way to think about .NET Standard is as an interface that a class would implement. The interface defines an abstract API, but the concrete implementation of that API is left up to the classes that inherit from it. Another way to think about it is like the HTML5 standard that is supported by different web browsers.

Version 2 of the .NET Standard was defined by looking at the intersection of the .NET Framework and Mono. This standard was then implemented by .NET Core 2, which is why it contains so many more APIs than version 1. Version 4.6.1 of the .NET Framework also implements .NET Standard 2, and there is work to support the latest versions of the .NET Framework, UWP, and Xamarin (including Xamarin.Forms). There is also the new XAML Standard that aims to find the common ground between Xamarin.Forms and UWP. Hopefully it will include Windows Presentation Foundation (WPF) in the future. If you create libraries and packages that use these standards, they will work on all the platforms that support them. As a developer simply consuming libraries, you don't need to worry about these standards. It just means that you are more likely to be able to use the packages you want on the platforms you are working with.

New C# features

It's not just the frameworks and libraries that have been worked on. The underlying language has also had some nice new features added. We will focus on C# here, as it is the most popular language for the Common Language Runtime (CLR). Other options include Visual Basic and the functional programming language F#. C# is a great language to work with, especially when compared to a language like JavaScript. Although JavaScript is great for many reasons (such as its ubiquity and the number of frameworks available), the elegance and design of the language is not one of them. Many of these new features are just syntactic sugar, which means they don't add any new functionality. They simply provide a more succinct and easier-to-read way of writing code that does the same thing.

C# 6

Although the latest version of C# is 7, there are some very handy features in C# 6 that often go underused. Also, some of the new additions in 7 are improvements on features added in 6 and would not make much sense without context. We will quickly cover a few features of C# 6 here, in case you are unaware of how useful they can be.
String interpolation

String interpolation is a more elegant and easier-to-work-with version of the familiar string format method. Instead of supplying the arguments to embed in the string placeholders separately, you can now embed them directly in the string. This is far more readable and less error-prone. Let us demonstrate with an example. Consider the following code that embeds an exception in a string:

    catch (Exception e)
    {
        Console.WriteLine("Oh dear, oh dear! {0}", e);
    }

This embeds the first (and in this case only) object in the string at the position marked by the zero. It may seem simple, but this quickly gets complex if you have many objects and want to add another at the start. You then have to correctly renumber all the placeholders. Instead, you can now prefix the string with a dollar character and embed the object directly in it. This is shown in the following code, which behaves the same as the previous example:

    catch (Exception e)
    {
        Console.WriteLine($"Oh dear, oh dear! {e}");
    }

The ToString() method on an exception outputs all the required information, including the name, message, stack trace, and any inner exceptions. There is no need to deconstruct it manually; you may even miss things if you do. You can also use the same format strings as you are used to. Consider the following code that formats a date in a custom manner:

    Console.WriteLine($"Starting at: {DateTimeOffset.UtcNow:yyyy/MM/dd HH:mm:ss}");

When this feature was being built, the syntax was slightly different, so be wary of any old blog posts or documentation that may not be correct.

Null conditional

The null conditional operator is a way of simplifying null checks. You can now inline a check for null rather than using an if statement or ternary operator. This makes it easier to use in more places and will hopefully help you to avoid the dreaded null reference exception. You can avoid doing a manual null check like the one in the following code:

    int? length = (null == bytes) ? null : (int?)bytes.Length;

This can now be simplified to the following statement by adding a question mark:

    int? length = bytes?.Length;

Exception filters

You can filter exceptions more easily with the when keyword. You no longer need to catch every type of exception that you are interested in and then filter manually inside the catch block. This is a feature that was already present in VB and F#, so it's nice that C# has finally caught up. There are some small benefits to this approach. For example, if your filter is not matched, then the exception can still be caught by other catch blocks in the same try statement. You also don't need to remember to re-throw the exception to avoid it being swallowed. This helps with debugging, as Visual Studio will no longer break as it would when you throw. For example, you could check to see if there is a message in the exception and handle it differently, as shown here:

    catch (Exception e) when (e?.Message?.Length > 0)

When this feature was in development, a different keyword (if) was used, so be careful of any old information online. One thing to keep in mind is that relying on a particular exception message is fragile. If your application is localized, then the message may be in a different language than you expect. This holds true outside of exception filtering too.

Asynchronous availability

Another small improvement is that you can use the await keyword inside catch and finally blocks. This was not initially allowed when this incredibly useful feature was added in C# 5. There is not a lot more to say about this.
The implementation is complex, but you don't need to worry about it unless you're interested in the internals. From a developer point of view, it just works, as in this trivial example:

    catch (Exception e) when (e?.Message?.Length > 0)
    {
        await Task.Delay(200);
    }

This feature has been improved in C# 7, so read on. You will see async and await used a lot. Asynchronous programming is a great way of improving performance, and not just from within your C# code.

Expression bodies

Expression bodies allow you to assign an expression to a method or getter property using the lambda arrow operator (=>), which you may be familiar with from fluent LINQ syntax. You no longer need to provide a full statement or method signature and body. This feature has also been improved in C# 7, so see the examples in the next section. For example, a getter property can be implemented like so:

    public static string Text => $"Today: {DateTime.Now:o}";

A method can be written in a similar way, such as the following example:

    private byte[] GetBytes(string text) => Encoding.UTF8.GetBytes(text);

C# 7

The most recent version of the C# language is 7, and there are yet more improvements to readability and ease of use. We'll cover a subset of the more interesting changes here.

Literals

There are a couple of minor additional capabilities and readability enhancements when specifying literal values in code. You can specify binary literals, which means you don't have to work out how to represent them using a different base anymore. You can also put underscores anywhere within a literal to make it easier to read the number. The underscores are ignored but allow you to separate digits into convenient groupings. This is particularly well suited to the new binary literals, as listing out all those zeros and ones can be very verbose. Take the following example, which uses the new 0b prefix to specify a binary literal that will be rendered as an integer in a string:

    Console.WriteLine($"Binary solo! {0b0000001_00000011_000000111_00001111}");

You can do this with other bases too, such as this integer, which is formatted to use a thousands separator:

    Console.WriteLine($"Over {9_000:#,0}!"); // Prints "Over 9,000!"

Tuples

One of the big new features in C# 7 is support for tuples. Tuples are groups of values, and you can now return them directly from method calls. You are no longer restricted to returning a single value. Previously, you could work around this limitation in a few sub-optimal ways, including creating a custom complex object to return, perhaps with a Plain Old C# Object (POCO) or Data Transfer Object (DTO), which are the same thing. You could also have passed in a reference using the ref or out keywords, which, although there are improvements to their syntax, are still not great. There was System.Tuple in C# 6, but this wasn't ideal. It was a framework feature rather than a language feature, and the items were only numbered, not named. With C# 7 tuples, you can name the items, and they make a great alternative to anonymous types, particularly in LINQ query expression lambda functions. As an example, if you only want to work on a subset of the data available, perhaps when filtering a database table with an O/RM such as Entity Framework, then you could use a tuple for this. The following example returns a tuple from a method. You may need to add the System.ValueTuple NuGet package for this to work.
    private static (int one, string two, DateTime three) GetTuple()
    {
        return (one: 1, two: "too", three: DateTime.UtcNow);
    }

You can also use tuples in string interpolation, and all the values are rendered, as shown here:

    Console.WriteLine($"Tuple = {GetTuple()}");

Out variables

If you want to pass parameters into a method for modification, then you have always needed to declare them first. This is no longer necessary, and you can simply declare the variables at the point where you pass them in. You can also declare a variable to be discarded by using an underscore. This is particularly useful if you don't want to use the returned value, for example with some of the try-parse methods of the native framework data types. Here we parse a date without declaring the dt variable first:

    DateTime.TryParse("2017-08-09", out var dt);

In this example we test for an integer, but we don't care what it is:

    var isInt = int.TryParse("w00t", out _);

References

You can now return values by reference from a method, as well as consume them. This is a little like working with pointers in C, but safer. For example, you can only return references that were passed into the method, and you can't modify references to point to a different location in memory. This is a very specialist feature, but in certain niche situations it can dramatically improve performance. Given the following method:

    private static ref string GetFirstRef(ref string[] texts)
    {
        if (texts?.Length > 0)
        {
            return ref texts[0];
        }
        throw new ArgumentOutOfRangeException();
    }

You could call it like so, and the second console output line would appear differently (one instead of 1):

    var strings = new string[] { "1", "2" };
    ref var first = ref GetFirstRef(ref strings);
    Console.WriteLine($"{strings?[0]}"); // 1
    first = "one";
    Console.WriteLine($"{strings?[0]}"); // one

Patterns

The other big addition is that you can now match patterns in C# 7 using the is keyword. This simplifies testing for null and matching against types, among other things. It also lets you easily use the cast value. This is a simpler alternative to using full polymorphism (where a derived class can be treated as a base class and override methods). However, if you control the code base and can make use of proper polymorphism, then you should still do so and follow good Object-Oriented Programming (OOP) principles. In the following example, pattern matching is used to parse the type and value of an unknown object:

    private static int PatternMatch(object obj)
    {
        if (obj is null)
        {
            return 0;
        }
        if (obj is int i)
        {
            return i++;
        }
        if (obj is DateTime d ||
            (obj is string str && DateTime.TryParse(str, out d)))
        {
            return d.DayOfYear;
        }
        return -1;
    }

You can also use pattern matching in the cases of a switch statement, and you can switch on non-primitive types such as custom objects.

More expression bodies

Expression bodies are expanded from the offering in C# 6, and you can now use them in more places, for example as object constructors and property setters. Here we extend our previous example to include setting the value on the property we were previously just reading:

    private static string text;

    public static string Text
    {
        get => text ?? $"Today: {DateTime.Now:r}";
        set => text = value;
    }

More asynchronous improvements

There have been some small improvements to what async methods can return which, although minor, could offer big performance gains in certain situations. You no longer have to return a Task, which can be beneficial if the value is already available. This can reduce the overheads of using async methods and creating a Task object.
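For instance, a method can now return ValueTask<T>, a value type that avoids allocating a Task when the result is often already at hand. The following is a minimal sketch of that idea, not taken from the book; the GetCountAsync() method, the LoadCountAsync() helper, and the in-memory cachedCount field are hypothetical names chosen for the example:

    using System.Threading.Tasks;

    public class Counter
    {
        private int? cachedCount;

        // Returns synchronously from the cache when possible; only falls
        // back to an awaited Task (and its allocation) on a cache miss.
        public ValueTask<int> GetCountAsync()
        {
            if (cachedCount.HasValue)
            {
                return new ValueTask<int>(cachedCount.Value);
            }
            return new ValueTask<int>(LoadCountAsync());
        }

        private async Task<int> LoadCountAsync()
        {
            await Task.Delay(100); // stand-in for a real I/O call
            cachedCount = 42;
            return cachedCount.Value;
        }
    }

In the common case where the value is cached, no Task object is created at all, which is where the potential gain comes from.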
JavaScript

You can't write a book on web applications without covering JavaScript. It is everywhere. If you write a web app that does a full page load on every request and it's not a simple content site, then it will feel slow. Users expect responsiveness. If you are a back-end developer, then you may think that you don't have to worry about this. However, if you are building an API, then you may want to make it easy to consume with JavaScript, and you will need to make sure that your JSON is correctly and quickly serialized. Even if you are building a Single Page Application (SPA) in JavaScript (or TypeScript) that runs in the browser, the server can still play a key role. You can use SPA services to run Angular or React on the server and generate the initial output. This can increase performance, as the browser has something to render immediately. For example, there is a project called React.NET that integrates React with ASP.NET, and it supports ASP.NET Core. If you have been struggling to keep up with the latest developments in the .NET world, then JavaScript is on another level. There seems to be something new almost every week, and this can lead to framework fatigue and the paradox of choice: there is so much to choose from that you don't know what to pick.

Summary

In this article, you have seen a brief high-level summary of what has changed in .NET Core 2 and ASP.NET Core 2 compared to the previous versions. You are also now aware of .NET Standard 2 and what it is for. We have shown examples of some of the new features available in C# 6 and C# 7. These can be very useful in letting you write more with less, and in making your code more readable and easier to maintain.


Exploring Compilers

Packt
23 Jun 2017
17 min read
In this article by Gabriele Lanaro, author of the book Python High Performance - Second Edition, we will see that Python is a mature and widely used language, and there is a large interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter. In this article, we will explore two projects, Numba and PyPy, that approach compilation in slightly different ways. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically.

(For more resources related to this topic, see here.)

Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain. LLVM is a set of tools designed for writing compilers. It is language-agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is its intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly that can be compiled to machine code for a specific target platform. Numba works by inspecting Python functions and compiling them, using LLVM, to the IR. As we have already seen in the last article, speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution.

Note that Numba was developed to improve the performance of numerical code, and the development efforts often prioritize the optimization of applications that intensively use NumPy arrays. Numba is evolving really fast and can have substantial improvements between releases and, sometimes, backward-incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid any errors. The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of the squares of an array. The function definition is as follows:

    def sum_sq(a):
        result = 0
        N = len(a)
        for i in range(N):
            result += a[i]**2
        return result

To set up this function with Numba, it is sufficient to apply the nb.jit decorator:

    import numba as nb

    @nb.jit
    def sum_sq(a):
        ...

The nb.jit decorator won't do much when applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, performant version of the original function. To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute.
The timings for the two functions are as follows:

    import numpy as np
    x = np.random.rand(10000)

    # Original
    %timeit sum_sq.py_func(x)
    100 loops, best of 3: 6.11 ms per loop

    # Numba
    %timeit sum_sq(x)
    100000 loops, best of 3: 11.7 µs per loop

You can see how the Numba version is orders of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators:

    %timeit (x**2).sum()
    10000 loops, best of 3: 14.8 µs per loop

In this case, the Numba-compiled function is marginally faster than the NumPy vectorized operation. The reason for the extra speed of the Numba version is likely that the NumPy version allocates an extra array before performing the sum, in comparison with the in-place operations performed by our sum_sq function. As we didn't use array-specific methods in sum_sq, we can also try to apply the same function to a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension:

    x_list = x.tolist()
    %timeit sum_sq(x_list)
    1000 loops, best of 3: 199 µs per loop

    %timeit sum([x**2 for x in x_list])
    1000 loops, best of 3: 1.28 ms per loop

Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it's no wonder that what Numba does looks like magic. In the following sections, we will dig deeper to understand how Numba works and evaluate the benefits and limitations of the Numba compiler.

Type specializations

As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example. Numba exposes the specialized types using the signatures attribute. Right after the sum_sq definition, we can inspect the available specializations by accessing sum_sq.signatures, as follows:

    sum_sq.signatures
    # Output:
    # []

If we call this function with a specific argument, for instance an array of float64 numbers, we can see how Numba compiles a specialized version on the fly. If we also apply the function to an array of float32, we can see how a new entry is added to the sum_sq.signatures list:

    x = np.random.rand(1000).astype('float64')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),)]

    x = np.random.rand(1000).astype('float32')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),), (array(float32, 1d, C),)]

It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function. An individual signature can be passed as a tuple that contains the types we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument:

    @nb.jit((nb.float64[:],))
    def sum_sq(a):
        ...

Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass the array x as float32, Numba will raise a TypeError:

    sum_sq(x.astype('float32'))
    # TypeError: No matching definition for argument type(s)
    # array(float32, 1d, C)

Another way to declare signatures is through type strings.
For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using a [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows: @nb.jit("float64(float64[:])") def sum_sq(a): You can also pass multiple signatures by passing a list: @nb.jit(["float64(float64[:])", "float64(float32[:])"]) def sum_sq(a): Object mode versus native mode So far, we have shown how Numba behaves when handling a fairly simple function. In this case, Numba worked exceptionally well, and we obtained great performance on arrays and lists.The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython. When Numba cannot infer variable types, it will still try and compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode and is in contrast to the intepreter-free scenario, called native mode. Numba provides a function, called inspect_types, that helps understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function: sum_sq.inspect_types() When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and types associated with them. For example, we can examine the N = len(a) line: # --- LINE 4 --- # a = arg(0, name=a) :: array(float64, 1d, A) # $0.1 = global(len: <built-in function len>) :: Function(<built-in function len>) # $0.3 = call $0.1(a) :: (array(float64, 1d, A),) -> int64 # N = $0.3 :: int64 N = len(a) For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return type of the len function is also correctly identified (and likely optimized) as taking an array of float64 numbers and returning an int64. If you scroll through the output, you can see how all the variables have a well-defined type. Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode. As a counter example, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations. We can implement a function that concatenates a series of strings, and compiles it as follows: @nb.jit def concatenate(strings): result = '' for s in strings: result += s return result Now, we can invoke this function with a list of strings and inspect the types: concatenate(['hello', 'world']) concatenate.signatures # Output: [(reflected list(str),)] concatenate.inspect_types() Numba will return the output of the function for the reflected list (str) type. We can, for instance, examine how line 3 gets inferred. 
The output of concatenate.inspect_types() is reproduced here: # --- LINE 3 --- # strings = arg(0, name=strings) :: pyobject # $const0.1 = const(str, ) :: pyobject # result = $const0.1 :: pyobject # jump 6 # label 6 result = '' You can see how, this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled functions, we note that the compiled function is about three times slower than the pure Python counterpart: x = ['hello'] * 1000 %timeit concatenate.py_func(x) 10000 loops, best of 3: 111 µs per loop %timeit concatenate(x) 1000 loops, best of 3: 317 µs per loop This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call. As you may have noted, Numba compiled the code without complaints even if it is inefficient. The main reason for this is that Numba can still compile other sections of the code in an efficient manner while falling back to the Python interpreter for other parts of the code. This compilation strategy is called object mode. It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on first invocation: @nb.jit(nopython=True) def concatenate(strings): result = '' for s in strings: result += s return result concatenate(x) # Exception: # TypingError: Failed at nopython (nopython frontend) This feature is quite useful for debugging and ensuring that all the code is fast and correctly typed. Numba and NumPy Numba was originally developed to easily increase the performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler. Universal functions with Numba Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is the implementation of fast ufuncs. We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Also, universal functions that take multiple arguments still work according to the broadcasting rules. Examples of universal functions that take multiple arguments are np.add and np.subtract. Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance the function with the broadcasting feature. As an example, we will see how to write the Cantor pairing function. A pairing function is a function that encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations. The Cantor pairing function can be written as follows: import numpy as np def cantor(a, b): return int(0.5 * (a + b)*(a + b + 1) + b) As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator: @np.vectorize def cantor(a, b): return int(0.5 * (a + b)*(a + b + 1) + b) cantor(np.array([1, 2]), 2) # Result: # array([ 8, 12]) Convenience aside, defining universal functions in pure Python is not very useful, as it requires a lot of function calls that are affected by interpreter overhead.
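The benchmark in the next section refers to cantor_py (the np.vectorize version) and cantor (an nb.vectorize version), but the Numba definition and the benchmark inputs are not reproduced in this excerpt. The following sketch fills that gap; the explicit signature, the names x1 and x2, and the input ranges are assumptions rather than the chapter's exact setup:

import numpy as np
import numba as nb

# The pure Python ufunc that the benchmark calls cantor_py
cantor_py = np.vectorize(lambda a, b: int(0.5 * (a + b)*(a + b + 1) + b))

# The Numba ufunc: the same scalar code, compiled through nb.vectorize
@nb.vectorize(['int64(int64, int64)'])
def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

# Assumed benchmark inputs: two arrays of non-negative integers
x1 = np.random.randint(0, 1000, 10000)
x2 = np.random.randint(0, 1000, 10000)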
For this reason, ufunc implementations are usually written in C or Cython, but Numba beats these approaches in convenience: all that is needed to perform the conversion is the equivalent decorator, nb.vectorize, as in the sketch above. We can compare the speed of the standard np.vectorize version (called cantor_py in the following code), the Numba-compiled cantor, and the same computation implemented using standard NumPy operations: # Pure Python %timeit cantor_py(x1, x2) 100 loops, best of 3: 6.06 ms per loop # Numba %timeit cantor(x1, x2) 100000 loops, best of 3: 15 µs per loop # NumPy %timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int) 10000 loops, best of 3: 57.1 µs per loop You can see how the Numba version beats all the other options by a large margin! Numba works extremely well because the function is simple and type inference is possible. An additional advantage of universal functions is that, since they depend on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions by passing the target="parallel" (multicore CPU) or target="cuda" (GPU) keyword argument to the nb.vectorize decorator. Generalized universal functions One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays. A classic example is matrix multiplication. In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows: a = np.random.rand(3, 3) b = np.random.rand(3, 3) c = np.matmul(a, b) c.shape # Result: # (3, 3) As we saw in the previous subsection, a ufunc broadcasts an operation over arrays of scalars; its natural generalization is to broadcast over arrays of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match up the matrices pairwise and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices): a = np.random.rand(10, 3, 3) b = np.random.rand(10, 3, 3) c = np.matmul(a, b) c.shape # Output # (10, 3, 3) The usual rules for broadcasting work in a similar way. For example, if we have an array of ten (3, 3) matrices (which has a shape of (10, 3, 3)), we can use np.matmul to calculate the matrix product of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix is repeated to match the (10, 3, 3) shape: a = np.random.rand(10, 3, 3) b = np.random.rand(3, 3) # Broadcasted to shape (10, 3, 3) c = np.matmul(a, b) c.shape # Result: # (10, 3, 3) Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the Euclidean distance (more precisely, the squared distance, as the final square root is omitted) between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.
The nb.guvectorize decorator requires two arguments: The types of the input and output: two 1D arrays as input and a scalar as output The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n), and we output a scalar In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator: @nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n),(n)->()') def euclidean(a, b, out): N = a.shape[0] out[0] = 0.0 for i in range(N): out[0] += (a[i] - b[i])**2 There are a few very important points to be made. Predictably, we declared the types of the inputs a and b as float64[:], because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes; however, Numba treats scalar arguments as arrays of size 1. That's why it was declared as float64[:]. Similarly, the layout string indicates that we have two arrays of size (n) and the output is a scalar, denoted by empty brackets--(). However, the array out will be passed as an array of size 1. Also, note that we don't return anything from the function; all the output has to be written in the out array. The letter n in the layout string is completely arbitrary; you may choose to use k or other letters of your liking. Also, if you want to combine arrays of uneven sizes, you can use layout strings, such as (n, m). Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example: a = np.random.rand(2) b = np.random.rand(2) c = euclidean(a, b) # Shape: (1,) a = np.random.rand(10, 2) b = np.random.rand(10, 2) c = euclidean(a, b) # Shape: (10,) a = np.random.rand(10, 2) b = np.random.rand(2) c = euclidean(a, b) # Shape: (10,) How does the speed of euclidean compare to standard NumPy? In the following code, we benchmark a NumPy vectorized version against our previously defined euclidean function: a = np.random.rand(10000, 2) b = np.random.rand(10000, 2) %timeit ((a - b)**2).sum(axis=1) 1000 loops, best of 3: 288 µs per loop %timeit euclidean(a, b) 10000 loops, best of 3: 35.6 µs per loop The Numba version, again, beats the NumPy version by a large margin! Summary Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications. Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application. Resources for Article: Further resources on this subject: Getting Started with Python Packages [article] Python for Driving Hardware [article] Python Data Science Up and Running [article]

Replication Solutions in PostgreSQL

Packt
09 Mar 2017
14 min read
In this article by Chitij Chauhan and Dinesh Kumar, the authors of the book PostgreSQL High Performance Cookbook, we will talk about various high availability and replication solutions, including a popular third-party replication tool, Slony. (For more resources related to this topic, see here.) Setting up hot streaming replication Here in this recipe we are going to set up master/slave streaming replication. Getting ready For this exercise you would need two Linux machines, each with the latest version of PostgreSQL 9.6 installed. We will be using the following IP addresses for the master and slave servers. Master IP address: 192.168.0.4 Slave IP address: 192.168.0.5 How to do it… The following is the sequence of steps for setting up master/slave streaming replication: Set up passwordless authentication between the master and slave for the postgres user. First we are going to create a user ID on the master which will be used by the slave server to connect to the PostgreSQL database on the master server: psql -c "CREATE USER repuser REPLICATION LOGIN ENCRYPTED PASSWORD 'charlie';" Next, allow the replication user that was created in the previous step to access the master PostgreSQL server. This is done by making the necessary changes in the pg_hba.conf file: vi pg_hba.conf host replication repuser 192.168.0.5/32 md5 In the next step we are going to configure parameters in the postgresql.conf file. These parameters are required to be set in order to get streaming replication working: vi /var/lib/pgsql/9.6/data/postgresql.conf listen_addresses = '*' wal_level = hot_standby max_wal_senders = 3 wal_keep_segments = 8 archive_mode = on archive_command = 'cp %p /var/lib/pgsql/archive/%f && scp %p postgres@192.168.0.5:/var/lib/pgsql/archive/%f' Once the parameter changes have been made in the postgresql.conf file in the previous step, the next step is to restart the PostgreSQL server on the master in order for the changes to come into effect: pg_ctl -D /var/lib/pgsql/9.6/data restart Before the slave can replicate the master, we need to give it the initial database to build off. For this purpose we will make a base backup by copying the primary server's data directory to the standby: psql -U postgres -h 192.168.0.4 -c "SELECT pg_start_backup('label', true)" rsync -a /var/lib/pgsql/9.6/data/ 192.168.0.5:/var/lib/pgsql/9.6/data/ --exclude postmaster.pid psql -U postgres -h 192.168.0.4 -c "SELECT pg_stop_backup()" Once the data directory in the previous step is populated, the next step is to configure the following parameter in the postgresql.conf file on the slave server: hot_standby = on Next, copy recovery.conf.sample to the $PGDATA location on the slave server and then configure the following parameters: cp /usr/pgsql-9.6/share/recovery.conf.sample /var/lib/pgsql/9.6/data/recovery.conf standby_mode = on primary_conninfo = 'host=192.168.0.4 port=5432 user=repuser password=charlie' trigger_file = '/tmp/trigger.replication' restore_command = 'cp /var/lib/pgsql/archive/%f "%p"' Next, start the slave server: service postgresql-9.6 start Now that the preceding replication steps are set up, we will test for replication.
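Before running the functional test that follows, you can optionally confirm that streaming is active. This check is not part of the original recipe; it is a hedged sketch based on the standard PostgreSQL 9.6 monitoring views, assuming the IP addresses used above:

# On the master: one row should appear per connected standby
psql -U postgres -h 192.168.0.4 -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

# On the slave: returns 't' (true) while the server is running as a standby
psql -U postgres -h 192.168.0.5 -c "SELECT pg_is_in_recovery();"

If the first query returns a row in the streaming state, WAL is being shipped to the slave and the test below should succeed.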
On the master server, log in and issue the following SQL commands: psql -h 192.168.0.4 -d postgres -U postgres -W postgres=# create database test; postgres=# \c test test=# create table testtable ( testint int, testchar varchar(40) ); CREATE TABLE test=# insert into testtable values ( 1, 'What A Sight.' ); INSERT 0 1 On the slave server we will now check if the newly created database and the corresponding table from the previous step are replicated: psql -h 192.168.0.5 -d test -U postgres -W test=# select * from testtable; testint | testchar --------+---------------- 1 | What A Sight. (1 row) The wal_keep_segments parameter determines how many WAL files should be retained in the master's pg_xlog in case of network delays. However, if you do not want to assume a value for this, you can create a replication slot, which makes sure the master does not remove the WAL files in pg_xlog until they have been received by standbys. For more information refer to: https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS. How it works… The following is the explanation for the steps performed in the preceding section: In the initial step of the preceding section we create a user called repuser which will be used by the slave server to make a connection to the primary server. In step 2 of the preceding section we make the necessary changes in the pg_hba.conf file to allow the master server to be accessed by the slave server using the user ID repuser that was created in the previous step. We then make the necessary parameter changes on the master in step 4 of the preceding section for configuring streaming replication. The following is a description of these parameters:     listen_addresses: This parameter specifies the IP addresses that you want PostgreSQL to listen on. A value of * indicates all available IP addresses.     wal_level: This parameter determines the level of WAL logging done. Specify hot_standby for streaming replication.    wal_keep_segments: This parameter specifies the number of 16 MB WAL files to retain in the pg_xlog directory. The rule of thumb is that more such files may be required to handle a large checkpoint.     archive_mode: Setting this parameter enables completed WAL segments to be sent to archive storage.     archive_command: This parameter is basically a shell command that is executed whenever a WAL segment is completed. In our case we are basically copying the file to the local machine and then we are using the secure copy command to send it across to the slave.     max_wal_senders: This parameter specifies the total number of concurrent connections allowed from the slave servers. Once the necessary configuration changes have been made on the master server, we then restart the PostgreSQL server on the master in order for the new configuration changes to come into effect. This is done in step 5 of the preceding section. In step 6 of the preceding section, we were basically building the slave by copying the primary's data directory to the slave. Now with the data directory available on the slave, the next step is to configure it. We make the necessary replication-related parameter changes in the postgresql.conf file on the slave server. We set the following parameter on the slave:    hot_standby: This parameter determines whether we can connect and run queries while the server is in archive recovery or standby mode In the next step we are configuring the recovery.conf file.
This is required to be setup so that the slave can start receiving logs from the master. The following mentioned parameters are configured in the recovery.conf file on the slave:    standby_mode: This parameter when enabled causes PostgreSQL to work as a standby in a replication configuration.    primary_conninfo: This parameter specifies the connection information used by the slave to connect to the master. For our scenario the our master server is set as 192.168.0.4 on port 5432 and we are using the user ID repuser with password charlie to make a connection to the master. Remember that the repuser was the user ID which was created in the initial step of the preceding section for this purpose that is, connecting to the  master from the slave.    trigger_file: When slave is configured as a standby it will continue to restore the XLOG records from the master. The trigger_file parameter specifies what is used to trigger a slave to switch over its duties from standby and take over as master or being the primary server. At this stage the slave has been now fully configured and we then start the slave server and then replication process begins. In step 10 and 11 of the preceding section we are simply testing our replication. We first begin by creating a database test and then log into the test database and create a table by the name test table and then begin inserting some records into the test table. Now our purpose is to see whether these changes are replicated across the slave. To test this we then login into slave on the test database and then query the records from the test table as seen in step 10 of the preceding section. The final result that we see is that the all the records which are changed/inserted on the primary are visible on the slave. This completes our streaming replication setup and configuration. Replication using Slony Here in this recipe we are going to setup replication using Slony which is widely used replication engine. It replicates a desired set of tables data from one database to other. This replication approach is based on few event triggers which will be created on the source set of tables which will log the DML and DDL statements into a Slony catalog tables. By using Slony, we can also setup the cascading replication among multiple nodes. Getting ready The steps followed in this recipe are carried out on a CentOS Version 6 machine. We would first need to install Slony. The following mentioned are the steps needed to install Slony: First go to the mentioned web link and download the given software at http://slony.info/downloads/2.2/source/. 
Once you have downloaded the software, the next step is to unzip the tarball and then go to the newly created directory: tar xvfj slony1-2.2.3.tar.bz2 cd slony1-2.2.3 In the next step we are going to configure, compile, and build the software: ./configure --with-pgconfigdir=/usr/pgsql-9.6/bin/ make make install How to do it… The following is the sequence of steps required to replicate data between two tables using Slony replication: First start the PostgreSQL server if not already started: pg_ctl -D $PGDATA start In the next step we will create two databases, test1 and test2, which will be used as the source and target databases: createdb test1 createdb test2 In the next step we will create the table t_test on the source database test1 and insert some records into it: psql -d test1 test1=# create table t_test (id numeric primary key, name varchar); test1=# insert into t_test values(1,'A'),(2,'B'), (3,'C'); We will now set up the target database by copying the table definitions from the source database test1: pg_dump -s -p 5432 -h localhost test1 | psql -h localhost -p 5432 test2 We will now connect to the target database test2 and verify that there is no data in the tables of the test2 database: psql -d test2 test2=# select * from t_test; We will now set up a slonik script for the master/slave (that is, source/target) setup: vi init_master.slonik #! /bin/slonik cluster name = mycluster; node 1 admin conninfo = 'dbname=test1 host=localhost port=5432 user=postgres password=postgres'; node 2 admin conninfo = 'dbname=test2 host=localhost port=5432 user=postgres password=postgres'; init cluster ( id=1); create set (id=1, origin=1); set add table(set id=1, origin=1, id=1, fully qualified name = 'public.t_test'); store node (id=2, event node = 1); store path (server=1, client=2, conninfo='dbname=test1 host=localhost port=5432 user=postgres password=postgres'); store path (server=2, client=1, conninfo='dbname=test2 host=localhost port=5432 user=postgres password=postgres'); store listen (origin=1, provider = 1, receiver = 2); store listen (origin=2, provider = 2, receiver = 1); We will now create a slonik script for subscription on the slave, that is, the target: vi init_slave.slonik #!
/bin/slonik cluster name = mycluster; node 1 admin conninfo = 'dbname=test1 host=localhost port=5432 user=postgres password=postgres'; node 2 admin conninfo = 'dbname=test2 host=localhost port=5432 user=postgres password=postgres'; subscribe set ( id = 1, provider = 1, receiver = 2, forward = no); We will now run the init_master.slonik script created in step 6 and will run this on the master: cd /usr/pgsql-9.6/bin slonik init_master.slonik We will now run the init_slave.slonik script created in step 7 and will run this on the slave that is, target: cd /usr/pgsql-9.6/bin slonik init_slave.slonik In the next step we will start the master slon daemon: nohup slon mycluster "dbname=test1 host=localhost port=5432 user=postgres password=postgres" & In the next step we will start the slave slon daemon: nohup slon mycluster "dbname=test2 host=localhost port=5432 user=postgres password=postgres" & In the next step we will connect to the master that is, source database test1 and insert some records in the t_test table: psql -d test1 test1=# insert into t_test values (5,'E'); We will now test for replication by logging to the slave that is, target database test2 and see if the inserted records into the t_test table in the previous step are visible: psql -d test2 test2=# select * from t_test; id | name ----+------ 1 | A 2 | B 3 | C 5 | E (4 rows) How it works… We will now discuss about the steps followed in the preceding section: In step 1, we first start the PostgreSQL server if not already started. In step 2 we create two databases namely test1 and test2 that will serve as our source (master) and target (slave) databases. In step 3 of the preceding section we log into the source database test1 and create a table t_test and insert some records into the table. In step 4 of the preceding section we set up the target database test2 by copying the table definitions present in the source database and loading them into the target database test2 by using pg_dump utility. In step 5 of the preceding section we login into the target database test2 and verify that there are no records present in the table t_test because in step 5 we only extracted the table definitions into test2 database from test1 database. In step 6 we setup a slonik script for master/slave replication setup. In the file init_master.slonik we first define the cluster name as mycluster. We then define the nodes in the cluster. Each node will have a number associated to a connection string which contains database connection information. The node entry is defined both for source and target databases. The store_path commands are necessary so that each node knows how to communicate with the other. In step 7 we setup a slonik script for subscription of the slave that is, target database test2. Once again the script contains information such as cluster name, node entries which are designed a unique number related to connect string information. It also contains a subscriber set. In step 8 of the preceding section we run the init_master.slonik on the master. Similarly in step 9 we run the init_slave.slonik on the slave. In step 10 of the preceding section we start the master slon daemon. In step 11 of the preceding section we start the slave slon daemon. Subsequent section from step 12 and 13 of the preceding section are used to test for replication. For this purpose in step 12 of the preceding section we first login into the source database test1 and insert some records into the t_test table. 
To check if the newly inserted records have been replicated to target database test2 we login into the test2 database in step 13 and then result set obtained by the output of the query confirms that the changed/inserted records on the t_test table in the test1 database are successfully replicated across the target database test2. You may refer to the link given for more information regarding Slony replication at http://slony.info/documentation/tutorial.html. Summary We have seen how to setup streaming replication and then we looked at how to install and replicate using one popular third-party replication tool Slony. Resources for Article: Further resources on this subject: Introducing PostgreSQL 9 [article] PostgreSQL Cookbook - High Availability and Replication [article] PostgreSQL in Action [article]

Learning the Basic Nature of F# Code

Packt
02 Nov 2016
6 min read
In this article by Eriawan Kusumawardhono, author of the book, F# High Performance explains why F# has been a first class citizen, a built in part of programming languages support in Visual Studio, starting from Visual Studio 2010. Though F# is a programming language that has its own unique trait: it is a functional programming language but at the same time it has OOP support. F# from the start has run on .NET, although we can also run F# on cross-platform, such as Android (using Mono). (For more resources related to this topic, see here.) Although F# mostly runs faster than C# or VB when doing computations, its own performance characteristics and some not so obvious bad practices and subtleties may have led to performance bottlenecks. The bottlenecks may or may not be faster than C#/VB counterparts, although some of the bottlenecks may share the same performance characteristics, such as the use of .NET APIs. The main goal of this book is to identify performance problems in F#, measuring and also optimizing F# code to run more efficiently while also maintaining the functional programming style as appropriately as possible. A basic knowledge of F# (including the functional programming concept and basic OOP) is required as a prerequisite to start understanding the performance problems and the optimization of F#. There are many ways and definitions to define F# performance characteristics and at the same time measure them, but understanding the mechanics of running F# code, especially on top of .NET, is crucial and it's also a part of the performance characteristic itself. This includes other aspects of approaches to identify concurrency problems and language constructs. Understanding the nature of F# code Understanding the nature of F# code is very crucial and it is a definitive prerequisite before we begin to measure how long it runs and its effectiveness. We can measure a running F# code by running time, but to fully understand why it may run slow or fast, there are some basic concepts we have to consider first. Before we dive more into this, we must meet the basic requirements and setup. After the requirements have been set, we need to put in place the environment setting of Visual Studio 2015. We have to set this, because we need to maintain the consistency of the default setting of Visual Studio. The setting should be set to General. These are the steps: Select the Tools menu from Visual Studio's main menu. Select Import and Export Settings... and the Import and Export Settings Wizard screen is displayed. Select Reset all Settings and then Next to proceed. Select No, just reset my settings overwriting my current setting and then Next to proceed. Select General and then Next to proceed After setting it up, we will have a consistent layout to be used throughout this book, including the menu locations and the look and feel of Visual Studio. Now we are going to scratch the surface of F# runtime with an introductory overview of common F# runtime, which will give us some insights into F# performance. F# runtime characteristics The release of Visual Studio 2015 occurred at the same time as the release of .NET 4.6 and the rest of the tools, including the F# compiler. The compiler version of F# in Visual Studio 2015 is F# 4.0. F# 4.0 has no large differences or notable new features compared to the previous version, F# 3.0 in Visual Studio 2013. Its runtime characteristic is essentially the same as F# 4.0, although there are some subtle performance improvements and bug fixes. 
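As a brief aside before looking at the runtime characteristics, the quickest way to measure how long a piece of F# code runs during exploration is the #time directive in F# Interactive. The snippet below is an illustration added here rather than part of the original article, and the function being timed is arbitrary:

// In F# Interactive (FSI), toggle per-evaluation timing
#time "on"

// An arbitrary CPU-bound function to time
let sumOfSquares n =
    Seq.init n float
    |> Seq.sumBy (fun x -> x * x)

sumOfSquares 10000000
// FSI prints elapsed real time, CPU time, and GC generation counts

#time "off"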
For more information on what's new in F# 4.0 (described as release notes) visit: https://github.com/Microsoft/visualfsharp/blob/fsharp4/CHANGELOG.md. At the time of writing this book, the online and offline MSDN Library of F# in Visual Studio does not have F# 4.0 release notes documentation, but can always go to the GitHub repository of F# to check the latest update. These are the common characteristics of F# as part of managed programming language: F# must conform to .NET CLR. This includes the compatibilities, the IL emitted after compile, and support for .NET BCL (the basic class library). Therefore, F# functions and libraries can be used by other CLR compliant languages such as C#, VB, and managed C++. The debug symbols (PDB) have the same format and semantic as other CLR compliant languages. This is important, because F# code must be able to be debugged from other CLR compliant languages as well. From the managed languages perspective, measuring performance of F# is similar when measured by tools such as the CLR profiler. But from a F# unique perspective, these are F#-only unique characteristics: By default, all types in F# are immutable. Therefore, it's safe to assume it is intrinsically thread safe. F# has a distinctive collection library, and it is immutable by default. It is also safe to assume it is intrinsically thread safe. F# has a strong type inference model, and when a generic type is inferred without any concrete type, it automatically performs generalizations. Default functions in F# are implemented internally by creating an internal class derived from F#’s FastFunc. This FastFunc is essentially a delegate that is used by F# to apply functional language constructs such as currying and partial application. With tail call recursive optimization in the IL, the F# compiler may emit .tail IL, and then the CLR will recognize this and perform optimization at runtime. F# has inline functions as option F# has a computation workflow that is used to compose functions F# async computation doesn't need Task<T> to implement it. Although F# async doesn't need the Task<T> object, it can operate well with the async-await model in C# and VB. The async-await model in C# and VB is inspired by F# async, but behaves semantically differently based on more things than just the usage of Task<T>. All of those characteristics are not only unique, but they can also have performance implications when used to interoperate with C# and VB. Summary This article explained the basic introduction to F# IDE, along with runtime characteristics of F#. Resources for Article: Further resources on this subject: Creating an F# Project [article] Unit Testing [article] Working with Windows Phone Controls [article]

It's All About Data

Packt
14 Sep 2016
12 min read
In this article by Samuli Thomasson, the author of the book, Haskell High Performance Programming, we will learn how to choose and design optimal data structures in applications. You will be able to drop the level of abstraction in slow parts of code, all the way to mutable data structures if necessary. (For more resources related to this topic, see here.) Annotating strictness and unpacking datatype fields We used the BangPatterns extension to make function arguments strict: {-# LANGUAGE BangPatterns #-} f !s (x:xs) = f (s + 1) xs f !s _ = s Using bangs for annotating strictness in fact predates the BangPatterns extension (and the older compiler flag -fbang-patterns in GHC 6.x). With just plain Haskell98, we are allowed to use bangs to make datatype fields strict: > data T = T !Int A bang in front of a field ensures that whenever the outer constructor (T) is in WHNF, the inner field is in WHNF as well. We can check this: > T undefined `seq` () *** Exception: Prelude.undefined There are no restrictions on which fields can be strict, be they recursive or polymorphic fields, although it rarely makes sense to make recursive fields strict. Consider the fully strict linked list: data List a = List !a !(List a) | ListEnd With this much strictness, you cannot represent parts of infinite lists without always requiring infinite space. Moreover, before accessing the head of a finite strict list you must evaluate the list all the way to the last element. Strict lists don't have the streaming property of lazy lists. By default, all data constructor fields are pointers to other data constructors or primitives, regardless of their strictness. This applies to basic data types such as Int, Double, Char, and so on, which are not primitive in Haskell. They are data constructors over their primitive counterparts Int#, Double#, and Char#: > :info Int data Int = GHC.Types.I# GHC.Prim.Int# There is a performance overhead, the size of a pointer dereference, between types such as Int and Int#, but an Int can represent lazy values (called thunks), whereas primitives cannot. Without thunks, we couldn't have lazy evaluation. Luckily, GHC is intelligent enough to unroll wrapper types as primitives in many situations, completely eliminating indirect references. The hash suffix is specific to GHC and always denotes a primitive type. The GHC modules do expose the primitive interface. Programming with primitives, you can further micro-optimize code and get C-like performance. However, several limitations and drawbacks apply. Using anonymous tuples Tuples may seem harmless at first; they just lump a bunch of values together. But note that the fields in a tuple aren't strict, so a two-tuple corresponds to the slowest PairP data type from our previous benchmark. If you need a strict Tuple type, you need to define one yourself. This is also one more reason to prefer custom types over nameless tuples in many situations. These two structurally similar tuple types have widely different performance semantics: data Tuple = Tuple {-# UNPACK #-} !Int {-# UNPACK #-} !Int data Tuple2 = Tuple2 {-# UNPACK #-} !(Int, Int) If you really want unboxed anonymous tuples, you can enable the UnboxedTuples extension and write things with types like (# Int#, Char# #). But note that a number of restrictions apply to unboxed tuples, as they do to all primitives. The most important restriction is that unboxed types may not occur where polymorphic types or values are expected, because polymorphic values are always considered as pointers.
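To make the difference concrete, the earlier seq experiment can be repeated in GHCi against both a plain pair and the strict Tuple type defined above (an illustrative session added here, not taken from the book's files):

> (undefined, undefined) `seq` ()     -- lazy fields: only the (,) constructor is forced
()
> Tuple undefined undefined `seq` ()  -- strict fields: forcing the constructor forces both Ints
*** Exception: Prelude.undefined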
Representing bit arrays One way to define a bit array in Haskell that still retains the convenience of Bool is: import Data.Array.Unboxed type BitArray = UArray Int Bool This representation packs 8 bits per byte, so it's space efficient. See the following section on arrays in general to learn about time efficiency; for now, we only note that BitArray is an immutable data structure, like BitStruct, and that copying small BitStructs is cheaper than copying BitArrays due to overheads in UArray. Consider a program that processes a list of integers and tells whether it contains even or odd counts of numbers divisible by 2, 3, and 5. We can implement this with simple recursion and a three-bit accumulator. Here are three alternative representations for the accumulator: type BitTuple = (Bool, Bool, Bool) data BitStruct = BitStruct !Bool !Bool !Bool deriving Show type BitArray = UArray Int Bool And the program itself is defined along these lines: go :: acc -> [Int] -> acc go acc [] = acc go (two three five) (x:xs) = go ((test 2 x `xor` two) (test 3 x `xor` three) (test 5 x `xor` five)) xs test n x = x `mod` n == 0 I've omitted the details here. They can be found in the bitstore.hs file. The fastest variant is BitStruct, then comes BitTuple (30% slower), and BitArray is the slowest (130% slower than BitStruct). Although BitArray is the slowest (due to making a copy of the array on every iteration), it would be easy to scale the array in size or make it dynamic. Note also that this benchmark is really on the extreme side; normally programs do a bunch of other stuff besides updating an array in a tight loop. If you need fast array updates, you can resort to the mutable arrays discussed later on. It might also be tempting to use Data.Vector.Unboxed.Vector Bool from the vector package, due to its nice interface. But beware that that representation uses one byte for every Bool, wasting 7 bits per element. Mutable references are slow Data.IORef and Data.STRef are the smallest bits of mutable state, references to mutable variables, one for IO and the other for ST. There is also a Data.STRef.Lazy module, which provides a wrapper over strict STRef for lazy ST. However, because IORef and STRef are references, they imply a level of indirection. GHC intentionally does not optimize it away, as that would cause problems in concurrent settings. For this reason, IORef or STRef shouldn't be used like variables in C, for example. Performance will for sure be very bad. Let's verify the performance hit by considering the following ST-based sum-of-range implementation: -- file: sum_mutable.hs import Control.Monad.ST import Data.STRef count_st :: Int -> Int count_st n = runST $ do ref <- newSTRef 0 let go 0 = readSTRef ref go i = modifySTRef' ref (+ i) >> go (i - 1) go n And compare it to this pure recursive implementation: count_pure :: Int -> Int count_pure n = go n 0 where go 0 s = s go i s = go (i - 1) $! (s + i) The ST implementation is many times slower when at least -O is enabled. Without optimizations, the two functions are more or less equivalent in performance; there is a similar amount of indirection from not unboxing arguments in the latter version. This is one example of the wonders that can be done to optimize referentially transparent code.
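To measure the gap yourself, the two implementations can be compared with the criterion library. This is a minimal sketch added for illustration; the file name and input size are assumptions, and count_st and count_pure are assumed to be the definitions shown above:

-- file: sum_mutable_bench.hs (assumes count_st and count_pure are in scope)
import Criterion.Main

main :: IO ()
main = defaultMain
  [ bench "count_st"   $ whnf count_st   1000000
  , bench "count_pure" $ whnf count_pure 1000000
  ]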
Bubble sort with vectors Bubble sort is not an efficient sort algorithm, but because it's an in-place algorithm and simple, we will implement it as a demonstration of mutable vectors: -- file: bubblesort.hs import Control.Monad.ST import Control.Monad.Primitive (PrimMonad, PrimState) import Data.Vector as V import Data.Vector.Mutable as MV import System.Random (randomIO) -- for testing The (naive) bubble sort compares values at all adjacent indices in order, and swaps the values if necessary. After reaching the last element, it starts from the beginning or, if no swaps were made, the list is sorted and the algorithm is done: bubblesortM :: (Ord a, PrimMonad m) => MVector (PrimState m) a -> m () bubblesortM v = loop where indices = V.fromList [1 .. MV.length v - 1] loop = do swapped <- V.foldM' f False indices -- (1) if swapped then loop else return () -- (2) f swapped i = do -- (3) a <- MV.read v (i-1) b <- MV.read v i if a > b then MV.swap v (i-1) i >> return True else return swapped At (1), we fold monadically over all indices but the first, keeping state about whether or not we have performed a swap in this iteration. If we had, at (2) we rerun the fold or, if not, we can return. At (3) we compare an index and possibly swap values. We can write a pure function that wraps the stateful algorithm: bubblesort :: Ord a => Vector a -> Vector a bubblesort v = runST $ do mv <- V.thaw v bubblesortM mv V.freeze mv V.thaw and V.freeze (both O(n)) can be used to go back and forth between mutable and immutable vectors. Now, there are multiple code optimization opportunities in our implementation of bubble sort. But before tackling those, let's see how well our straightforward implementation fares using the following main: main = do v <- V.generateM 10000 $ \_ -> randomIO :: IO Double let v_sorted = bubblesort v median = v_sorted ! 5000 print median We should remember to compile with -O2. On my machine, this program takes about 1.55s, and the Runtime System reports 99.9% productivity, 18.7 megabytes of allocated heap, and 570 kilobytes copied during GC. So now, with a baseline, let's see if we can squeeze out more performance from vectors. This is a non-exhaustive list: Use unboxed vectors instead. This restricts the types of elements we can store, but it saves us a level of indirection. Down to 960ms and approximately halved GC traffic. Large lists are inefficient, and they don't compose with vector stream fusion. We should change indices so that it uses V.enumFromTo instead (alternatively, turn on the OverloadedLists extension and drop V.fromList). Down to 360ms and 94% less GC traffic. Conversion functions V.thaw and V.freeze are O(n), that is, they make copies. Using the in-place V.unsafeThaw and V.unsafeFreeze instead is sometimes useful. V.unsafeFreeze in the bubblesort wrapper is completely safe, but V.unsafeThaw is not. In our example, however, with -O2, the program is optimized into a single loop and all those conversions get eliminated. Vector operations (MV.read, MV.swap) in bubblesortM are guaranteed to never be out of bounds, so it's perfectly safe to replace these with unsafe variants (MV.unsafeRead, MV.unsafeSwap) that don't check bounds. Speed-up of about 25 milliseconds, or 5%. To summarize, applying good practices and safe usage of unsafe functions, our bubble sort just got 80% faster. These optimizations are applied in the bubblesort-optimized.hs file (omitted here; a sketch follows below). We noticed that almost all GC traffic came from a linked list, which was constructed and immediately consumed. Lists are bad for performance in that they don't fuse like vectors.
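Since bubblesort-optimized.hs itself is omitted, here is a sketch of what the optimized version looks like after applying the changes listed above. It is a reconstruction for illustration; the shipped file may differ in details such as module structure:

-- file: bubblesort-optimized.hs (sketch)
import Control.Monad.ST
import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

bubblesortM :: (Ord a, V.Unbox a, PrimMonad m) => MV.MVector (PrimState m) a -> m ()
bubblesortM v = loop
  where
    -- V.enumFromTo fuses, unlike the intermediate list built by V.fromList
    indices = V.enumFromTo 1 (MV.length v - 1)
    loop = do
      swapped <- V.foldM' f False indices
      if swapped then loop else return ()
    f swapped i = do
      -- indices are known to be in bounds, so the unchecked accessors are safe here
      a <- MV.unsafeRead v (i - 1)
      b <- MV.unsafeRead v i
      if a > b
        then MV.unsafeSwap v (i - 1) i >> return True
        else return swapped

bubblesort :: (Ord a, V.Unbox a) => V.Vector a -> V.Vector a
bubblesort v = runST $ do
  mv <- V.thaw v   -- with -O2 the copies made by thaw/freeze are optimized away, as noted above
  bubblesortM mv
  V.freeze mv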
To ensure good vector performance, ensure that the fusion framework can work effectively. Anything that can be done with a vector should be done. As a final note, when working with vectors (and other libraries) it's a good idea to keep the Haddock documentation handy. There are several big and small performance choices to be made. Often the difference is that between O(n) and O(1). Speedup via continuation-passing style Implementing monads in continuation-passing style (CPS) can have very good results. Unfortunately, no widely-used or supported library I'm aware of provides drop-in replacements for the ubiquitous Maybe, List, Reader, Writer, and State monads. It's not that hard to implement the standard monads in CPS from scratch. For example, the State monad can be implemented using the Cont monad from mtl as follows: -- file: cont-state-writer.hs {-# LANGUAGE GeneralizedNewtypeDeriving #-} {-# LANGUAGE MultiParamTypeClasses #-} {-# LANGUAGE FlexibleInstances #-} {-# LANGUAGE FlexibleContexts #-} import Control.Monad.State.Strict import Control.Monad.Cont newtype StateCPS s r a = StateCPS (Cont (s -> r) a) deriving (Functor, Applicative, Monad, MonadCont) instance MonadState s (StateCPS s r) where get = StateCPS $ cont $ \next curState -> next curState curState put newState = StateCPS $ cont $ \next curState -> next () newState runStateCPS :: StateCPS s s () -> s -> s runStateCPS (StateCPS m) = runCont m (\_ -> id) In case you're not familiar with the continuation-passing style and the Cont monad, the details might not make much sense: instead of just returning results from a function, a function in CPS applies its results to a continuation. So, in short, to "get" the state in continuation-passing style, we pass the current state to the "next" continuation (first argument) and don't change the state (second argument). To "put", we call the continuation with unit (no return value) and change the state to the new state (second argument to next). StateCPS is used just like the State monad: action :: MonadState Int m => m () action = replicateM_ 1000000 $ do i <- get put $! i + 1 main = do print $ (runStateCPS action 0 :: Int) print $ (snd $ runState action 0 :: Int) That action operation is, in the CPS version of the state monad, about 5% faster and performs 30% less heap allocation than the state monad from mtl. This program is limited pretty much only by the speed of monadic composition, so these numbers are at least very close to the maximum speedup we can have from CPSing the state monad. Speedups of the writer monad are probably near these results. Other standard monads can be implemented similarly to StateCPS. The definitions can also be generalized to monad transformers over an arbitrary monad (a la ContT). For extra speed, you might wish to combine many monads in a single CPS monad, similarly to what RWST does. Summary We witnessed the performance of the bytestring, text, and vector libraries, all of which get their speed from fusion optimizations, in contrast to linked lists, which have a huge overhead despite also being subject to fusion to some degree. However, linked lists give rise to simple difference lists and zippers. The builder patterns for lists, bytestring, and text were introduced. We discovered that the array package is low-level and clumsy compared to the superior vector package, unless you must support Haskell 98. We also saw how to implement bubble sort using vectors and how to speed up via continuation-passing style.
Resources for Article: Further resources on this subject: Data Tables and DataTables Plugin in jQuery 1.3 with PHP [article] Data Extracting, Transforming, and Loading [article] Linking Data to Shapes [article]

Exploring Scala Performance

Packt
19 May 2016
19 min read
In this article by Michael Diamant and Vincent Theron, author of the book Scala High Performance Programming, we look at how Scala features get compiled with bytecode. (For more resources related to this topic, see here.) Value classes The domain model of the order book application included two classes, Price and OrderId. We pointed out that we created domain classes for Price and OrderId to provide contextual meanings to the wrapped BigDecimal and Long. While providing us with readable code and compilation time safety, this practice also increases the amount of instances that are created by our application. Allocating memory and generating class instances create more work for the garbage collector by increasing the frequency of collections and by potentially introducing additional long-lived objects. The garbage collector will have to work harder to collect them, and this process may severely impact our latency. Luckily, as of Scala 2.10, the AnyVal abstract class is available for developers to define their own value classes to solve this problem. The AnyVal class is defined in the Scala doc (http://www.scala-lang.org/api/current/#scala.AnyVal) as, "the root class of all value types, which describe values not implemented as objects in the underlying host system." The AnyVal class can be used to define a value class, which receives special treatment from the compiler. Value classes are optimized at compile time to avoid the allocation of an instance, and instead, they use the wrapped type. Bytecode representation As an example, to improve the performance of our order book, we can define Price and OrderId as value classes: case class Price(value: BigDecimal) extends AnyVal case class OrderId(value: Long) extends AnyVal To illustrate the special treatment of value classes, we define a dummy function taking a Price value class and an OrderId value class as arguments: def printInfo(p: Price, oId: OrderId): Unit = println(s"Price: ${p.value}, ID: ${oId.value}") From this definition, the compiler produces the following method signature: public void printInfo(scala.math.BigDecimal, long); We see that the generated signature takes a BigDecimal object and a long object, even though the Scala code allows us to take advantage of the types defined in our model. This means that we cannot use an instance of BigDecimal or Long when calling printInfo because the compiler will throw an error. An interesting thing to notice is that the second parameter of printInfo is not compiled as Long (an object), but long (a primitive type, note the lower case 'l'). Long and other objects matching to primitive types, such as Int,Float or Short, are specially handled by the compiler to be represented by their primitive type at runtime. Value classes can also define methods. Let's enrich our Price class, as follows: case class Price(value: BigDecimal) extends AnyVal { def lowerThan(p: Price): Boolean = this.value < p.value } // Example usage val p1 = Price(BigDecimal(1.23)) val p2 = Price(BigDecimal(2.03)) p1.lowerThan(p2) // returns true Our new method allows us to compare two instances of Price. At compile time, a companion object is created for Price. This companion object defines a lowerThan method that takes two BigDecimal objects as parameters. 
In reality, when we call lowerThan on an instance of Price, the code is transformed by the compiler from an instance method call to a static method call that is defined in the companion object: public final boolean lowerThan$extension(scala.math.BigDecimal, scala.math.BigDecimal); Code: 0: aload_1 1: aload_2 2: invokevirtual #56 // Method scala/math/BigDecimal.$less:(Lscala/math/BigDecimal;)Z 5: ireturn If we were to write the pseudo-code equivalent to the preceding Scala code, it would look something like the following: val p1 = BigDecimal(1.23) val p2 = BigDecimal(2.03) Price.lowerThan(p1, p2) // returns true   Performance considerations Value classes are a great addition to our developer toolbox. They help us reduce the count of instances and spare some work for the garbage collector, while allowing us to rely on meaningful types that reflect our business abstractions. However, extending AnyVal comes with a certain set of conditions that the class must fulfill. For example, a value class may only have one primary constructor that takes one public val as a single parameter. Furthermore, this parameter cannot be a value class. We saw that value classes can define methods via def. Neither val nor var are allowed inside a value class. A nested class or object definitions are also impossible. Another limitation prevents value classes from extending anything other than a universal trait, that is, a trait that extends Any, only has defs as members, and performs no initialization. If any of these conditions is not fulfilled, the compiler generates an error. In addition to the preceding constraints that are listed, there are special cases in which a value class has to be instantiated by the JVM. Such cases include performing a pattern matching or runtime type test, or assigning a value class to an array. An example of the latter looks like the following snippet: def newPriceArray(count: Int): Array[Price] = { val a = new Array[Price](count) for(i <- 0 until count){ a(i) = Price(BigDecimal(Random.nextInt())) } a } The generated bytecode is as follows: public highperfscala.anyval.ValueClasses$$anonfun$newPriceArray$1(highperfscala.anyval.ValueClasses$Price[]); Code: 0: aload_0 1: aload_1 2: putfield #29 // Field a$1:[Lhighperfscala/anyval/ValueClasses$Price; 5: aload_0 6: invokespecial #80 // Method scala/runtime/AbstractFunction1$mcVI$sp."<init>":()V 9: return public void apply$mcVI$sp(int); Code: 0: aload_0 1: getfield #29 // Field a$1:[Lhighperfscala/anyval/ValueClasses$Price; 4: iload_1 5: new #31 // class highperfscala/anyval/ValueClasses$Price // omitted for brevity 21: invokevirtual #55 // Method scala/math/BigDecimal$.apply:(I)Lscala/math/BigDecimal; 24: invokespecial #59 // Method highperfscala/anyval/ValueClasses$Price."<init>":(Lscala/math/BigDecimal;)V 27: aastore 28: return Notice how mcVI$sp is invoked from newPriceArray, and this creates a new instance of ValueClasses$Price at the 5 instruction. As turning a single field case class into a value class is as trivial as extending the AnyVal trait, we recommend that you always use AnyVal wherever possible. The overhead is quite low, and it generate high benefits in terms of garbage collection's performance. To learn more about value classes, their limitations and use cases, you can find detailed descriptions at http://docs.scala-lang.org/overviews/core/value-classes.html. Tagged types – an alternative to value classes Value classes are an easy to use tool, and they can yield great improvements in terms of performance. 
However, they come with a constraining set of conditions, which can make them impossible to use in certain cases. We will conclude this section with a glance at an interesting alternative to leveraging the tagged type feature that is implemented by the Scalaz library. The Scalaz implementation of tagged types is inspired by another Scala library, named shapeless. The shapeless library provides tools to write type-safe, generic code with minimal boilerplate. While we will not explore shapeless, we encourage you to learn more about the project at https://github.com/milessabin/shapeless. Tagged types are another way to enforce compile-type checking without incurring the cost of instance instantiation. They rely on the Tagged structural type and the @@ type alias that is defined in the Scalaz library, as follows: type Tagged[U] = { type Tag = U } type @@[T, U] = T with Tagged[U] Let's rewrite part of our code to leverage tagged types with our Price object: object TaggedTypes { sealed trait PriceTag type Price = BigDecimal @@ PriceTag object Price { def newPrice(p: BigDecimal): Price = Tag[BigDecimal, PriceTag](p) def lowerThan(a: Price, b: Price): Boolean = Tag.unwrap(a) < Tag.unwrap(b) } } Let's perform a short walkthrough of the code snippet. We will define a PriceTag sealed trait that we will use to tag our instances, a Price type alias is created and defined as a BigDecimal object tagged with PriceTag. The Price object defines useful functions, including the newPrice factory function that is used to tag a given BigDecimal object and return a Price object (that is, a tagged BigDecimal object). We will also implement an equivalent to the lowerThan method. This function takes two Price objects (that is two tagged BigDecimal objects), extracts the content of the tag that are two BigDecimal objects, and compares them. Using our new Price type, we rewrite the same newPriceArray function that we previously looked at (the code is omitted for brevity, but you can refer to it in the attached source code), and print the following generated bytecode: public void apply$mcVI$sp(int); Code: 0: aload_0 1: getfield #29 // Field a$1:[Ljava/lang/Object; 4: iload_1 5: getstatic #35 // Field highperfscala/anyval/TaggedTypes$Price$.MODULE$:Lhighperfscala/anyval/TaggedTypes$Price$; 8: getstatic #40 // Field scala/package$.MODULE$:Lscala/package$; 11: invokevirtual #44 // Method scala/package$.BigDecimal:()Lscala/math/BigDecimal$; 14: getstatic #49 // Field scala/util/Random$.MODULE$:Lscala/util/Random$; 17: invokevirtual #53 // Method scala/util/Random$.nextInt:()I 20: invokevirtual #58 // Method scala/math/BigDecimal$.apply:(I)Lscala/math/BigDecimal; 23: invokevirtual #62 // Method highperfscala/anyval/TaggedTypes$Price$.newPrice:(Lscala/math/BigDecimal;)Ljava/lang/Object; 26: aastore 27: return In this version, we no longer see an instantiation of Price, even though we are assigning them to an array. The tagged Price implementation involves a runtime cast, but we anticipate that the cost of this cast will be less than the instance allocations (and garbage collection) that was observed in the previous value class Price strategy. Specialization To understand the significance of specialization, it is important to first grasp the concept of object boxing. The JVM defines primitive types (boolean, byte, char, float, int, long, short, and double) that are stack allocated rather than heap allocated. 
When a generic type is introduced, for example, scala.collection.immutable.List, the JVM references an object equivalent, instead of a primitive type. In this example, an instantiated list of integers would be heap allocated objects rather than integer primitives. The process of converting a primitive to its object equivalent is called boxing, and the reverse process is called unboxing. Boxing is a relevant concern for performance-sensitive programming because boxing involves heap allocation. In performance-sensitive code that performs numerical computations, the cost of boxing and unboxing can create an order of magnitude or larger performance slowdowns. Consider the following example to illustrate boxing overhead: List.fill(10000)(2).map(_* 2) Creating the list via fill yields 10,000 heap allocations of the integer object. Performing the multiplication in map requires 10,000 unboxings to perform multiplication and then 10,000 boxings to add the multiplication result into the new list. From this simple example, you can imagine how critical section arithmetic will be slowed down due to boxing or unboxing operations. As shown in Oracle's tutorial on boxing at https://docs.oracle.com/javase/tutorial/java/data/autoboxing.html, boxing in Java and also in Scala happens transparently. This means that without careful profiling or bytecode analysis, it is difficult to discern where you are paying the cost for object boxing. To ameliorate this problem, Scala provides a feature named specialization. Specialization refers to the compile-time process of generating duplicate versions of a generic trait or class that refer directly to a primitive type instead of the associated object wrapper. At runtime, the compiler-generated version of the generic class, or as it is commonly referred to, the specialized version of the class, is instantiated. This process eliminates the runtime cost of boxing primitives, which means that you can define generic abstractions while retaining the performance of a handwritten, specialized implementation. Bytecode representation Let's look at a concrete example to better understand how the specialization process works. Consider a naive, generic representation of the number of shares purchased, as follows: case class ShareCount[T](value: T) For this example, let's assume that the intended usage is to swap between an integer or long representation of ShareCount. With this definition, instantiating a long-based ShareCount instance incurs the cost of boxing, as follows: def newShareCount(l: Long): ShareCount[Long] = ShareCount(l) This definition translates to the following bytecode: public highperfscala.specialization.Specialization$ShareCount<java.lang.Object> newShareCount(long); Code: 0: new #21 // class orderbook/Specialization$ShareCount 3: dup 4: lload_1 5: invokestatic #27 // Method scala/runtime/BoxesRunTime.boxToLong:(J)Ljava/lang/Long; 8: invokespecial #30 // Method orderbook/Specialization$ShareCount."<init>":(Ljava/lang/Object;)V 11: areturn In the preceding bytecode, it is clear in the 5 instruction that the primitive long value is boxed before instantiating the ShareCount instance. By introducing the @specialized annotation, we are able to eliminate the boxing by having the compiler provide an implementation of ShareCount that works with primitive long values. It is possible to specify which types you wish to specialize by supplying a set of types. 
As defined in the Specializables trait (http://www.scala-lang.org/api/current/index.html#scala.Specializable), you are able to specialize for all JVM primitives, such as Unit and AnyRef. For our example, let's specialize ShareCount for integers and longs, as follows: case class ShareCount[@specialized(Long, Int) T](value: T) With this definition, the bytecode now becomes the following: public highperfscala.specialization.Specialization$ShareCount<java.lang.Object> newShareCount(long); Code: 0: new #21 // class highperfscala.specialization/Specialization$ShareCount$mcJ$sp 3: dup 4: lload_1 5: invokespecial #24 // Method highperfscala.specialization/Specialization$ShareCount$mcJ$sp."<init>":(J)V 8: areturn The boxing disappears and is curiously replaced with a different class name, ShareCount $mcJ$sp. This is because we are invoking the compiler-generated version of ShareCount that is specialized for long values. By inspecting the output of javap, we see that the specialized class generated by the compiler is a subclass of ShareCount: public class highperfscala.specialization.Specialization$ShareCount$mcI$sp extends highperfscala.specialization.Specialization$ShareCount<java .lang.Object> Bear this specialization implementation detail in mind as we turn to the Performance considerations section. The use of inheritance forces tradeoffs to be made in more complex use cases. Performance considerations At first glance, specialization appears to be a simple panacea for JVM boxing. However, there are several caveats to consider when using specialization. A liberal use of specialization leads to significant increases in compile time and resulting code size. Consider specializing Function3, which accepts three arguments as input and produces one result. To specialize four arguments across all types (that is, Byte, Short, Int, Long, Char, Float, Double, Boolean, Unit, and AnyRef) yields 10^4 or 10,000 possible permutations. For this reason, the standard library conserves application of specialization. In your own use cases, consider carefully which types you wish to specialize. If we specialize Function3 only for Int and Long, the number of generated classes shrinks to 2^4 or 16. Specialization involving inheritance requires extra attention because it is trivial to lose specialization when extending a generic class. Consider the following example: class ParentFoo[@specialized T](t: T) class ChildFoo[T](t: T) extends ParentFoo[T](t) def newChildFoo(i: Int): ChildFoo[Int] = new ChildFoo[Int](i) In this scenario, you likely expect that ChildFoo is defined with a primitive integer. However, as ChildFoo does not mark its type with the @specialized annotation, zero specialized classes are created. Here is the bytecode to prove it: public highperfscala.specialization.Inheritance$ChildFoo<java.lang.Object> newChildFoo(int); Code: 0: new #16 // class highperfscala/specialization/Inheritance$ChildFoo 3: dup 4: iload_1 5: invokestatic #22 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer; 8: invokespecial #25 // Method highperfscala/specialization/Inheritance$ChildFoo."<init>":(Ljava/lang/Object;)V 11: areturn The next logical step is to add the @specialized annotation to the definition of ChildFoo. In doing so, we stumble across a scenario where the compiler warns about use of specialization, as follows: class ParentFoo must be a trait. 
Specialized version of class ChildFoo will inherit generic highperfscala.specialization.Inheritance.ParentFoo[Boolean]
class ChildFoo[@specialized T](t: T) extends ParentFoo[T](t)

The compiler indicates that you have created a diamond inheritance problem, where the specialized versions of ChildFoo extend both ChildFoo and the associated specialized version of ParentFoo. This issue can be resolved by modeling the problem with a trait, as follows:

trait ParentBar[@specialized T] {
  def t(): T
}

class ChildBar[@specialized T](val t: T) extends ParentBar[T]

def newChildBar(i: Int): ChildBar[Int] = new ChildBar(i)

This definition compiles using a specialized version of ChildBar, as we originally hoped for, as seen in the following code:

public highperfscala.specialization.Inheritance$ChildBar<java.lang.Object> newChildBar(int);
  Code:
    0: new #32 // class highperfscala/specialization/Inheritance$ChildBar$mcI$sp
    3: dup
    4: iload_1
    5: invokespecial #35 // Method highperfscala/specialization/Inheritance$ChildBar$mcI$sp."<init>":(I)V
    8: areturn

An analogous and equally error-prone scenario is when a generic function is defined around a specialized type. Consider the following definition:

class Foo[T](t: T)

object Foo {
  def create[T](t: T): Foo[T] = new Foo(t)
}

def boxed: Foo[Int] = Foo.create(1)

Here, the definition of create is analogous to the child class from the inheritance example. Instances of Foo wrapping a primitive that are instantiated from the create method will be boxed. The following bytecode demonstrates how boxed leads to heap allocations:

public highperfscala.specialization.MethodReturnTypes$Foo<java.lang.Object> boxed();
  Code:
    0: getstatic #19 // Field highperfscala/specialization/MethodReturnTypes$Foo$.MODULE$:Lhighperfscala/specialization/MethodReturnTypes$Foo$;
    3: iconst_1
    4: invokestatic #25 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
    7: invokevirtual #29 // Method highperfscala/specialization/MethodReturnTypes$Foo$.create:(Ljava/lang/Object;)Lhighperfscala/specialization/MethodReturnTypes$Foo;
    10: areturn

The solution is to apply the @specialized annotation at the call site, as follows:

def createSpecialized[@specialized T](t: T): Foo[T] = new Foo(t)

One final interesting scenario is when specialization is used with multiple types and one of the types extends AnyRef or is a value class. To illustrate this scenario, consider the following example:

case class ShareCount(value: Int) extends AnyVal
case class ExecutionCount(value: Int)

class Container2[@specialized X, @specialized Y](x: X, y: Y)

def shareCount = new Container2(ShareCount(1), 1)
def executionCount = new Container2(ExecutionCount(1), 1)
def ints = new Container2(1, 1)

In this example, which methods do you expect to box the second argument to Container2? For brevity, we omit the bytecode, but you can easily inspect it yourself. As it turns out, shareCount and executionCount box the integer. The compiler does not generate a specialized version of Container2 that accepts a primitive integer and a value extending AnyVal (for example, ExecutionCount). The shareCount variable also causes boxing due to the order in which the compiler removes the value class type information from the source code. In both scenarios, the workaround is to define a case class that is specific to a set of types (for example, ShareCount and Int), as sketched below.
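The following is a minimal sketch of that workaround under the assumptions above; the container name (ShareCountContainer) is illustrative and not part of the original example:

// A container written against concrete types instead of type parameters,
// so there is nothing left for the compiler to box.
case class ShareCount(value: Int) extends AnyVal
case class ShareCountContainer(shares: ShareCount, count: Int)

def shareCount = ShareCountContainer(ShareCount(1), 1)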
Removing the generics allows the compiler to select the primitive types. The conclusion to draw from these examples is that specialization requires extra focus to be used throughout an application without boxing. As the compiler is unable to infer scenarios where you accidentally forgot to apply the @specialized annotation, it fails to raise a warning. This places the onus on you to be vigilant about profiling and inspecting bytecode to detect scenarios where specialization is incidentally dropped. To combat some of the shortcomings that specialization brings, there is a compiler plugin under active development, named miniboxing, at http://scala-miniboxing.org/. This compiler plugin applies a different strategy that involves encoding all primitive types into a long value and carrying metadata to recall the original type. For example, boolean can be represented in long using a single bit to signal true or false. With this approach, performance is qualitatively similar to specialization while producing orders of magnitude for fewer classes for large permutations. Additionally, miniboxing is able to more robustly handle inheritance scenarios and can warn when boxing will occur. While the implementations of specialization and miniboxing differ, the end user usage is quite similar. Like specialization, you must add appropriate annotations to activate the miniboxing plugin. To learn more about the plugin, you can view the tutorials on the miniboxing project site. The extra focus to ensure specialization produces heap-allocation free code is worthwhile because of the performance wins in performance-sensitive code. To drive home the value of specialization, consider the following microbenchmark that computes the cost of a trade by multiplying share count with execution price. For simplicity, primitive types are used directly instead of value classes. Of course, in production code this would never happen: @BenchmarkMode(Array(Throughput)) @OutputTimeUnit(TimeUnit.SECONDS) @Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS) @Measurement(iterations = 30, time = 10, timeUnit = TimeUnit.SECONDS) @Fork(value = 1, warmups = 1, jvmArgs = Array("-Xms1G", "-Xmx1G")) class SpecializationBenchmark { @Benchmark def specialized(): Double = specializedExecution.shareCount.toDouble * specializedExecution.price @Benchmark def boxed(): Double = boxedExecution.shareCount.toDouble * boxedExecution.price } object SpecializationBenchmark { class SpecializedExecution[@specialized(Int) T1, @specialized(Double) T2]( val shareCount: Long, val price: Double) class BoxingExecution[T1, T2](val shareCount: T1, val price: T2) val specializedExecution: SpecializedExecution[Int, Double] = new SpecializedExecution(10l, 2d) val boxedExecution: BoxingExecution[Long, Double] = new BoxingExecution(10l, 2d) } In this benchmark, two versions of a generic execution class are defined. SpecializedExecution incurs zero boxing when computing the total cost because of specialization, while BoxingExecution requires object boxing and unboxing to perform the arithmetic. The microbenchmark is invoked with the following parameterization: sbt 'project chapter3' 'jmh:run SpecializationBenchmark -foe true' We configure this JMH benchmark via annotations that are placed at the class level in the code. Annotations have the advantage of setting proper defaults for your benchmark, and simplifying the command-line invocation. It is still possible to override the values in the annotation with command-line arguments. 
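For example, a hedged sketch of such a command-line override (the flag values shown here are illustrative, not the book's settings):

sbt 'project chapter3' 'jmh:run SpecializationBenchmark -foe true -i 10 -wi 3 -f 1'
# -i  overrides the number of measurement iterations
# -wi overrides the number of warmup iterations
# -f  overrides the fork count

These are standard JMH options; any value passed on the command line takes precedence over the corresponding annotation.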
We use the -foe command-line argument to enable failure on error because there is no annotation to control this behavior. In the rest of this book, we will parameterize JMH with annotations and omit the annotations in the code samples because we always use the same values. The results are summarized in the following table:

Benchmark     Throughput (ops per second)   Error as percentage of throughput
boxed         251,534,293.11                ±2.23
specialized   302,371,879.84                ±0.87

This microbenchmark indicates that the specialized implementation yields approximately 20% higher throughput. By eliminating boxing in a critical section of the code, judicious usage of specialization avoids a cost that, as noted earlier, can slow down arithmetic-heavy code by an order of magnitude or more. For performance-sensitive arithmetic, this benchmark provides justification for the extra effort that is required to ensure that specialization is applied properly.

Summary

This article talked about different Scala constructs and features and explained how they get compiled to bytecode.

Resources for Article:

Further resources on this subject:
Differences in style between Java and Scala code [article]
Integrating Scala, Groovy, and Flex Development with Apache Maven [article]
Cluster Computing Using Scala [article]

Cluster Basics and Installation On CentOS 7

Packt
01 Feb 2016
8 min read
In this article by Gabriel A. Canepa, author of the book CentOS High Performance, we will review the basic principles of clustering and show you, step by step, how to set up two CentOS 7 servers as nodes to later use them as members of a cluster. (For more resources related to this topic, see here.) As part of this process, we will install CentOS 7 from scratch in a brand new server as our first cluster member, along with the necessary packages, and finally, configure key-based authentication for SSH access from one node to the other. Clustering fundamentals In computing, a cluster consists of a group of computers (which are referred to as nodes or members) that work together so that the set is seen as a single system from the outside. One typical cluster setup involves assigning a different task to each node, thus achieving a higher performance than if several tasks were performed by a single member on its own. Another classic use of clustering is helping to ensure high availability by providing failover capabilities to the set, where one node may automatically replace a failed member to minimize the downtime of one or several critical services. In either case, the concept of clustering implies not only taking advantage of the computing functionality of each member alone, but also maximizing it by complementing it with the others. As we just mentioned, HA (High-availability) clusters aim to eliminate system downtime by failing services from one node to another in case one of them experiences an issue that renders it inoperative. As opposed to switchover, which requires human intervention, a failover procedure is performed automatically by the cluster without any downtime. In other words, this operation is transparent to end users and clients from outside the cluster. On the other hand, HP (High-performance) clusters use their nodes to perform operations in parallel in order to enhance the performance of one or more applications. High-performance clusters are typically seen in scenarios involving applications that use large collections of data. Why CentOS? Just as the saying goes, Every journey begins with a small step, we will begin our own journey toward clustering by setting up the separate nodes that will make up our system. Our choice of operating system is Linux and CentOS, version 7, as the distribution, that being the latest available release of CentOS as of today. The binary compatibility with Red Hat Enterprise Linux © (which is one of the most well-used distributions in enterprise and scientific environments) along with its well-proven stability are the reasons behind this decision. CentOS 7 along with its previous versions of the distribution are available for download, free of charge, from the project's website at http://www.centos.org/. In addition, specific details about the release can always be consulted in the CentOS wiki, http://wiki.centos.org/Manuals/ReleaseNotes/CentOS7. Among the distinguishing features of CentOS 7, I would like to name the following: It includes systemd as the central system management and configuration utility It uses XFS as the default filesystem It only supports the x86_64 architecture Downloading CentOS To download CentOS, go to http://www.centos.org/download/ and click on one of the three options outlined in the following figure: Download options for CentOS 7 These options are detailed as follows: DVD ISO (~4 GB) is an .iso file that can be burned into regular DVD optical media and includes the common tools. 
Download this file if you have immediate access to a reliable Internet connection that you can use to download other packages and utilities. Everything ISO (~7 GB) is an .iso file with the complete set of packages that are made available in the base repository of CentOS 7. Download this file if you do not have access to a reliable Internet connection or if your plan contemplates the possibility of installing or populating a local or network mirror. The alternative downloads link will take you to a public directory within an official nearby CentOS mirror, where the previous options are available as well as others, including different choices of desktop versions (GNOME or KDE) and the minimal .iso file (~570 MB), which contains the bare bone packages of the distribution. As the minimal install is sufficient for our purpose at hand, we can install other needed packages using yum later, that is, the recommended .iso file to download. CentOS-7.X-YYMM-x86_64-Minimal.iso Here, X indicates the current update number of CentOS 7 and YYMM represent the year and month, both in two-digit notation, when the source code this version is based on was released. CentOS-7.0-1406-x86_64-Minimal.iso This tells us the source code this release is based on dates from the month of June, 2014. Independently of our preferred download method, we will need this .iso file in order to begin with the installation. In addition, feel free to burn it to optical media or a USB drive. Setting up CentOS 7 nodes If you do not have dedicated hardware that you can use to set up the nodes of your cluster, you can still create one using virtual machines over some virtualization software, such as Oracle Virtualbox © or VMware ©, for example. The following setup is going to be performed on a Virtualbox VM with 1 GB of RAM and 30 GB of disk space. We will use the default partitioning schema over LVM as suggested by the installation process. Installing CentOS 7 The splash screen shown in the following screenshot is the first step in the installation process. Highlight Install CentOS 7 using the up and down arrows and press Enter: Splash screen before starting the installation of CentOS 7 Select English (or your preferred installation language) and click on Continue, as shown in the following screenshot: Selecting the language for the installation of CentOS 7 In the following screenshot, you can choose a keyboard layout, set the current date and time, choose a partitioning method, connect the main network interface, and assign a unique hostname for the node. We will name the current node node01 and leave the rest of the settings as default (we will configure the extra network card later). Then, click on Begin installation: Configure keyboard layout, date and time, network and hostname, and partitioning schema While the installation continues in the background, we will be prompted to set the password for the root account and create an administrative user for the node. Once these steps have been confirmed, the corresponding warnings no longer appear, as shown in the following screenshot: Setting the password for root and creating an administrative user account When the process is completed, click on Finish configuration and the installation will finish configuring the system and devices. When the system is ready to boot on its own, you will be prompted to do so. Remove the installation media and click on Reboot. Now, we can proceed with setting up our network interfaces. 
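Before configuring the network, it can be worth running a quick post-install check on the freshly booted node. The following is a minimal sketch; the extra packages shown are optional conveniences and not required by the setup described here:

hostnamectl                     # confirm the hostname assigned during installation (node01)
yum -y update                   # apply the latest updates from the base repositories
yum -y install net-tools vim    # optional helpers; the minimal install lets you add anything else with yum later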
Setting up the network infrastructure

Our rather basic network infrastructure consists of 2 CentOS 7 boxes, with the node01 [192.168.0.2] and node02 [192.168.0.3] host names, respectively, and a gateway router called simply gateway [192.168.0.1]. In CentOS, network cards are configured using scripts in the /etc/sysconfig/network-scripts directory. This is the minimum content that is needed in /etc/sysconfig/network-scripts/ifcfg-enp0s3 for our purposes:

HWADDR="08:00:27:C8:C2:BE"
TYPE="Ethernet"
BOOTPROTO="static"
NAME="enp0s3"
ONBOOT="yes"
IPADDR="192.168.0.2"
NETMASK="255.255.255.0"
GATEWAY="192.168.0.1"
PEERDNS="yes"
DNS1="8.8.8.8"
DNS2="8.8.4.4"

Note that the UUID (if present) and HWADDR values will be different in your case. In addition, be aware that cluster machines need to be assigned a static IP address—never leave that up to DHCP! In the preceding configuration file, we used Google's DNS, but if you wish, feel free to use another DNS. When you're done making changes, save the file and restart the network service in order to apply them:

systemctl restart network.service # Restart the network service

You can verify that the previous changes have taken effect (shown in the Restarting the network service and verifying settings figure) with the following two commands:

systemctl status network.service # Display the status of the network service
ip addr | grep 'inet '           # Display the IP addresses

Restarting the network service and verifying settings

You can disregard all error messages related to the loopback interface, as shown in the preceding screenshot. However, you will need to examine carefully any error messages related to the enp0s3 interface, if any, and get them resolved in order to proceed further. The second interface will be called enp0sX, where X is typically 8. You can verify this with the following command (shown in the following figure):

ip link show

Displaying NIC information

As for the configuration file of enp0s8, you can safely create it by copying the contents of ifcfg-enp0s3. Do not forget, however, to change the hardware (MAC) address as returned by the information on the NIC, and leave the IP address field blank for now:

ip link show enp0s8
cp /etc/sysconfig/network-scripts/ifcfg-enp0s3 /etc/sysconfig/network-scripts/ifcfg-enp0s8

Then, restart the network service. Note that you will also need to set up at least a basic DNS resolution method. Considering that we will set up a cluster with 2 nodes only, we will use /etc/hosts for this purpose. Edit /etc/hosts with the following content:

192.168.0.2 node01
192.168.0.3 node02
192.168.0.1 gateway

Summary

In this article, we reviewed how to install the operating system and listed the necessary software components to implement the basic cluster functionality.

Resources for Article:

Further resources on this subject:
CentOS 7's new firewalld service [article]
Mastering CentOS 7 Linux Server [article]
Resource Manager on CentOS 6 [article]

Distributed resource scheduler

Packt
07 Jan 2016
14 min read
In this article written by Christian Stankowic, author of the book vSphere High Performance Essentials, we take a look at the Distributed Resource Scheduler (DRS). In cluster setups, DRS can assist you with automatically balancing CPU and storage load (Storage DRS). DRS monitors the ESXi hosts in a cluster and migrates the running VMs using vMotion, primarily to ensure that all the VMs get the resources they need. Secondarily, it tries to balance the cluster. In addition to this, Storage DRS monitors the shared storage for information about latency and capacity consumption. In this case, Storage DRS recognizes the potential to optimize storage resources; it will make use of Storage vMotion to balance the load. We will cover Storage DRS in detail later. (For more resources related to this topic, see here.)

Working of DRS

DRS primarily uses two metrics to determine the cluster balance:

Active host CPU: This includes the usage (CPU task time in ms) and ready (wait time in ms for VMs to get scheduled on physical cores) metrics.
Active host memory: This describes the amount of memory pages that are predicted to have changed in the last 20 seconds. A math-sampling algorithm calculates this amount; however, it is quite inaccurate.

Active host memory is often used for resource capacity purposes. Be careful with using this value as an indicator, as it only describes how aggressively a workload changes the memory. Depending on your application architecture, it may not measure how much memory a particular VM really needs. Think about applications that allocate a lot of memory for the purpose of caching. Using the active host memory metric for capacity purposes might lead to inappropriate settings.

The migration threshold controls DRS's aggressiveness and defines how much a cluster can be imbalanced. Refer to the following table for a detailed explanation:

DRS level            Priorities   Effect
Most conservative    1            Only affinity/anti-affinity constraints are applied
More conservative    1–2          This will also apply recommendations addressing significant improvements
Balanced (default)   1–3          Recommendations that, at least, promise good improvements are applied
More aggressive      1–4          DRS applies recommendations that only promise a moderate improvement
Most aggressive      1–5          Recommendations only addressing smaller improvements are applied

Apart from the migration threshold, two other metrics—Target Host Load Standard Deviation (THLSD) and Current Host Load Standard Deviation (CHLSD)—are calculated. THLSD defines how much a cluster node's load can differ from others in order to still be considered balanced. The migration threshold and the particular ESXi host's active CPU and memory values heavily influence this metric. CHLSD calculates whether the cluster is currently balanced. If this value differs from the THLSD, the cluster is imbalanced and DRS will calculate recommendations in order to balance it. In addition to this, DRS also calculates the vMotion overhead that is needed for the migration. If a migration's overhead is deemed higher than the benefit, vMotion will not be executed. DRS also evaluates the migration recommendations multiple times in order to avoid ping-pong migrations.

By default, once enabled, DRS runs every five minutes (300 seconds). Depending on your landscape, it might be required to change this behavior. To do so, you need to alter the vpxd.cfg configuration file on the vCenter Server machine.
Search for the following lines and alter the period (in seconds):

<config>
  <drm>
    <pollPeriodSec>
      300
    </pollPeriodSec>
  </drm>
</config>

Refer to the following table for the configuration file location, depending on your vCenter implementation:

vCenter Server type        File location
vCenter Server Appliance   /etc/vmware-vpx/vpxd.cfg
vCenter Server             C:\ProgramData\VMware\VMware VirtualCenter\vpxd.cfg

Check-list – performance tuning

There are a couple of things to be considered when optimizing DRS for high-performance setups, as shown in the following:

Make sure to use hosts with homogeneous CPU and memory configurations. Having different nodes will make DRS less effective.
Use at least a 1 Gbps network connection for vMotion. For better performance, it is recommended to use 10 Gbps instead.
For virtual machines, it is a common procedure not to oversize them. Only configure as many CPU and memory resources as the workload needs. Migrating workloads with unneeded resources takes more time.
Make sure not to exceed the ESXi host and cluster limits that are mentioned in the VMware vSphere Configuration Maximums document. For vSphere 5.5, refer to https://www.vmware.com/pdf/vsphere5/r55/vsphere-55-configuration-maximums.pdf. For vSphere 6.0, refer to https://www.vmware.com/pdf/vsphere6/r60/vsphere-60-configuration-maximums.pdf.

Configure DRS

To configure DRS for your cluster, proceed with the following steps:

1. Select your cluster from the inventory tab and click Manage and Settings.
2. Under Services, select vSphere DRS. Click Edit.
3. Select whether DRS should act in the Partially Automated or Fully Automated mode. In partially automated mode, DRS will place VMs on appropriate hosts once powered on; however, it will not migrate the running workloads. In fully automated mode, DRS will also migrate the running workloads in order to balance the cluster load. The Manual mode only gives you recommendations, and the administrator can select the recommendations to apply. To create resource pools at the cluster level, you will need to have at least the manual mode enabled.
4. Select the DRS aggressiveness. Refer to the preceding table for a short explanation. Using more aggressive DRS levels is only recommended when you have homogeneous CPU and memory setups!

When creating VMware support calls regarding DRS issues, a DRS dump file called drmdump is important. This file contains various metrics that DRS uses to calculate the possible migration benefits. On the vCenter Server Appliance, this file is located in /var/log/vmware/vpx/drmdump/clusterName. On the Windows variant, the file is located in %ALLUSERSPROFILE%\VMware\VMware VirtualCenter\Logs\drmdump\clusterName.

VMware also offers an online tool called VM Resource and Availability Service (http://hasimulator.vmware.com), telling you which VMs can be restarted during ESXi host failures. It requires you to upload this metric file in order to give you the results. This can be helpful when simulating failure scenarios.

Enhanced vMotion Compatibility

Enhanced vMotion Compatibility (EVC) enables your cluster to migrate workloads between ESXi hosts with different processor generations. Unfortunately, it is not possible to migrate workloads between Intel-based and AMD-based servers; EVC only enables migrations across different Intel or AMD CPU generations. Once enabled, all the ESXi hosts are configured to provide the same set of CPU functions.
In other words, the functions of newer CPU generations are disabled to match those of the older ESXi hosts in the cluster in order to create a common baseline.

Configuring EVC

To enable EVC, perform the following steps:

1. Select the affected cluster from the inventory tab.
2. Click on Manage, Settings, VMware EVC, and Edit.
3. Choose Enable EVC for AMD Hosts or Enable EVC for Intel Hosts.
4. Select the appropriate CPU generation for the cluster (the oldest).
5. Make sure that Compatibility acknowledges your configuration. Save the changes, as follows:

As mixing older hosts in high-performance clusters is not recommended, you should also avoid using EVC.

To sum it up, keep the following points in mind when planning the use of DRS:

Enable DRS if you plan to have automatic load balancing; this is highly recommended for high-performance setups.
Adjust the DRS aggressiveness level to match your requirements. Overly aggressive migration thresholds may result in too many migrations; therefore, play with this setting to find the best fit for you.
Make sure to have a separate vMotion network. Using the same logical network components as for the VM traffic is not recommended and might result in poor workload performance.
Don't overload ESXi hosts; spare some CPU resources for vMotion processes in order to avoid performance bottlenecks during migrations.
In high-performance setups, mixing various CPU and memory configurations is not recommended if you want to achieve better performance. Try not to use EVC.
Also, keep license constraints in mind when configuring DRS. Some software products might require additional licenses if they run on multiple servers. We will focus on this later.

Affinity and anti-affinity rules

Sometimes, it is necessary to separate workloads or stick them together. To name some examples, think about classical multi-tier applications such as the following:

Frontend layer
Database layer
Backend layer

One possibility would be to separate the particular VMs onto multiple ESXi hosts to increase resilience. If a single ESXi host that is serving all the workloads crashes, all application components are affected by this fault. Moving all the participating application VMs to one single ESXi host can result in higher performance, as network traffic does not need to leave the ESXi host. However, there are more use cases for creating affinity and anti-affinity rules, as shown in the following:

Dividing production, development, and test workloads. For example, it would be possible to separate production from the development and test workloads. This is a common procedure that many application vendors require.
Licensing reasons (for example, a license bound to a USB dongle, per-core licensing, software assurance denying vMotion, and so on).
Application interoperability incompatibility (for example, applications need to run on separate hosts).

As VMware vSphere has no knowledge about the license conditions of the workloads running virtualized, it is very important to check your software vendor's license agreements. You, as a virtual infrastructure administrator, are responsible for ensuring that your software is fully licensed. Some software vendors require special licenses when running virtualized/on multiple hosts.

There are two kinds of affinity/anti-affinity rules: VM-Host (relationship between VMs and ESXi hosts) and VM-VM (intra-relationship between particular VMs). Each rule consists of at least one VM and host DRS group. These groups also contain at least one entry. (A short PowerCLI aside follows; the discussion of rule behavior continues right after it.)
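As an aside for readers who script their environments, the same kind of rule can also be created with VMware PowerCLI instead of the Web Client. The following is a hedged sketch: the cluster name is illustrative, while the VM and rule names match the upcoming example:

# VM-VM affinity rule keeping two VMs on the same host
Get-Cluster "Cluster01" |
  New-DrsRule -Name "db001-web001-bundle" -Enabled $true -KeepTogether $true `
    -VM (Get-VM "db001"), (Get-VM "web001")

With that aside noted, let's get back to how these rules behave.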
Every rule has a designation, where the administrator can choose between must or should. Implementing a rule with the should designation results in a preference on hosts satisfying all the configured rules. If no applicable host is found, the VM is put on another host in order to ensure at least the workload is running. If the must designation is selected, a VM is only running on hosts that are satisfying the configured rules. If no applicable host is found, the VM cannot be moved or started. This configuration approach is strict and requires excessive testing in order to avoid unplanned effects. DRS rules are rather combined than ranked. Therefore, if multiple rules are defined for a particular VM/host or VM/VM combination, the power-on process is only granted if all the rules apply to the requested action. If two rules are conflicting for a particular VM/host or VM/VM combination, the first rule is chosen and the other rule is automatically disabled. Especially, the use of the must rules should be evaluated very carefully as HA might not restart some workloads if these rules cannot be followed in case of a host crash. Configuring affinity/anti-affinity rules In this example, we will have a look at two use cases that affinity/anti-affinity rules can apply to. Example 1: VM-VM relationship This example consists of two VMs serving a two-tier application: db001 (database VM) and web001 (frontend VM). It is advisable to have both VMs running on the same physical host in order to reduce networking hops to connect the frontend server to its database. To configure the VM-VM affinity rule, proceed with the following steps: Select your cluster from the inventory tab and click Manage and VM/Host Rule underneath Configuration. Click Add. Enter a readable rule name (for example, db001-web001-bundle) and select Enable rule. Select the Keep Virtual Machines Together type and select the affected VMs. Click OK to save the rule, as shown in the following: When migrating one of the virtual machines using vMotion, the other VM will also migrate. Example 2: VM-Host relationship In this example, a VM (vcsa) is pinned to a particular ESXi host of a two-node cluster designated for production workloads. To configure the VM-Host affinity rule, proceed with the following steps: Select your cluster from the inventory tab and click Manage and VM/Host Groups underneath Configuration. Click Add. Enter a group name for the VM; make sure to select the VM Group type. Also, click Add to add the affected VM. Click Add once again. Enter a group name for the ESXi host; make sure to select the Host Group type. Later, click Add to add the ESXi host. Select VM/Host Rule underneath Configuration and click Add. Enter a readable rule name (for example, vcsa-to-esxi02) and select Enable rule. Select the Virtual Machines to Hosts type and select the previously created VM and host groups. Make sure to select Must run on hosts in group or Should run on hosts in group before clicking OK, as follows: Migrating the virtual machine to another host will fail with the following error message if Must run on hosts in group was selected earlier: Keep the following in mind when designing affinity and anti-affinity rules: Enable DRS. Double-check your software vendor's licensing agreements. Make sure to test your affinity/anti-affinity rules by simulating vMotion processes. Also, simulate host failures by using maintenance mode to ensure that your rules are working as expected. Note that the created rules also apply to HA and DPM. 
KISS – Keep it simple, stupid. Try to avoid utilizing too many or multiple rules for one VM/host combination.

Distributed power management

High-performance setups are often the opposite of efficient, green infrastructures; however, high-performing virtual infrastructure setups can be efficient as well. Distributed Power Management (DPM) can help you with reducing the power costs and consumption of your virtual infrastructure. It is part of DRS and monitors the CPU and memory usage of all workloads running in the cluster. If it is possible to run all VMs on fewer hosts, DPM will put one or more ESXi hosts in standby mode (they will be powered off) after migrating the VMs using vMotion.

DPM tries to keep the CPU and memory usage between 45% and 81% for all the cluster nodes by default. If this range is exceeded, the hosts will be powered on/off. Setting two advanced parameters can change this behaviour, as follows:

DemandCapacityRatioTarget: Utilization target for the ESXi hosts (default: 63%)
DemandCapacityRatioToleranceHost: Utilization range around the target utilization (default: 18%)

The range is calculated as follows: (DemandCapacityRatioTarget - DemandCapacityRatioToleranceHost) to (DemandCapacityRatioTarget + DemandCapacityRatioToleranceHost). With the default values, this gives us (63% - 18%) to (63% + 18%), that is, the 45% to 81% range mentioned earlier.

To control a server's power state, DPM makes use of these three protocols in the following order:

Intelligent Platform Management Interface (IPMI)
Hewlett Packard Integrated Lights-Out (HP iLO)
Wake-on-LAN (WoL)

To enable IPMI/HP iLO management, you will need to configure the Baseboard Management Controller (BMC) IP address and other access information. To configure them, follow the given steps:

1. Log in to vSphere Web Client and select the host that you want to configure for power management.
2. Click on Configuration and select the Power Management tab.
3. Select Properties and enter an IP address, MAC address, username, and password for the server's BMC. Note that entering hostnames will not work, as shown in the following:

To enable DPM for a cluster, perform the following steps:

1. Select the cluster from the inventory tab and select Manage.
2. From the Services tab, select vSphere DRS and click Edit.
3. Expand the Power Management tab and select Manual or Automatic. Also, select the threshold that DPM will use to make power decisions. The higher the value, the faster DPM will put the ESXi hosts in standby mode, as follows:

It is also possible to disable DPM for a particular host (for example, the strongest in your cluster). To do so, select the cluster and select Manage and Host Options. Check the host and click Edit. Make sure to select Disabled for the Power Management option.

Consider giving a thought to the following when planning to utilize DPM:

Make sure your servers have a supported BMC, such as HP iLO or IPMI.
Evaluate the right DPM threshold. Also, keep your servers' boot time (including firmware initialization) in mind and test your configuration before running in production.
Keep in mind that DPM also uses Active Memory and CPU usage for its decisions. Booting VMs might claim all of their memory without using many active memory pages. If hosts are powered down while plenty of VMs are booting, this might result in extensive swapping.

Summary

In this article, you learned how to implement affinity and anti-affinity rules. You have also learned how to save power while still meeting your workload requirements.
Resources for Article: Further resources on this subject: Monitoring and Troubleshooting Networking [article] Storage Scalability [article] Upgrading VMware Virtual Infrastructure Setups [article]

Swift Power and Performance

Packt
12 Oct 2015
14 min read
In this article by Kostiantyn Koval, author of the book Swift High Performance, we will learn about Swift, its performance and optimization, and how to achieve high performance. (For more resources related to this topic, see here.)

Swift speed

I would guess you are interested in Swift's speed and are probably wondering "How fast can Swift be?" Before we even start learning Swift and discovering all the good things about it, let's answer this right here and right now. Let's take an array of 100,000 random numbers, sort it in Swift, Objective-C, and C by using a standard sort function from stdlib (sort for Swift, qsort for C, and compare for Objective-C), and measure how much time it takes. In order to sort an array with 100,000 integer elements, the following are the timings:

Swift         0.00600 sec
C             0.01396 sec
Objective-C   0.08705 sec

The winner is Swift! Swift is 14.5 times faster than Objective-C and 2.3 times faster than C. In other examples and experiments, C is usually faster than Objective-C, and Swift is way faster.

Comparing the speed of functions

You know how functions and methods are implemented and how they work. Let's compare the performance and speed of global functions and different method types. For our test, we will use a simple add function. Take a look at the following code snippet:

func add(x: Int, y: Int) -> Int {
  return x + y
}

class NumOperation {
  func addI(x: Int, y: Int) -> Int        // implementations omitted for brevity
  class func addC(x: Int, y: Int) -> Int
  static func addS(x: Int, y: Int) -> Int
}

class BigNumOperation: NumOperation {
  override func addI(x: Int, y: Int) -> Int
  override class func addC(x: Int, y: Int) -> Int
}

For the measurement and code analysis, we use a simple loop in which we call those different methods:

measure("addC") {
  var result = 0
  for i in 0...2000000000 {
    result += NumOperation.addC(i, y: i + 1) // result += test different method
  }
  print(result)
}

Here are the results. All the methods perform exactly the same. Even more so, their assembly code looks exactly the same, except for the name of the function call:

Global function: add(10, y: 11)
Static: NumOperation.addS(10, y: 11)
Class: NumOperation.addC(10, y: 11)
Static subclass: BigNumOperation.addS(10, y: 11)
Overridden subclass: BigNumOperation.addC(10, y: 11)

Even though the BigNumOperation addC class function overrides the NumOperation addC function, when you call it directly there is no need for a vtable lookup. The instance method call looks a bit different:

Instance:
let num = NumOperation()
num.addI(10, y: 11)

Subclass overridden instance:
let bigNum = BigNumOperation()
bigNum.addI(10, y: 11)

One difference is that we need to initialize a class and create an instance of the object. In our example, this is not so expensive an operation because we do it outside the loop and it takes place only once. The loop calling the instance method looks exactly the same. As you can see, there is almost no difference between the global function and the static and class methods. The instance method looks a bit different, but it doesn't have any major impact on performance. Also, even though this is true for simple use cases, there is a difference between them in more complex examples. Let's take a look at the following code snippet:

let baseNumType = arc4random_uniform(2) == 1 ?
  BigNumOperation.self : NumOperation.self

for i in 0...loopCount {
  result += baseNumType.addC(i, y: i + 1)
}
print(result)

The only difference we incorporated here is that instead of specifying the NumOperation class type at compile time, we choose it randomly at runtime. (The measure helper used in these snippets is not part of the listing; a minimal sketch of it follows.)
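This is an assumed implementation of such a timing helper, not the book's own; any helper that runs a closure and reports the elapsed time would do:

import Foundation

func measure(_ label: String, _ block: () -> Void) {
    let start = CFAbsoluteTimeGetCurrent()   // wall-clock timestamp before the work
    block()
    let duration = CFAbsoluteTimeGetCurrent() - start
    print("\(label): \(duration) sec")
}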
Because the type is only resolved at runtime, the Swift compiler doesn't know which method should be called at compile time—BigNumOperation.addC or NumOperation.addC. This small change has an impact on the generated assembly code and performance.

A summary of the usage of functions and methods

Global functions are the simplest and give the best performance. Too many global functions, however, make the code hard to read and reason about.
Static type methods, which can't be overridden, have the same performance as global functions, but they also provide a namespace (their type name), so our code looks clearer and there is no adverse effect on performance.
Class methods, which can be overridden, could lead to a decrease in performance, and they should be used when you need class inheritance. In other cases, static methods are preferred.
The instance method operates on an instance of the object. Use instance methods when you need to operate on the data of that instance.
Make methods final when you don't need to override them. This gives the compiler an extra hint for optimization, and performance could be increased because of it.

Intelligent code

Because Swift is a static and strongly typed language, it can read, understand, and optimize code very well. It tries to avoid the execution of all unnecessary code. For a better explanation, let's take a look at this simple example:

class Object {
  func nothing() {
  }
}

let object = Object()
object.nothing()
object.nothing()

We create an instance of the Object class and call a nothing method. The nothing method is empty, and calling it does nothing. The Swift compiler understands this and removes those method calls. After this, we have only one line of code:

let object = Object()

The Swift compiler can also remove objects that are created but never used. This reduces memory usage and unnecessary function calls, which also reduces CPU usage. In our example, the object instance is not used after removing the nothing method calls, so the creation of object can be removed as well. In this way, Swift removes all three lines of code and we end up with no code to execute at all. Objective-C, in comparison, can't do this optimization. Because it has a dynamic runtime, the nothing method's implementation can be changed to do some work at runtime. That's why Objective-C can't remove empty method calls.

This optimization might not seem like a big win, but let's take a look at another—a bit more complex—example that uses more memory:

class Object {
  let x: Int
  let y: Int
  let z: Int

  init(x: Int) {
    self.x = x
    self.y = x * 2
    self.z = y * 2
  }

  func nothing() {
  }
}

We have added some Int data to our Object class to increase memory usage. Now, the Object instance would use at least 24 bytes (3 * Int size; Int uses 8 bytes on the 64-bit architecture). Let's also try to increase the CPU usage by adding more instructions, using a loop:

for i in 0...1_000_000 {
  let object = Object(x: i)
  object.nothing()
  object.nothing()
}
print("Done")

Integer literals can use the underscore sign (_) to improve readability. So, 1_000_000_000 is the same as 1000000000.

Now, we have 3 million method calls and we would use 24 million bytes (about 24 MB). This is quite a lot for an operation that actually doesn't do anything. As you can see, we don't use the result of the loop body. For the loop body, Swift does the same optimization as in the previous example and we end up with an empty loop:

for i in 0...1_000_000 {
}

The empty loop can be skipped as well.
As a result, we have saved 24 MB of memory usage and 3 million method calls. Dangerous functions There are some functions and instructions that sometimes don't provide any value for the application but the Swift compiler can't skip them because that could have a very negative impact on performance. Console print Printing a statement to the console is usually used for debugging purposes. The print and debugPrint instructions aren't removed from the application in release mode. Let's explore this code: for i in 0...1_000_000 { print(i) } The Swift compiler treats print and debugPrint as valid and important instructions that can't be skipped. Even though this code does nothing, it can't be optimized, because Swift doesn't remove the print statement. As a result, we have 1 million unnecessary instructions. As you can see, even very simple code that uses the print statement could decrease an application's performance very drastically. The loop with the 1_000_000 print statement takes 5 seconds, and that's a lot. It's even worse if you run it in Xcode; it would take up to 50 seconds. It gets all the more worse if you add a print instruction to the nothing method of an Object class from the previous example: func nothing() { print(x + y + z) } In that case, a loop in which we create an instance of Object and call nothing can't be eliminated because of the print instruction. Even though Swift can't eliminate the execution of that code completely, it does optimization by removing the creation instance of Object and calling the nothing method, and turns it into simple loop operation. The compiled code after optimization looks like this: // Initial Source Code for i in 0...1_000 { let object = Object(x: i) object.nothing() object.nothing() } // Optimized Code var x = 0, y = 0, z = 0 for i in 0...1_000_000 { x = i y = x * 2 z = y * 2 print(x + y + z) print(x + y + z) } As you can see, this code is far from perfect and has a lot of instructions that actually don't give us any value. There is a way to improve this code, so the Swift compiler does the same optimization as without print. Removing print logs To solve this performance problem, we have to remove the print statements from the code before compiling it. There are different ways of doing this. Comment out The first idea is to comment out all print statements of the code in release mode: //print("A") This will work but the next time when you want to enable logs, you will need to uncomment that code. This is a very bad and painful practice. But there is a better solution to it. Commented code is bad practice in general. You should be using a source code version control system, such as Git, instead. In this way, you can safely remove the unnecessary code and find it in the history if you need it later. Using a build configuration We can enable print only in debug mode. To do this, we will use a build configuration to conditionally exclude some code. First, we need to add a Swift compiler custom flag. To do this, select a project target and then go to Build Settings | Other Swift Flags. In the Swift Compiler - Custom Flags section and add the –D DEBUG flag for debug mode, like this: After this, you can use the DEBUG configuration flag to enable code only in debug mode. We will define our own print function. It will generate a print statement only in debug mode. 
In release mode, this function will be empty, and the Swift compiler will successfully eliminate it:

    func D_print(items: Any..., separator: String = " ", terminator: String = "\n") {
        #if DEBUG
            print(items, separator: separator, terminator: terminator)
        #endif
    }

Now, everywhere instead of print, we will use D_print:

    func nothing() {
        D_print(x + y + z)
    }

You can also create a similar D_debugPrint function. Swift is very smart and does a lot of optimization, but we also have to make our code clear for people to read and for the compiler to optimize. Using a preprocessor adds complexity to your code. Use it wisely and only in situations where normal if conditions won't work, for instance, in our D_print example.

Improving speed

There are a few techniques that can simply improve code performance. Let's proceed directly to the first one.

final

You can mark a function or property declaration with the final attribute. Adding the final attribute makes it non-overridable: subclasses can't override that method or property. When you make a method non-overridable, there is no need to store it in the vtable, and the call to that function can be performed directly, without any function address lookup in the vtable:

    class Animal {
        final var name: String = ""
        final func feed() {
        }
    }

As you have seen, final methods perform faster than non-final methods. Even such a small optimization can improve an application's performance. It not only improves performance but also makes the code more secure: you prevent a method from being overridden and prevent unexpected and incorrect behavior.

Enabling the Whole Module Optimization setting would achieve very similar optimization results, but it's better to mark a function or property declaration explicitly as final, which reduces the compiler's work and speeds up compilation. The compilation time for big projects with Whole Module Optimization could be up to 5 minutes in Xcode 7.

Inline functions

As you have seen, Swift can do optimization and inline some function calls. This way, there is no performance penalty for calling a function. You can manually enable or disable inlining with the @inline attribute:

    @inline(__always) func someFunc() {
    }

    @inline(never) func someFunc() {
    }

Even though you can manually control inline functions, it's usually better to leave this to the Swift compiler. Depending on the optimization settings, the Swift compiler applies different inlining techniques. The use case for @inline(__always) would be very simple one-line functions that you always want to be inlined.

Value objects and reference objects

There are many benefits of using immutable value types. Value objects make code not only safer and clearer but also faster. They have better speed and performance than reference objects; here is why.

Memory allocation

A value object can be allocated in the stack memory instead of the heap memory. Reference objects need to be allocated in the heap memory because they can be shared between many owners. Because value objects have only one owner, they can be allocated safely on the stack. Stack memory is way faster than heap memory.

The second advantage is that value objects don't need reference counting memory management. As they can have only one owner, there is no such thing as reference counting for value objects. With Automatic Reference Counting (ARC), we don't think much about memory management, and it mostly looks transparent to us.
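To make the comparison concrete, here is a minimal sketch (the Point types and their fields are invented for this illustration; they are not taken from the book's examples) of the same small model written once as a value type and once as a reference type:

    // Value type: copied on assignment, eligible for stack allocation,
    // and not tracked by reference counting.
    struct PointValue {
        var x: Int
        var y: Int
    }

    // Reference type: heap allocated and shared by reference,
    // so ARC has to track how many owners it has.
    final class PointReference {
        var x: Int
        var y: Int
        init(x: Int, y: Int) {
            self.x = x
            self.y = y
        }
    }

    let a = PointValue(x: 1, y: 2)
    var b = a           // b is an independent copy
    b.x = 10            // a.x is still 1

    let c = PointReference(x: 1, y: 2)
    let d = c           // c and d refer to the same instance; ARC retains it
    d.x = 10            // c.x is now 10 as well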
Even though code looks the same when using reference objects and value objects, ARC adds extra retain and release method calls for reference objects. Avoiding Objective-C In most cases, Objective-C, with its dynamic runtime, performs slower than Swift. The interoperability between Swift and Objective-C is done so seamlessly that sometimes we may use Objective-C types and its runtime in the Swift code without knowing it. When you use Objective-C types in Swift code, Swift actually uses the Objective-C runtime for method dispatch. Because of that, Swift can't do the same optimization as for pure Swift types. Lets take a look at a simple example: for _ in 0...100 { _ = NSObject() } Let's read this code and make some assumptions about how the Swift compiler would optimize it. The NSObject instance is never used in the loop body, so we could eliminate the creation of an object. After that, we will have an empty loop; this can be eliminated as well. So, we remove all of the code from execution, but actually no code gets eliminated. This happens because Objective-C types use dynamic runtime method dispatch, called message sending. All standard frameworks, such as Foundation and UIKit, are written in Objective-C, and all types such as NSDate, NSURL, UIView, and UITableView use the Objective-C runtime. They do not perform as fast as Swift types, but we get all of these frameworks available for usage in Swift, and this is great. There is no way to remove the Objective-C dynamic runtime dispatch from Objective-C types in Swift, so the only thing we can do is learn how to use them wisely. Summary In this article, we covered many powerful features of Swift related to Swift's performance and gave some tips on how to solve performance-related issues. Resources for Article: Further resources on this subject: Flappy Swift[article] Profiling an app[article] Network Development with Swift [article]

Performance by Design

Packt
15 Sep 2015
9 min read
In this article by Shantanu Kumar, author of the book, Clojure High Performance Programming - Second Edition, we learn how Clojure is a safe, functional programming language that brings great power and simplicity to the user. Clojure is also dynamically and strongly typed, and has very good performance characteristics. Naturally, every activity performed on a computer has an associated cost. What constitutes acceptable performance varies from one use-case and workload to another. In today's world, performance is even the determining factor for several kinds of applications. We will discuss Clojure (which runs on the JVM (Java Virtual Machine)), and its runtime environment in the light of performance, which is the goal of the book. In this article, we will study the basics of performance analysis, including the following: A whirlwind tour of how the application stack impacts performance Classifying the performance anticipations by the use cases types (For more resources related to this topic, see here.) Use case classification The performance requirements and priority vary across the different kinds of use cases. We need to determine what constitutes acceptable performance for the various kinds of use cases. Hence, we classify them to identify their performance model. When it comes to details, there is no sure shot performance recipe of any kind of use case, but it certainly helps to study their general nature. Note that in real life, the use cases listed in this section may overlap with each other. The user-facing software The performance of user facing applications is strongly linked to the user's anticipation. Having a difference of a good number of milliseconds may not be perceptible for the user but at the same time, a wait for more than a few seconds may not be taken kindly. One important element to normalize the anticipation is to engage the user by providing a duration-based feedback. A good idea to deal with such a scenario would be to start the task asynchronously in the background, and poll it from the UI layer to generate duration-based feedback for the user. Another way could be to incrementally render the results to the user to even out the anticipation. Anticipation is not the only factor in user facing performance. Common techniques like staging or precomputation of data, and other general optimization techniques can go a long way to improve the user experience with respect to performance. Bear in mind that all kinds of user facing interfaces fall into this use case category—the Web, mobile web, GUI, command line, touch, voice-operated, gesture...you name it. Computational and data-processing tasks Non-trivial compute intensive tasks demand a proportional amount of computational resources. All of the CPU, cache, memory, efficiency and the parallelizability of the computation algorithms would be involved in determining the performance. When the computation is combined with distribution over a network or reading from/staging to disk, I/O bound factors come into play. This class of workloads can be further subclassified into more specific use cases. A CPU bound computation A CPU bound computation is limited by the CPU cycles spent on executing it. Arithmetic processing in a loop, small matrix multiplication, determining whether a number is a Mersenne prime, and so on, would be considered CPU bound jobs. 
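As a small illustration (this snippet is not from the book; it only shows the general shape of such a workload), the following Clojure function spends essentially all of its time doing arithmetic on the CPU, with no I/O involved:

    ;; Sum of squares of the first n natural numbers, computed in a tight loop.
    ;; The cost is dominated by CPU cycles and grows with n.
    (defn sum-of-squares [n]
      (loop [i 0 acc 0]
        (if (< i n)
          (recur (inc i) (+ acc (* i i)))
          acc)))

    ;; Crude timing at the REPL
    (time (sum-of-squares 1000000))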
If the algorithm complexity is linked to the number of iterations/operations N, such as O(N), O(N2) and more, then the performance depends on how big N is, and how many CPU cycles each step takes. For parallelizable algorithms, performance of such tasks may be enhanced by assigning multiple CPU cores to the task. On virtual hardware, the performance may be impacted if the CPU cycles are available in bursts. A memory bound task A memory bound task is limited by the availability and bandwidth of the memory. Examples include large text processing, list processing, and more. For example, specifically in Clojure, the (reduce f (pmap g coll)) operation would be memory bound if coll is a large sequence of big maps, even though we parallelize the operation using pmap here. Note that higher CPU resources cannot help when memory is the bottleneck, and vice versa. Lack of availability of memory may force you to process smaller chunks of data at a time, even if you have enough CPU resources at your disposal. If the maximum speed of your memory is X and your algorithm on single the core accesses the memory at speed X/3, the multicore performance of your algorithm cannot exceed three times the current performance, no matter how many CPU cores you assign to it. The memory architecture (for example, SMP and NUMA) contributes to the memory bandwidth in multicore computers. Performance with respect to memory is also subject to page faults. A cache bound task A task is cache bound when its speed is constrained by the amount of cache available. When a task retrieves values from a small number of repeated memory locations, for example, a small matrix multiplication, the values may be cached and fetched from there. Note that CPUs (typically) have multiple layers of cache, and the performance will be at its best when the processed data fits in the cache, but the processing will still happen, more slowly, when the data does not fit into the cache. It is possible to make the most of the cache using cache-oblivious algorithms. A higher number of concurrent cache/memory bound threads than CPU cores is likely to flush the instruction pipeline, as well as the cache at the time of context switch, likely leading to a severely degraded performance. An input/output bound task An input/output (I/O) bound task would go faster if the I/O subsystem, that it depends on, goes faster. Disk/storage and network are the most commonly used I/O subsystems in data processing, but it can be serial port, a USB-connected card reader, or any I/O device. An I/O bound task may consume very few CPU cycles. Depending on the speed of the device, connection pooling, data compression, asynchronous handling, application caching, and more, may help in performance. One notable aspect of I/O bound tasks is that performance is usually dependent on the time spent waiting for connection/seek, and the amount of serialization that we do, and hardly on the other resources. In practice, many data processing workloads are usually a combination of CPU bound, memory bound, cache bound, and I/O bound tasks. The performance of such mixed workloads effectively depends on the even distribution of CPU, cache, memory, and I/O resources over the duration of the operation. A bottleneck situation arises only when one resource gets too busy to make way for another. Online transaction processing The online transaction processing (OLTP) systems process the business transactions on demand. 
It can sit behind systems such as a user-facing ATM machine, point-of-sale terminal, a network-connected ticket counter, ERP systems, and more. The OLTP systems are characterized by low latency, availability, and data integrity. They run day-to-day business transactions. Any interruption or outage is likely to have a direct and immediate impact on the sales or service. Such systems are expected to be designed for resiliency rather than the delayed recovery from failures. When the performance objective is unspecified, you may like to consider graceful degradation as a strategy. It is a common mistake to ask the OLTP systems to answer analytical queries; something that they are not optimized for. It is desirable of an informed programmer to know the capability of the system, and suggest design changes as per the requirements. Online analytical processing The online analytical processing (OLAP) systems are designed to answer analytical queries in short time. They typically get data from the OLTP operations, and their data model is optimized for querying. They basically provide for consolidation (roll-up), drill-down and slicing, and dicing of data for analytical purposes. They often use specialized data stores that can optimize the ad-hoc analytical queries on the fly. It is important for such databases to provide pivot-table like capability. Often, the OLAP cube is used to get fast access to the analytical data. Feeding the OLTP data into the OLAP systems may entail workflows and multistage batch processing. The performance concern of such systems is to efficiently deal with large quantities of data, while also dealing with inevitable failures and recovery. Batch processing Batch processing is automated execution of predefined jobs. These are typically bulk jobs that are executed during off-peak hours. Batch processing may involve one or more stages of job processing. Often batch processing is clubbed with work-flow automation, where some workflow steps are executed offline. Many of the batch processing jobs work on staging of data, and on preparing data for the next stage of processing to pick up. Batch jobs are generally optimized for the best utilization of the computing resources. Since there is little to moderate the demand to lower the latencies of some particular subtasks, these systems tend to optimize for throughput. A lot of batch jobs involve largely I/O processing and are often distributed over a cluster. Due to distribution, the data locality is preferred when processing the jobs; that is, the data and processing should be local in order to avoid network latency in reading/writing data. Summary We learned about the basics of what it is like to think more deeply about performance. The performance of Clojure applications depend on various factors. For a given application, understanding its use cases, design and implementation, algorithms, resource requirements and alignment with the hardware, and the underlying software capabilities, is essential. Resources for Article: Further resources on this subject: Big Data [article] The Observer Pattern [article] Working with Incanter Datasets [article]

Introduction and Composition

Packt
19 Aug 2015
17 min read
In this article written by Diogo Resende, author of the book Node.js High Performance, we will discuss how high performance is hard, and how it depends on many factors. Best performance should be a constant goal for developers. To achieve it, a developer must know the programming language they use and, more importantly, how the language performs under heavy loads, these being disk, memory, network, and processor usage. (For more resources related to this topic, see here.) Developers will make the most out of a language if they know its weaknesses. In a perfect world, since every job is different, a developer should look for the best tool for the job. But this is not feasible and a developer wouldn't be able to know every best tool, so they have to look for the second best tool for every job. A developer will excel if they know few tools but master them. As a metaphor, a hammer is used to drive nails, and you can also use it to break objects apart or forge metals, but you shouldn't use it to drive screws. The same applies to languages and platforms. Some platforms are very good for a lot of jobs but perform really badly at other jobs. This performance can sometimes be mitigated, but at other times, can't be avoided and you should look for better tools. Node.js is not a language; it's actually a platform built on top of V8, Google's open source JavaScript engine. This engine implements ECMAScript, which itself is a simple and very flexible language. I say "simple" because it has no way of accessing the network, accessing the disk, or talking to other processes. It can't even stop execution since it has no kind of exit instruction. This language needs some kind of interface model on top of it to be useful. Node.js does this by exposing a (preferably) nonblocking I/O model using libuv. This nonblocking API allows you to access the filesystem, connect to network services and execute child processes. The API also has two other important elements: buffers and streams. Since JavaScript strings are Unicode friendly, buffers were introduced to help deal with binary data. Streams are used as simple event interfaces to pass data around. Buffers and streams are used all over the API when reading file contents or receiving network packets. A stream is a module, similar to the network module. When loaded, it provides access to some base classes that help create readable, writable, duplex, and transform streams. These can be used to perform all sorts of data manipulation in a simplified and unified format. The buffers module easily becomes your best friend when converting binary data formats to some other format, for example, JSON. Multiple read and write methods help you convert integers and floats, signed or not, big endian or little endian, from 8 bits to 8 bytes long. Most of the platform is designed to be simple, small, and stable. It's designed and ready to create some high-performance applications. Performance analysis Performance is the amount of work completed in a defined period of time and with a set of defined resources. It can be analyzed using one or more metrics that depend on the performance goal. The goal can be low latency, low memory footprint, reduced processor usage, or even reduced power consumption. The act of performance analysis is also called profiling. Profiling is very important for making optimized applications and is achieved by instrumenting either the source or the instance of the application. By instrumenting the source, developers can spot common performance weak spots. 
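A very small sketch of source instrumentation (the function name and the workload are made up for this example) is to wrap a suspect call with Node.js's high-resolution timer:

    // A deliberately slow function standing in for real application code.
    function slowTask() {
      var total = 0;
      for (var i = 0; i < 1e7; i++) {
        total += Math.sqrt(i);
      }
      return total;
    }

    // process.hrtime returns [seconds, nanoseconds]; calling it again with the
    // previous value gives the elapsed time.
    var start = process.hrtime();
    slowTask();
    var diff = process.hrtime(start);
    console.log('slowTask took %d ms', diff[0] * 1e3 + diff[1] / 1e6);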
By instrumenting an application instance, they can test the application on different environments. This type of instrumentation can also be known by the name benchmarking. Node.js is known for being fast. Actually, it's not that fast; it's just as fast as your resources allow it. What Node.js is best at is not blocking your application because of an I/O task. The perception of performance can be misleading in Node.js applications. In some other languages, when an application task gets blocked—for example, by a disk operation—all other tasks can be affected. In the case of Node.js, this doesn't happen—usually. Some people look at the platform as being single threaded, which isn't true. Your code runs on a thread, but there are a few more threads responsible for I/O operations. Since these operations are extremely slow compared to the processor's performance, they run on a separate thread and signal the platform when they have information for your application. Applications blocking I/O operations perform poorly. Since Node.js doesn't block I/O unless you want it to, other operations can be performed while waiting for I/O. This greatly improves performance. V8 is an open source Google project and is the JavaScript engine behind Node.js. It's responsible for compiling and executing JavaScript, as well as managing your application's memory needs. It is designed with performance in mind. V8 follows several design principles to improve language performance. The engine has a profiler and one of the best and fast garbage collectors that exist, which is one of the keys to its performance. It also does not compile the language into byte code; it compiles it directly into machine code on the first execution. A good background in the development environment will greatly increase the chances of success in developing high-performance applications. It's very important to know how dereferencing works, or why your variables should avoid switching types. Here are other useful tips you would want to follow. You can use a style guide like JSCS and a linter like JSHint to enforce them to for yourself and your team. Here are some of them: Write small functions, as they're more easily optimized Use monomorphic parameters and variables Prefer arrays to manipulate data, as integer-indexed elements are faster Try to have small objects and avoid long prototype chains Avoid cloning objects because big objects will slow the operations Monitoring After an application is put into production mode, performance analysis becomes even more important, as users will be more demanding than you were. Users don't accept anything that takes more than a second, and monitoring the application's behavior over time and over some specific loads will be extremely important, as it will point to you where your platform is failing or will fail next. Yes, your application may fail, and the best you can do is be prepared. Create a backup plan, have fallback hardware, and create service probes. Essentially, anticipate all the scenarios you can think of, and remember that your application will still fail. Here are some of those scenarios and aspects that you should monitor: When in production, application usage is of extreme importance to understand where your application is heading in terms of data size or memory usage. It's important that you carefully define source code probes to monitor metrics—not only performance metrics, such as requests per second or concurrent requests, but also error rate and exception percentage per request served. 
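A probe doesn't need to be sophisticated to be useful. The sketch below (the counter names, port, and reporting interval are arbitrary choices for this example, not something prescribed by the book) keeps simple in-memory counters in an HTTP server and reports them periodically:

    var http = require('http');

    var stats = { requests: 0, errors: 0 };

    http.createServer(function (req, res) {
      stats.requests++;
      try {
        // Real request handling would go here.
        res.end('ok');
      } catch (err) {
        // Only counts synchronous failures in the handler; asynchronous
        // errors would need their own handling.
        stats.errors++;
        res.statusCode = 500;
        res.end();
      }
    }).listen(3000);

    // Report and reset the counters every 10 seconds.
    setInterval(function () {
      console.log('req/s: %d, errors: %d', stats.requests / 10, stats.errors);
      stats.requests = 0;
      stats.errors = 0;
    }, 10000);

In a real deployment, you would ship these numbers to a monitoring system rather than the console.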
Your application emits errors and sometimes throws exceptions; it's normal and you shouldn't ignore them. Don't forget the rest of the infrastructure. If your application must perform at high standards, your infrastructure should too. Your server power supply should be uninterruptible and stable, as instability will degrade your hardware faster than it should. Choose your disks wisely, as faster disks are more expensive and usually come in smaller storage sizes. Sometimes, however, this is actually not a bad decision when your application doesn't need that much storage and speed is considered more important. But don't just look at the gigabytes per dollar. Sometimes, it's more important to look at the gigabits per second per dollar. Also, your server temperature and server room should be monitored. High temperatures degrades performance and your hardware has an operation temperature limit. Security, both physical and virtual, is also very important. Everything counts for the standards of high performance, as an application that stops serving its users is not performing at all. Getting high performance Planning is essential in order to achieve the best results possible. High performance is built from the ground up and starts with how you plan and develop. It obviously depends on physical resources, as you can't perform well when you don't have sufficient memory to accomplish your task, but it also depends greatly on how you plan and develop an application. Mastering tools will give much better performance chances than just using them. Setting the bar high from the beginning of development will force the planning to be more prudent. Some bad planning of the database layer can really downgrade performance. Also, cautious planning will cause developers to think more about “use cases and program more consciously. High performance is when you have to think about a new set of resources (processor, memory, storage) because all that you have is exhausted, not just because one resource is. A high-performance application shouldn't need a second server when a little processor is used and the disk is full. In such a case, you just need bigger disks. Applications can't be designed as monolithic these days. An increasing user base enforces a distributed architecture, or at least one that can distribute load by having multiple instances. This is very important to accommodate in the beginning of the planning, as it will be harder to change an application that is already in production. Most common applications will start performing worse over time, not because of deficit of processing power but because of increasing data size on databases and disks. You'll notice that the importance of memory increases and fallback disks become critical to avoiding downtime. It's very important that an application be able to scale horizontally, whether to shard data across servers or across regions. A distributed architecture also increases performance. Geographically distributed servers can be more closed to clients and give a perception of performance. Also, databases distributed by more servers will handle more traffic as a whole and allow DevOps to accomplish zero downtime goals. This is also very useful for maintenance, as nodes can be brought down for support without affecting the application. Testing and benchmarking To know whether an application performs well or not under specific environments, we have to test it. This kind of test is called a benchmark. 
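For an HTTP service, the quickest first benchmark is often an external load generator. For example, assuming a local server is listening on port 3000 (such as the probe sketch shown earlier), the ab tool mentioned below could drive it like this:

    # 10,000 requests in total, 100 of them concurrent, against the local server
    $ ab -n 10000 -c 100 http://127.0.0.1:3000/

The requests-per-second figure and latency percentiles it reports give you a baseline to compare future changes against.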
Benchmarking is important to do and it's specific to every application. Even for the same language and platform, different applications might perform differently, either because of the way in which some parts of an application were structured or the way in which a database was designed. Analyzing the performance will indicate bottleneck of your application, or if you may, the parts of the application that perform not good as others. These are the parts that need to be improved. Constantly trying to improve the worst performing parts will elevate the application's overall performance. There are plenty of tools out there, some more specific or focused on JavaScript applications, such as benchmarkjs (http://benchmarkjs.com/) and ben (https://github.com/substack/node-ben), and others more generic, such as ab (http://httpd.apache.org/docs/2.2/programs/ab.html) and httpload (https://github.com/perusio/httpload). There are several types of benchmark tests depending on the goal, they are as follows: Load testing is the simplest form of benchmarking. It is done to find out how the application performs under a specific load. You can test and find out how many connections an application accepts per second, or how many traffic bytes an application can handle. An application load can be checked by looking at the external performance, such as traffic, and also internal performance, such as the processor used or the memory consumed. Soak testing is used to see how an application performs during a more extended period of time. It is done when an application tends to degrade over time and analysis is needed to see how it reacts. This type of test is important in order to detect memory leaks, as some applications can perform well in some basic tests, but over time, the memory leaks and their performance can degrade. Spike testing is used when a load is increased very fast to see how the application reacts and performs. This test is very useful and important in applications that can have spike usages, and operators need to know how the application will react. Twitter is a good example of an application environment that can be affected by usage spikes (in world events such as sports or religious dates), and need to know how the infrastructure will handle them. All of these tests can become harder as your application grows. Since your user base gets bigger, your application scales and you lose the ability to be able to load test with the resources you have. It's good to be prepared for this moment, especially to be prepared to monitor performance and keep track of soaks and spikes as your application users start to be the ones responsible for continuously test load. Composition in applications Because of this continuous demand of performant applications, composition becomes very important. Composition is a practice where you split the application into several smaller and simpler parts, making them easier to understand, develop, and maintain. It also makes them easier to test and improve. Avoid creating big, monolithic code bases. They don't work well when you need to make a change, and they also don't work well if you need to test and analyze any part of the code to improve it and make it perform better. The Node.js platform helps you—and in some ways, forces you to—compose your code. Node.js Package Manager (NPM) is a great module publishing service. You can download other people's modules and publish your own as well. 
There are tens of thousands of modules published, which means that you don't have to reinvent the wheel in most cases. This is good since you can avoid wasting time on creating a module and use a module that is already in production and used by many people, which normally means that bugs will be tracked faster and improvements will be delivered even faster. The Node.js platform allows developers to easily separate code. You don't have to do this, as the platform doesn't force you to, but you should try and follow some good practices, such as the ones described in the following sections. Using NPM Don't rewrite code unless you need to. Take your time to try some available modules, and choose the one that is right for you. This reduces the probability of writing faulty code and helps published modules that have a bigger user base. Bugs will be spotted earlier, and more people in different environments will test fixes. Moreover, you will be using a more resilient module. One important and neglected task after starting to use some modules is to track changes and, whenever possible, keep using recent stable versions. If a dependency module has not been updated for a year, you can spot a problem later, but you will have a hard time figuring out what changed between two versions that are a year apart. Node.js modules tend to be improved over time and API changes are not rare. Always upgrade with caution and don't forget to test. Separating your code Again, you should always split your code into smaller parts. Node.js helps you do this in a very easy way. You should not have files bigger than 5 kB. If you have, you better think about splitting it. Also, as a good rule, each user-defined object should have its own separate file. Name your files accordingly: // MyObject.js module.exports = MyObject; function MyObject() { // … } MyObject.prototype.myMethod = function () { … }; Another good rule to check whether you have a file bigger than it should be; that is, it should be easy to read and understand in less than 5 minutes by someone new to the application. If not, it means that it's too complex and it will be harder to track and fix bugs later on. Remember that later on, when your application becomes huge, you will be like a new developer when opening a file to fix something. You can't remember all of the code of the application, and you need to absorb a file behavior fast. Garbage collection When writing applications, managing available memory is boring and difficult. When the application gets complex it's easy to start leaking memory. Many programming languages have automatic memory management, removing this management away from the developer by means of a Garbage Collector (GC). The GC is only a part of this memory management, but it's the most important one and is responsible for reclaiming memory that is no longer in use (garbage), by periodically looking at disposed referenced objects and freeing the memory associated with them. The most common technique used by GC is to monitor reference counting. This means that GC, for each object, holds the number (count) of other objects that reference it. When an object has no references to it, it can be collected, meaning, it can be disposed and its memory freed. CPU Profiling Profiling is boring but it's a good form of software analysis where you measure resource usage. This usage is measured over time and sometimes under specific workloads. Resources can mean anything the application is using, being it memory, disk, network or processor. 
More specifically, CPU profiling allows you to analyze how, and how much, your functions use the processor. You can also analyze the opposite: the non-usage of the processor, that is, the idle time. When profiling the processor, we usually take samples of the call stack at a certain frequency and analyze how the stack changes (grows or shrinks) over the sampling period. If we use profilers from the operating system, we'll see more items in the stack than we probably expect, as we'll also get the internal calls of Node.js and V8.

Summary

Together, Node.js and NPM make a very good platform for developing high-performance applications. Since the language behind them is JavaScript and most applications these days are web applications, this combination is an even more appealing choice, as it's one less server-side language to learn (such as PHP or Ruby) and ultimately allows a developer to share code between the client and server sides. Also, frontend and backend developers can share, read, and improve each other's code. Many developers pick this formula and bring many of their habits from the client side with them. Some of these habits are not applicable because, on the server side, asynchronous tasks must rule: there are many clients connected (as opposed to one) and performance becomes crucial.

You now see that the garbage collector's task is not that easy, even though it does a very good job of managing memory automatically. You can help it a lot, especially if you write applications with performance in mind. Keeping the GC's old space from growing is necessary to avoid long GC cycles that pause your application and sometimes force your services to restart. Every time you create a new variable, you allocate memory and get closer to a new GC cycle. Even when you understand how memory is managed, you sometimes need to inspect your memory usage behavior.

In today's environments, it's very important to be able to profile an application to identify bottlenecks, especially at the processor and memory level. Overall, you should focus on your code quality before going into profiling. Resources for Article: Further resources on this subject: Node.js Fundamentals [Article] So, what is Node.js? [Article] Setting up Node [Article]

Optimization in Python

Packt
19 Aug 2015
14 min read
The path to mastering performance in Python has just started. Profiling only takes us half way there. Measuring how our program is using the resources at its disposal only tells us where the problem is, not how to fix it. In this article by Fernando Doglio, author of the book Mastering Python High Performance, we will cover the process of optimization, and to do that, we need to start with the basics. We'll keep it inside the language for now: no external tools, just Python and the right way to use it. We will cover the following topics in this article: Memoization / lookup tables Usage of default arguments (For more resources related to this topic, see here.) Memoization / lookup tables This is one of the most common techniques used to improve the performance of a piece of code (namely a function). We can save the results of expensive function calls associated to a specific set of input values and return the saved result (instead of redoing the whole computation) when the function is called with the remembered input. It might be confused with caching, since it is one case of it, although this term refers also, to other types of optimization (such as HTTP caching, buffering, and so on) This methodology is very powerful, because in practice, it'll turn what should have been a potentially very expensive call into a O(1) function call if the implementation is right. Normally, the parameters are used to create a unique key, which is then used on a dictionary to either save the result or obtain it if it's been already saved. There is, of course, a trade-off to this technique. If we're going to be remembering the returned values of a memoized function, then we'll be exchanging memory space for speed. This is a very acceptable trade-off, unless the saved data becomes more than what the system can handle. Classic use cases for this optimization are function calls that repeat the input parameters often. This will assure that most of the times, the memoized results are returned. If there are many function calls but with different parameters, we'll only store results and spend our memory without any real benefit, as seen in the following diagram: You can clearly see how the blue bar (Fixed params, memoized) is clearly the fastest use case, while the others are all similar due to their nature. Here is the code that generates values for the preceding chart. 
To generate some sort of time-consuming function, the code will call either the twoParams function or the twoParamsMemoized function several hundred times under different conditions, and it will log the execution time: import math import time import random class Memoized: def __init__(self, fn): self.fn = fn self.results = {} def __call__(self, *args): key = ''.join(map(str, args[0])) try: return self.results[key] except KeyError: self.results[key] = self.fn(*args) return self.results[key] @Memoized def twoParamsMemoized(values, period): totalSum = 0 for x in range(0, 100): for v in values: totalSum = math.pow((math.sqrt(v) * period), 4) + totalSum return totalSum def twoParams(values, period): totalSum = 0 for x in range(0, 100): for v in values: totalSum = math.pow((math.sqrt(v) * period), 4) + totalSum return totalSum def performTest(): valuesList = [] for i in range(0, 10): valuesList.append(random.sample(xrange(1, 101), 10)) start_time = time.clock() for x in range(0, 10): for values in valuesList: twoParamsMemoized(values, random.random()) end_time = time.clock() - start_time print "Fixed params, memoized: %s" % (end_time) start_time = time.clock() for x in range(0, 10): for values in valuesList: twoParams(values, random.random()) end_time = time.clock() - start_time print "Fixed params, without memoizing: %s" % (end_time) start_time = time.clock() for x in range(0, 10): for values in valuesList: twoParamsMemoized(random.sample(xrange(1,2000), 10), random.random()) end_time = time.clock() - start_time print "Random params, memoized: %s" % (end_time) start_time = time.clock() for x in range(0, 10): for values in valuesList: twoParams(random.sample(xrange(1,2000), 10), random.random()) end_time = time.clock() - start_time print "Random params, without memoizing: %s" % (end_time) performTest() The main insight to take from the preceding chart is that just like with every aspect of programming, there is no silver bullet algorithm that will work for all cases. Memoization is clearly a very basic way of optimizing code, but clearly, it won't optimize anything given the right circumstances. As for the code, there is not much to it. It is a very simple, non real-world example of the point I was trying to send across. The performTest function will take care of running a series of 10 tests for every use case and measure the total time each use case takes. Notice that we're not really using profilers at this point. We're just measuring time in a very basic and ad-hoc way, which works for us. The input for both functions is simply a set of numbers on which they will run some math functions, just for the sake of doing something. The other interesting bit about the arguments is that, since the first argument is a list of numbers, we can't just use the args parameter as a key inside the Memoized class' methods. This is why we have the following line: key = ''.join(map(str, args[0])) This line will concatenate all the numbers from the first parameter into a single value, which will act as the key. The second parameter is not used here, because it's always random, which would imply that the key would never be the same. Another variation of the preceding method is to pre-calculate all values from the function (assuming we have a limited number of inputs of course) during initialization and then refer to the lookup table during execution. 
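A minimal sketch of that variation could look as follows (the function and the input range are placeholders chosen for the example; they are not part of the book's code):

    import math

    # Pre-calculate the result for every input we will ever need...
    lookup_table = {}
    for x in range(0, 1000):
        lookup_table[x] = math.pow(math.sqrt(x) * 2, 4)

    # ...so that at execution time we only pay for a dictionary lookup.
    def fast_call(x):
        return lookup_table[x]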
This approach has several preconditions: The number of input values must be finite; otherwise it's impossible to precalculate everything The lookup table with all of its values must fit into memory Just like before, the input must be repeated, at least once, so the optimization both makes sense and is worth the extra effort There are different approaches when it comes to architecting the lookup table, all offering different types of optimizations. It all depends on the type of application and solution that you're trying to optimize. Here is a set of examples. Lookup on a list or linked list This solution works by iterating over an unsorted list and checking the key against each element, with the associated value as the result we're looking for. This is obviously a very slow method of implementation, with a big O notation of O(n) for both the average and worst case scenarios. Still, given the right circumstances, it could prove to be faster than calling the actual function every time. In this case, using a linked list would improve the performance of the algorithm over using a simple list. However, it would still depend heavily on the type of linked list it is (doubly linked list, simple linked list with direct access to the first and last elements, and so on). Simple lookup on a dictionary This method works using a one-dimensional dictionary lookup, indexed by a key consisting of the input parameters (enough of them create a unique key). In particular cases (like we covered earlier), this is probably one of the fastest lookups, even faster than binary search in some cases with a constant execution time (big O notation of O(1)). Note that this approach is efficient as long as the key-generation algorithm is capable of generating unique keys every time. Otherwise, the performance could degrade over time due to the many collisions on the dictionaries. Binary search This particular method is only possible if the list is sorted. This could potentially be an option depending on the values to sort. Yet, sorting them would require an extra effort that would hurt the performance of the entire effort. However, it presents very good results even in long lists (average big O notation of O(log n)). It works by determining in which half of the list the value is and repeating until either the value is found or the algorithm is able to determine that the value is not in the list. To put all of this into perspective, looking at the Memoized class mentioned earlier, it implements a simple lookup on a dictionary. However, this would be the place to implement either of the other algorithms. Use cases for lookup tables There are some classic example use cases for this type of optimization, but the most common one is probably the optimization of trigonometric functions. Based on the computing time, these functions are really slow. When used repeatedly, they can cause some serious damage to your program's performance. This is why it is normally recommended to precalculate the values of these functions. For functions that deal with an infinite domain universe of possible input values, this task becomes impossible. So, the developer is forced to sacrifice accuracy for performance by precalculating a discrete subdomain of the possible input values (that is, going from floating points down to integer numbers). This approach might not be ideal in some cases, since some systems require both performance and accuracy. 
So, the solution is to meet in the middle and use some form of interpolation to calculate the wanted value, based on the ones that have been precalculated. It will provide better accuracy. Even though it won't be as performant as using the lookup table directly, it should prove to be faster than doing the trigonometric calculation every time. Let's look at some examples of this, for instance, for the following trigonometric function: def complexTrigFunction(x): return math.sin(x) * math.cos(x)**2 We'll take a look at how simple precalculation won't be accurate enough and how some form of interpolation will result in a better level of accuracy. The following code will precalculate the values for the function on a range from -1000 to 1000 (only integer values). Then, it'll try to do the same calculation (only for a smaller range) for floating point numbers: import math import time from collections import defaultdict import itertools trig_lookup_table = defaultdict(lambda: 0) def drange(start, stop, step): assert(step != 0) sample_count = math.fabs((stop - start) / step) return itertools.islice(itertools.count(start, step), sample_count) def complexTrigFunction(x): return math.sin(x) * math.cos(x)**2 def lookUpTrig(x): return trig_lookup_table[int(x)] for x in range(-1000, 1000): trig_lookup_table[x] = complexTrigFunction(x) trig_results = [] lookup_results = [] init_time = time.clock() for x in drange(-100, 100, 0.1): trig_results.append(complexTrigFunction(x)) print "Trig results: %s" % (time.clock() - init_time) init_time = time.clock() for x in drange(-100, 100, 0.1): lookup_results.append(lookUpTrig(x)) print "Lookup results: %s" % (time.clock() - init_time) for idx in range(0, 200): print "%st%s" % (trig_results [idx], lookup_results[idx]) The results from the preceding code will help demonstrate how the simple lookup table approach is not accurate enough (see the following chart). However, it compensates for it on speed since the original function takes 0.001526 seconds to run while the lookup table only takes 0.000717 seconds. The preceding chart shows how the lack of interpolation hurts the accuracy. You can see how even though both plots are quite similar, the results from the lookup table execution aren't as accurate as the trig function used directly. So, now, let's take another look at the same problem. 
However, this time, we'll add some basic interpolation (we'll limit the rage of values from -PI to PI): import math import time from collections import defaultdict import itertools trig_lookup_table = defaultdict(lambda: 0) def drange(start, stop, step): assert(step != 0) sample_count = math.fabs((stop - start) / step) return itertools.islice(itertools.count(start, step), sample_count) def complexTrigFunction(x): return math.sin(x) * math.cos(x)**2 reverse_indexes = {} for x in range(-1000, 1000): trig_lookup_table[x] = complexTrigFunction(math.pi * x / 1000) complex_results = [] lookup_results = [] init_time = time.clock() for x in drange(-10, 10, 0.1): complex_results .append(complexTrigFunction(x)) print "Complex trig function: %s" % (time.clock() - init_time) init_time = time.clock() factor = 1000 / math.pi for x in drange(-10 * factor, 10 * factor, 0.1 * factor): lookup_results.append(trig_lookup_table[int(x)]) print "Lookup results: %s" % (time.clock() - init_time) for idx in range(0, len(lookup_results )): print "%st%s" % (complex_results [idx], lookup_results [idx]) As you might've noticed in the previous chart, the resulting plot is periodic (specially because we've limited the range from -PI to PI). So, we'll focus on a particular range of values that will generate one single segment of the plot. The output of the preceding script also shows that the interpolation solution is still faster than the original trigonometric function, although not as fast as it was earlier: Interpolation Solution Original function 0.000118 seconds 0.000343 seconds The following chart is a bit different from the previous one, especially because it shows (in green) the error percentage between the interpolated value and the original one: The biggest error we have is around 12 percent (which represents the peaks we see on the chart). However, it's for the smallest values, such as -0.000852248551417 versus -0.000798905501416. This is a case where the error percentage needs to be contextualized to see if it really matters. In our case, since the values related to that error are so small, we can ignore that error in practice. There are other use cases for lookup tables, such as in image processing. However, for the sake of this article, the preceding example should be enough to demonstrate their benefits and trade-off implied in their usage. Usage of default arguments Another optimization technique, one that is contrary to memoization, is not particularly generic. Instead, it is directly tied to how the Python interpreter works. Default arguments can be used to determine values once at function creation time instead of at run time. This can only be done for functions or objects that will not be changed during program execution. Let's look at an example of how this optimization can be applied. The following code shows two versions of the same function, which does some random trigonometric calculation: import math #original function def degree_sin(deg): return math.sin(deg * math.pi / 180.0) #optimized function, the factor variable is calculated during function creation time, #and so is the lookup of the math.sin method. def degree_sin(deg, factor=math.pi/180.0, sin=math.sin): return sin(deg * factor) This optimization can be problematic if not correctly documented. Since it uses attributes to precompute terms that should not change during the program's execution, it could lead to the creation of a confusing API. 
With a quick and simple test, we can double-check the performance gain from this optimization: import time import math def degree_sin(deg): return math.sin(deg * math.pi / 180.0) * math.cos(deg * math.pi / 180.0) def degree_sin_opt(deg, factor=math.pi/180.0, sin=math.sin, cos = math.cos): return sin(deg * factor) * cos(deg * factor) normal_times = [] optimized_times = [] for y in range(100): init = time.clock() for x in range(1000): degree_sin(x) normal_times.append(time.clock() - init) init = time.clock() for x in range(1000): degree_sin_opt(x) optimized_times.append(time.clock() - init) print "Normal function: %s" % (reduce(lambda x, y: x + y, normal_times, 0) / 100) print "Optimized function: %s" % (reduce(lambda x, y: x + y, optimized_times, 0 ) / 100) The preceding code measures the time it takes for the script to finish each of the versions of the function to run its code 1000 times. It saves those measurements, and finally, it does an average for each case. The result is displayed in the following chart: It clearly isn't an amazing optimization. However, it does shave off some microseconds from our execution time, so we'll keep it in mind. Just remember that this optimization could cause problems if you're working as part of an OS developer team. Summary In this article, we covered several optimization techniques. Some of them are meant to provide big boosts on speed, save memory. Some of them are just meant to provide minor speed improvements. Most of this article covered Python-specific techniques, but some of them can be translated into other languages as well. Resources for Article: Further resources on this subject: How to do Machine Learning with Python [article] The Essentials of Working with Python Collections [article] Symbolizers [article]

Fine-tune the NGINX Configuration

Packt
14 Jul 2015
20 min read
In this article by Rahul Sharma, author of the book NGINX High Performance, we will cover the following topics: NGINX configuration syntax Configuring NGINX workers Configuring NGINX I/O Configuring TCP Setting up the server (For more resources related to this topic, see here.) NGINX configuration syntax This section aims to cover it in good detail. The complete configuration file has a logical structure that is composed of directives grouped into a number of sections. A section defines the configuration for a particular NGINX module, for example, the http section defines the configuration for the ngx_http_core module. An NGINX configuration has the following syntax: Valid directives begin with a variable name and then state an argument or series of arguments separated by spaces. All valid directives end with a semicolon (;). Sections are defined with curly braces ({}). Sections can be nested in one another. The nested section defines a module valid under the particular section, for example, the gzip section under the http section. Configuration outside any section is part of the NGINX global configuration. The lines starting with the hash (#) sign are comments. Configurations can be split into multiple files, which can be grouped using the include directive. This helps in organizing code into logical components. Inclusions are processed recursively, that is, an include file can further have include statements. Spaces, tabs, and new line characters are not part of the NGINX configuration. They are not interpreted by the NGINX engine, but they help to make the configuration more readable. Thus, the complete file looks like the following code: #The configuration begins here global1 value1; #This defines a new section section { sectionvar1 value1; include file1;    subsection {    subsectionvar1 value1; } } #The section ends here global2 value2; # The configuration ends here NGINX provides the -t option, which can be used to test and verify the configuration written in the file. If the file or any of the included files contains any errors, it prints the line numbers causing the issue: $ sudo nginx -t This checks the validity of the default configuration file. If the configuration is written in a file other than the default one, use the -c option to test it. You cannot test half-baked configurations, for example, you defined a server section for your domain in a separate file. Any attempt to test such a file will throw errors. The file has to be complete in all respects. Now that we have a clear idea of the NGINX configuration syntax, we will try to play around with the default configuration. This article only aims to discuss the parts of the configuration that have an impact on performance. The NGINX catalog has large number of modules that can be configured for some purposes. This article does not try to cover all of them as the details are beyond the scope of the book. Please refer to the NGINX documentation at http://nginx.org/en/docs/ to know more about the modules. Configuring NGINX workers NGINX runs a fixed number of worker processes as per the specified configuration. In the following sections, we will work with NGINX worker parameters. These parameters are mostly part of the NGINX global context. worker_processes The worker_processes directive controls the number of workers: worker_processes 1; The default value for this is 1, that is, NGINX runs only one worker. 
The value should be changed to an optimal value depending on the number of cores available, disks, network subsystem, server load, and so on. As a starting point, set the value to the number of cores available. Determine the number of cores available using lscpu: $ lscpu Architecture:     x86_64 CPU op-mode(s):   32-bit, 64-bit Byte Order:     Little Endian CPU(s):       4 The same can be accomplished by greping out cpuinfo: $ cat /proc/cpuinfo | grep 'processor' | wc -l Now, set this value to the parameter: # One worker per CPU-core. worker_processes 4; Alternatively, the directive can have auto as its value. This determines the number of cores and spawns an equal number of workers. When NGINX is running with SSL, it is a good idea to have multiple workers. SSL handshake is blocking in nature and involves disk I/O. Thus, using multiple workers leads to improved performance. accept_mutex Since we have configured multiple workers in NGINX, we should also configure the flags that impact worker selection. The accept_mutex parameter available under the events section will enable each of the available workers to accept new connections one by one. By default, the flag is set to on. The following code shows this: events { accept_mutex on; } If the flag is turned to off, all of the available workers will wake up from the waiting state, but only one worker will process the connection. This results in the Thundering Herd phenomenon, which is repeated a number of times per second. The phenomenon causes reduced server performance as all the woken-up workers take up CPU time before going back to the wait state. This results in unproductive CPU cycles and nonutilized context switches. accept_mutex_delay When accept_mutex is enabled, only one worker, which has the mutex lock, accepts connections, while others wait for their turn. The accept_mutex_delay corresponds to the timeframe for which the worker would wait, and after which it tries to acquire the mutex lock and starts accepting new connections. The directive is available under the events section with a default value of 500 milliseconds. The following code shows this: events{ accept_mutex_delay 500ms; } worker_connections The next configuration to look at is worker_connections, with a default value of 512. The directive is present under the events section. The directive sets the maximum number of simultaneous connections that can be opened by a worker process. The following code shows this: events{    worker_connections 512; } Increase worker_connections to something like 1,024 to accept more simultaneous connections. The value of worker_connections does not directly translate into the number of clients that can be served simultaneously. Each browser opens a number of parallel connections to download various components that compose a web page, for example, images, scripts, and so on. Different browsers have different values for this, for example, IE works with two parallel connections while Chrome opens six connections. The number of connections also includes sockets opened with the upstream server, if any. worker_rlimit_nofile The number of simultaneous connections is limited by the number of file descriptors available on the system as each socket will open a file descriptor. If NGINX tries to open more sockets than the available file descriptors, it will lead to the Too many opened files message in the error.log. 
worker_rlimit_nofile

The number of simultaneous connections is limited by the number of file descriptors available on the system, as each socket opens a file descriptor. If NGINX tries to open more sockets than the available file descriptors, it leads to the Too many open files message in error.log.

Check the number of file descriptors using ulimit:

    $ ulimit -n

Now, increase this to a value greater than worker_processes * worker_connections. The value should be increased for the user that runs the worker process. Check the user directive to get the username.

NGINX provides the worker_rlimit_nofile directive as an alternative way of setting the available file descriptors rather than modifying ulimit. Setting the directive has a similar impact as updating ulimit for the worker user. The value of this directive overrides the ulimit value set for the user. The directive is not present by default. Set a large value to handle a large number of simultaneous connections. The following code shows this:

    worker_rlimit_nofile 20960;

To determine the OS limits imposed on a process, read the file /proc/$pid/limits. $pid corresponds to the PID of the process.
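As a hedged worked example, four workers with 1,024 connections each can open up to 4 x 1,024 = 4,096 sockets, so the limit should sit comfortably above that figure; the exact number below is only illustrative:

    worker_processes     4;
    worker_rlimit_nofile 8192;        # > 4 * 1024 sockets, with headroom for logs and upstream connections

    events {
        worker_connections 1024;
    }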
multi_accept

The multi_accept flag enables an NGINX worker to accept as many connections as possible when it gets the notification of a new connection. The purpose of this flag is to accept all connections in the listen queue at once. If the directive is disabled, a worker process accepts connections one by one. The following code shows this:

    events {
        multi_accept on;
    }

The directive is available under the events section with the default value off.

If the server has a constant stream of incoming connections, enabling multi_accept may result in a worker accepting more connections than the number specified in worker_connections. The overflow leads to performance loss, as the previously accepted connections that are part of the overflow will not get processed.

use

NGINX provides several methods for connection processing. Each of the available methods allows NGINX workers to monitor multiple socket file descriptors, that is, to detect when there is data available for reading or writing. These calls allow NGINX to process multiple socket streams without getting stuck in any one of them. The methods are platform-dependent, and the configure command, used to build NGINX, selects the most efficient method available on the platform. If we want to use other methods, they must be enabled in NGINX first.

The use directive allows us to override the default method with the method specified. The directive is part of the events section:

    events {
        use select;
    }

NGINX supports the following methods of processing connections:

- select: This is the standard method of processing connections. It is built automatically on platforms that lack more efficient methods. The module can be enabled or disabled using the --with-select_module or --without-select_module configuration parameter.
- poll: This is the standard method of processing connections. It is built automatically on platforms that lack more efficient methods. The module can be enabled or disabled using the --with-poll_module or --without-poll_module configuration parameter.
- kqueue: This is an efficient method of processing connections available on FreeBSD 4.1+, OpenBSD 2.9+, NetBSD 2.0, and OS X. There are the additional directives kqueue_changes and kqueue_events. These directives specify the number of changes and events that NGINX will pass to the kernel. The default value for both of these is 512. The kqueue method ignores the multi_accept directive if it has been enabled.
- epoll: This is an efficient method of processing connections available on Linux 2.6+. The method is similar to the FreeBSD kqueue. There is also the additional directive epoll_events. This specifies the number of events that NGINX will pass to the kernel. The default value for this is 512.
- /dev/poll: This is an efficient method of processing connections available on Solaris 7 11/99+, HP/UX 11.22+, IRIX 6.5.15+, and Tru64 UNIX 5.1A+. It has the additional directives devpoll_events and devpoll_changes. The directives specify the number of changes and events that NGINX will pass to the kernel. The default value for both of these is 32.
- eventport: This is an efficient method of processing connections available on Solaris 10. The method requires the necessary security patches to avoid kernel crash issues.
- rtsig: Real-time signals is a connection processing method available on Linux 2.2+. The method has some limitations. On older kernels, there is a system-wide limit of 1,024 signals. For high loads, the limit needs to be increased by setting the rtsig-max parameter. For kernel 2.6+, instead of the system-wide limit, there is a limit on the number of outstanding signals for each process. NGINX provides the worker_rlimit_sigpending parameter to modify the limit for each of the worker processes:

      worker_rlimit_sigpending 512;

  The parameter is part of the NGINX global configuration. If the queue overflows, NGINX drains the queue and uses the poll method to process the unhandled events. When the condition is back to normal, NGINX switches back to the rtsig method of connection processing. NGINX provides the rtsig_overflow_events, rtsig_overflow_test, and rtsig_overflow_threshold parameters to control how the signal queue is handled on overflows. The rtsig_overflow_events parameter defines the number of events passed to poll. The rtsig_overflow_test parameter defines the number of events handled by poll, after which NGINX drains the queue. Before draining the signal queue, NGINX checks how full it is; if the fill factor is larger than the specified rtsig_overflow_threshold, it drains the queue. The rtsig method requires accept_mutex to be set. The method also enables the multi_accept parameter.

Configuring NGINX I/O

NGINX can also take advantage of the Sendfile and direct I/O options available in the kernel. In the following sections, we will try to configure the parameters available for disk I/O.

Sendfile

When a file is transferred by an application, the kernel first buffers the data and then sends the data to the application buffers. The application, in turn, sends the data to the destination. The Sendfile method is an improved method of data transfer, in which data is copied between file descriptors within the OS kernel space, that is, without transferring data to the application buffers. This results in improved utilization of the operating system's resources.

The method can be enabled using the sendfile directive. The directive is available for the http, server, and location sections:

    http {
        sendfile on;
    }

The flag is set to off by default.
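Because the directive is accepted at the http, server, and location levels, it can also be enabled only where static content is actually served; the following is a hedged sketch with illustrative paths:

    http {
        sendfile off;                # keep the default for dynamic or proxied locations

        server {
            listen 80;

            location /static/ {
                sendfile on;         # kernel-space copies for assets under /static/
            }
        }
    }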
Direct I/O

The OS kernel usually tries to optimize and cache any read/write requests. Since the data is cached within the kernel, any subsequent read request to the same place will be much faster because there is no need to read the information from slow disks.

Direct I/O is a feature of the filesystem where reads and writes go directly from the application to the disk, thus bypassing all OS caches. This results in better utilization of CPU cycles and improved cache effectiveness. The method is used in places where the data has a poor hit ratio. Such data does not need to be in any cache and can be loaded when required. It can be used to serve large files. The directio directive enables the feature. The directive is available for the http, server, and location sections:

    location /video/ {
        directio 4m;
    }

Any file with a size larger than that specified in the directive will be loaded by direct I/O. The parameter is disabled by default. The use of direct I/O to serve a request automatically disables Sendfile for that particular request.

Direct I/O depends on the block size used while doing a data transfer. NGINX has the directio_alignment directive to set the block size. The directive is present under the http, server, and location sections:

    location /video/ {
        directio 4m;
        directio_alignment 512;
    }

The default value of 512 bytes works well on most systems, unless the files are served from an XFS filesystem on Linux. In that case, the size should be increased to 4 KB.

Asynchronous I/O

Asynchronous I/O allows a process to initiate I/O operations without having to block or wait for them to complete.

The aio directive is available under the http, server, and location sections of an NGINX configuration. Depending on the section, the parameter performs asynchronous I/O for the matching requests. The parameter works on Linux kernel 2.6.22+ and FreeBSD 4.3. The following code shows this:

    location /data {
        aio on;
    }

By default, the parameter is set to off. On Linux, aio needs to be enabled with directio, while on FreeBSD, sendfile needs to be disabled for aio to take effect.

If NGINX has not been configured with the --with-file-aio option, any use of the aio directive will cause the unknown directive "aio" error.

The directive has a special value, threads, which enables multithreading for send and read operations. The multithreading support is only available on the Linux platform and can only be used with the epoll, kqueue, or eventport methods of processing requests.

In order to use the threads value, configure multithreading in the NGINX binary using the --with-threads option. After this, add a thread pool in the NGINX global context using the thread_pool directive, and use the same pool in the aio configuration:

    thread_pool io_pool threads=16;

    http {
        ...
        location /data {
            sendfile on;
            aio      threads=io_pool;
        }
    }

Mixing them up

The three directives can be mixed together to achieve different objectives on different platforms. The following configuration uses sendfile for files with sizes smaller than the one specified in directio. Files served by directio are read using asynchronous I/O:

    location /archived-data/ {
        sendfile on;
        aio      on;
        directio 4m;
    }

The aio directive has a sendfile value, which is available only on the FreeBSD platform. The value can be used to perform Sendfile in an asynchronous manner:

    location /archived-data/ {
        sendfile on;
        aio      sendfile;
    }

NGINX invokes the sendfile() system call, which returns with no data in the memory. After this, NGINX initiates the data transfer in an asynchronous manner.
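The same directives can also be split across locations so that each content type gets the I/O path that suits it; the following is a hedged sketch with illustrative paths and sizes:

    http {
        sendfile on;                      # default path for small static files

        server {
            listen 80;

            location /archives/ {
                directio 8m;              # large downloads bypass the page cache
                aio      on;              # and are read asynchronously (needs directio on Linux)
            }

            location /img/ {
                # no directio here, so images keep using the sendfile path
            }
        }
    }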
Configuring TCP

HTTP is an application-layer protocol that uses TCP as its transport layer. In TCP, data is transferred in the form of blocks known as TCP packets. NGINX provides directives to alter the behavior of the underlying TCP stack. These parameters alter flags for an individual socket connection.

TCP_NODELAY

TCP/IP networks have the "small packet" problem, where single-character messages can cause network congestion on a highly loaded network. Such packets are 41 bytes in size, of which 40 bytes are the TCP/IP headers and 1 byte is useful information. These small packets have a huge overhead of around 4,000 percent and can saturate a network.

John Nagle solved the problem (Nagle's algorithm) by not sending small packets immediately. All such packets are collected for some amount of time and then sent in one go as a single packet. This results in improved efficiency of the underlying network. Thus, a typical TCP/IP stack waits for up to 200 milliseconds before sending the data packets to the client.

It is important to note that the problem exists with applications such as Telnet, where each keystroke is sent over the wire. The problem is not relevant to a web server, which serves static files. The files mostly form full TCP packets, which can be sent immediately instead of waiting for 200 milliseconds.

The TCP_NODELAY option can be used while opening a socket to disable Nagle's buffering algorithm and send the data as soon as it is available. NGINX provides the tcp_nodelay directive to enable this option. The directive is available under the http, server, and location sections of an NGINX configuration:

    http {
        tcp_nodelay on;
    }

The directive is enabled by default. NGINX uses tcp_nodelay for connections in the keep-alive mode.

TCP_CORK

As an alternative to Nagle's algorithm, Linux provides the TCP_CORK option. The option tells the TCP stack to append packets and send them when they are full or when the application instructs it to send the packet by explicitly removing TCP_CORK. This results in an optimal amount of data packets being sent and, thus, improves the efficiency of the network. The TCP_CORK option is available as the TCP_NOPUSH flag on FreeBSD and Mac OS.

NGINX provides the tcp_nopush directive to enable TCP_CORK over the connection socket. The directive is available under the http, server, and location sections of an NGINX configuration:

    http {
        tcp_nopush on;
    }

The directive is disabled by default. NGINX uses tcp_nopush for requests served with sendfile.

Setting them up

The two directives discussed previously might seem mutually exclusive; the former makes sure that the network latency is reduced, while the latter tries to optimize the data packets sent. An application should set both of these options to get efficient data transfer.

Enabling tcp_nopush along with sendfile makes sure that while transferring a file, the kernel creates the maximum amount of full TCP packets before sending them over the wire. The last packet(s) can be partial TCP packets, which could end up waiting with TCP_CORK enabled. NGINX makes sure it removes TCP_CORK to send these packets. Since tcp_nodelay is also set, these packets are immediately sent over the network, that is, without any delay.

Setting up the server

The following configuration sums up all the changes proposed in the preceding sections:

    worker_processes 3;
    worker_rlimit_nofile 8000;

    events {
        multi_accept on;
        use epoll;
        worker_connections 1024;
    }

    http {
        sendfile on;
        aio on;
        directio 4m;
        tcp_nopush on;
        tcp_nodelay on;
        # Rest of the NGINX configuration removed for brevity
    }

It is assumed that NGINX runs on a quad-core server. Thus, three worker processes have been spawned to take advantage of three of the four available cores, leaving one core for other processes. Each of the workers has been configured to work with 1,024 connections. Correspondingly, the nofile limit has been increased to 8,000. By default, all worker processes operate with the accept mutex; thus, the flag has not been set. Each worker processes multiple connections in one go using the epoll method. In the http section, NGINX has been configured to serve files larger than 4 MB using direct I/O, while efficiently buffering smaller files using Sendfile. TCP options have also been set up to efficiently utilize the available network.
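Before taking new measurements, the updated file should be validated and the running instance reloaded; assuming a standard installation, the usual sequence looks something like this:

    $ sudo nginx -t            # verify the syntax of the updated configuration
    $ sudo nginx -s reload     # ask the master process to re-read it gracefully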
Measuring gains

It is time to test the changes and make sure that they have given a performance gain. Run a series of tests using Siege/JMeter to get new performance numbers. The tests should be performed with the same configuration to get a comparable output:

    $ siege -b -c 790 -r 50 -q http://192.168.2.100/hello

    Transactions:              79000 hits
    Availability:              100.00 %
    Elapsed time:              24.25 secs
    Data transferred:          12.54 MB
    Response time:             0.20 secs
    Transaction rate:          3257.73 trans/sec
    Throughput:                0.52 MB/sec
    Concurrency:               660.70
    Successful transactions:   39500
    Failed transactions:       0
    Longest transaction:       3.45
    Shortest transaction:      0.00

The results from Siege should be evaluated and compared to the baseline:

- Throughput: The transaction rate puts this at roughly 3,250 requests per second
- Error rate: Availability is reported as 100 percent; thus, the error rate is 0 percent
- Response time: The results show a response time of 0.20 seconds

Thus, these new numbers demonstrate a performance improvement in various respects.

After the server configuration is updated with all the changes, rerun all the tests with increased numbers. The aim should be to determine the new baseline numbers for the updated configuration.

Summary

The article started with an overview of the NGINX configuration syntax. Going further, we discussed worker_connections and the related parameters. These allow you to take advantage of the available hardware. The article also talked about the different event processing mechanisms available on different platforms. The configuration discussed helped in processing more requests, thus improving the overall throughput.

NGINX is primarily a web server; thus, it has to serve all kinds of static content. Large files can take advantage of direct I/O, while smaller content can take advantage of Sendfile. The different disk modes make sure that we have an optimal configuration to serve the content.

In the TCP stack, we discussed the flags available to alter the default behavior of TCP sockets. The tcp_nodelay directive helps in improving latency. The tcp_nopush directive can help in efficiently delivering the content. Both these flags lead to improved response times.

In the last part of the article, we applied all the changes to our server and then ran performance tests to determine the effectiveness of the changes made. In the next article, we will try to configure buffers, timeouts, and compression to improve the utilization of the available network.