How-To Tutorials - Design Patterns

32 Articles
Packt
12 Jan 2016
11 min read

Façade Pattern – Being Adaptive with Façade

In this article by Chetan Giridhar, author of the book Learning Python Design Patterns - Second Edition, we introduce the Façade design pattern and show how it is used in software application development. We will work with a sample use case and implement it in Python v3.5.

In brief, we will cover the following topics in this article:

• An understanding of the Façade design pattern with a UML diagram
• A real-world use case with the Python v3.5 code implementation
• The Façade pattern and the principle of least knowledge

Understanding the Façade design pattern

A façade is generally the face of a building, especially an attractive one. The word can also refer to a behavior or appearance that gives a false idea of someone's true feelings or situation. When people walk past a façade, they can appreciate the exterior face but aren't aware of the complexities of the structure within. This is how the Façade pattern is used: it hides the complexities of the internal system and gives the client an interface through which it can access the system in a very simplified way.

Consider the example of a storekeeper. When you, as a customer, visit a store to buy certain items, you're not aware of the layout of the store. You typically approach the storekeeper, who knows the store system well. Based on your requirements, the storekeeper picks up the items and hands them over to you. Isn't this easy? The customer need not know how the store is laid out and gets the job done through a simple interface, the storekeeper.

The Façade design pattern essentially does the following:

• It provides a unified interface to a set of interfaces in a subsystem and defines a high-level interface that helps the client use the subsystem in an easy way.
• It represents a complex subsystem with a single interface object. It doesn't encapsulate the subsystem but actually combines the underlying subsystems.
• It promotes the decoupling of the implementation from multiple clients.

A UML class diagram

We will now discuss the Façade pattern with the help of the following UML diagram:

[Figure: UML class diagram of the Façade pattern]

Looking at the UML diagram, you'll notice that there are three main participants in this pattern:

• Façade: The main responsibility of a façade is to wrap up a complex group of subsystems so that it can provide a pleasing look to the outside world.
• System: This represents a set of varied subsystems that make the whole system compound and difficult to view or work with.
• Client: The client interacts with the Façade so that it can easily communicate with the subsystem and get the work completed. It doesn't have to bother about the complex nature of the system.

You will now learn a little more about the three main participants from the data structure's perspective.

Façade

The following points will give us a better idea of Façade:

• It is an interface that knows which subsystems are responsible for a request
• It delegates the client's requests to the appropriate subsystem objects using composition

For example, if the client is looking for some work to be accomplished, it need not go to the individual subsystems but can simply contact the interface (Façade) that gets the work done.

System

In the Façade world, System is an entity that performs the following:

• It implements subsystem functionality and is represented by a class. Ideally, a System is represented by a group of classes that are responsible for different operations.
• It handles the work assigned by the Façade object but has no knowledge of the façade and keeps no reference to it.

For instance, when the client requests a certain service from the Façade, the Façade chooses the right subsystem to deliver the service based on the type of service.

Client

Here's how we can describe the client:

• The client is a class that instantiates the Façade
• It makes requests to the Façade to get the work done from the subsystems

Implementing the Façade pattern in the real world

To demonstrate the applications of the Façade pattern, let's take an example that most of us will have experienced. Suppose there is a marriage in your family and you are in charge of all the arrangements. Whoa! That's a tough job on your hands. You have to book a hotel or venue for the marriage, talk to a caterer about the food, organize a florist for the decorations, and finally handle the musical arrangements expected for the event.

In the past, you'd have done all this by yourself: talking to the relevant folks, coordinating with them, and negotiating the pricing. Now life is simpler. You go and talk to an event manager who handles this for you. The event manager talks to the individual service providers and gets the best deal for you.

From the Façade pattern perspective, we have the following three main participants:

• Client: It's you who needs all the marriage preparations to be completed in time before the wedding. They should be top class and the guests should love the celebrations.
• Façade: The event manager, who is responsible for talking to all the folks that need to work on specific arrangements such as food and flower decorations.
• Subsystems: They represent the systems that provide services such as catering, hotel management, and flower decorations.

Let's develop an application in Python v3.5 and implement this use case. We start with the client first. It's you! Remember, you're the one who has been given the responsibility to make sure that the marriage preparations are done and the event goes fine!
However, you're being clever here and passing on the responsibility to the event manager, aren't you? Let's now look at the You class. In this example, you create an object of the EventManager class so that the manager can work with the relevant folks on the marriage preparations while you relax.

    class You(object):
        def __init__(self):
            print("You:: Whoa! Marriage Arrangements??!!!")

        def askEventManager(self):
            print("You:: Let's Contact the Event Manager\n\n")
            # The client only ever talks to the facade
            em = EventManager()
            em.arrange()

        def __del__(self):
            # Runs when the object is garbage-collected
            print("You:: Thanks to Event Manager, all preparations done! Phew!")

Let's now move ahead and talk about the Façade class. As discussed earlier, the Façade class simplifies the interface for the client. In this case, EventManager acts as a façade and simplifies the work for You. The Façade talks to the subsystems and does all the booking and preparations for the marriage on your behalf. Here is the Python code for the EventManager class:

    class EventManager(object):
        def __init__(self):
            print("Event Manager:: Let me talk to the folks\n")

        def arrange(self):
            # Create each subsystem and delegate the work to it
            self.hotelier = Hotelier()
            self.hotelier.bookHotel()

            self.florist = Florist()
            self.florist.setFlowerRequirements()

            self.caterer = Caterer()
            self.caterer.setCuisine()

            self.musician = Musician()
            self.musician.setMusicType()

Now that we're done with the Façade and the client, let's dive into the subsystems. We have developed the following classes for this scenario:

• Hotelier is for the hotel bookings. It has a method to check whether the hotel is free on that day (__isAvailable) and a method to book the hotel if it is free (bookHotel).
• The Florist class is responsible for flower decorations. Florist has the setFlowerRequirements() method, which sets the expectations on the kind of flowers needed for the marriage decoration.
• The Caterer class is used to deal with the caterer and is responsible for the food arrangements. Caterer exposes the setCuisine() method to accept the type of cuisine to be served at the marriage.
• The Musician class is designed for the musical arrangements at the marriage. It uses the setMusicType() method to understand the music requirements for the event.

    class Hotelier(object):
        def __init__(self):
            print("Arranging the Hotel for Marriage? --")

        def __isAvailable(self):
            print("Is the Hotel free for the event on given day?")
            return True

        def bookHotel(self):
            if self.__isAvailable():
                print("Registered the Booking\n\n")

    class Florist(object):
        def __init__(self):
            print("Flower Decorations for the Event? --")

        def setFlowerRequirements(self):
            print("Carnations, Roses and Lilies would be used for Decorations\n\n")

    class Caterer(object):
        def __init__(self):
            print("Food Arrangements for the Event --")

        def setCuisine(self):
            print("Chinese & Continental Cuisine to be served\n\n")

    class Musician(object):
        def __init__(self):
            print("Musical Arrangements for the Marriage --")

        def setMusicType(self):
            print("Jazz and Classical will be played\n\n")

    you = You()
    you.askEventManager()

The output of the preceding code is given here:

[Figure: console output of the preceding script]

In the preceding code example:

• The EventManager class is the Façade that simplifies the interface for You
• EventManager uses composition to create objects of the subsystems such as Hotelier, Caterer, and others

The principle of least knowledge

As you have learned in the initial parts of this article, the Façade provides a unified interface that makes subsystems easy to use. It also decouples the client from the subsystem's components. The design principle behind the Façade pattern is the principle of least knowledge.

The principle of least knowledge guides us to reduce the interactions between objects to just a few close friends. In real terms, it means the following:

• When designing a system, for every object created, one should look at the number of classes that it interacts with and the way in which the interaction happens.
• Following the principle, we should avoid situations where many classes are tightly coupled to each other.
• If there are a lot of dependencies between classes, the system becomes hard to maintain. A change in one part of the system can lead to unintentional changes in other parts, which exposes the system to regressions and should be avoided.

Summary

We began the article by understanding the Façade design pattern and the context in which it's used. We covered the basis of Façade and how it is used in software architecture. We looked at how the Façade design pattern creates a simplified interface for clients to use, hiding the complexity of the subsystems so that the client benefits. The Façade doesn't encapsulate the subsystem, and the client is free to access the subsystems even without going through the Façade. You also studied the pattern with a UML diagram and a sample code implementation in Python v3.5. Finally, we covered the principle of least knowledge and how its philosophy governs the Façade design pattern.
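The principle of least knowledge discussed above can be made concrete with a small comparison. The following is only a minimal sketch, not the article's implementation: it reuses the Hotelier, Caterer, and EventManager names, but the snake_case methods, the optional constructor arguments, the BudgetHotelier class, and the two client functions are assumptions introduced for illustration. The methods return strings instead of printing so that the behaviour is easy to check.

```python
class Hotelier:
    """Subsystem: knows how to book a hotel, knows nothing about the facade."""

    def book_hotel(self):
        return "hotel booked"


class Caterer:
    """Subsystem: knows how to arrange food."""

    def set_cuisine(self):
        return "cuisine set"


class EventManager:
    """Facade: the one 'close friend' the client talks to."""

    def __init__(self, hotelier=None, caterer=None):
        # Hypothetical extension: subsystems may be injected; defaults are created here.
        self._hotelier = hotelier or Hotelier()
        self._caterer = caterer or Caterer()

    def arrange(self):
        return [self._hotelier.book_hotel(), self._caterer.set_cuisine()]


def coupled_client():
    """Client that depends on every subsystem class directly."""
    return [Hotelier().book_hotel(), Caterer().set_cuisine()]


def decoupled_client(manager):
    """Client that depends only on the facade's arrange() method."""
    return manager.arrange()


class BudgetHotelier(Hotelier):
    """A replacement subsystem; swapping it in does not touch decoupled_client."""

    def book_hotel(self):
        return "budget hotel booked"


print(decoupled_client(EventManager()))
print(decoupled_client(EventManager(hotelier=BudgetHotelier())))
```

In this sketch, coupled_client has to change whenever a subsystem class is renamed or replaced, whereas decoupled_client only knows about arrange(). That is the kind of reduced coupling the principle asks for.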

Packt
20 Oct 2015
9 min read

Application Patterns

In this article by Marcelo Reyna, author of the book Meteor Design Patterns, we will cover application-wide patterns that share server-side and client-side code. With these patterns, your code will become more secure and easier to manage. You will learn the following topic:

• Filtering and paging collections

Filtering and paging collections

So far, we have been publishing collections without thinking much about how many documents we are pushing to the client. The more documents we publish, the longer it takes the web page to load. To solve this issue, we are going to learn how to show only a set number of documents and allow the user to navigate through the documents in the collection by either filtering or paging through them. Filters and pagination are easy to build with Meteor's reactivity.

Router gotchas

Routers accept two types of parameters: query parameters and normal parameters. Query parameters are the ones that you commonly see in site URLs after a question mark (<url-path>?page=1), while normal parameters are the type that you define within the route URL (<url>/<normal-parameter>/named_route/<normal-parameter-2>). It is a common practice to use query parameters for things such as pagination to keep your routes from creating URL conflicts.

A URL conflict happens when two routes look the same but have different parameters. A products route such as /products/:page collides with a product detail route such as /products/:product-id. Although the two routes are expressed differently because their normal parameters differ, you arrive at both of them using the same URL. This means that the only way the router can tell them apart is by routing to them programmatically. The user would have to know that the FlowRouter.go() command has to be run in the console to reach either one of the products pages instead of simply using the URL.

This is why we are going to use query parameters to keep our filtering and pagination stateful.

Stateful pagination

Stateful pagination simply means giving the user the option to copy and paste the URL to a different client and see the exact same section of the collection. This is important to make the site easy to share.

Now we are going to see how to control our subscription reactively so that the user can navigate through the entire collection. First, we need to set up our router to accept a page number. Then we will take this number and use it in our subscriber to pull in the data that we need. To set up the router, we will use a FlowRouter query parameter (the parameter that follows a question mark in the URL).

Let's set up our query parameter:

    # /products/client/products.coffee
    Template.created "products", ->
      @autorun =>
        tags = Session.get "products.tags"
        filter =
          page: Number(FlowRouter.getQueryParam("page")) or 0
        if tags and not _.isEmpty tags
          _.extend filter,
            tags:tags
        order = Session.get "global.order"
        if order and not _.isEmpty order
          _.extend filter,
            order:order
        @subscribe "products", filter

    Template.products.helpers
      ...
      pages:
        current: ->
          FlowRouter.getQueryParam("page") or 0

    Template.products.events
      "click .next-page": ->
        # Fall back to 0 when the page parameter is missing (Number(undefined) is NaN)
        current = Number(FlowRouter.getQueryParam("page")) or 0
        FlowRouter.setQueryParams
          page: current + 1
      "click .previous-page": ->
        current = Number(FlowRouter.getQueryParam("page")) or 0
        if current - 1 < 0
          page = 0
        else
          page = current - 1
        FlowRouter.setQueryParams
          page: page

What we are doing here is straightforward. First, we extend the filter object with a page key that gets the current value of the page query parameter; if this value does not exist, it is set to 0. Because getQueryParam is a reactive data source, the autorun function will resubscribe when the value changes. Then we create a helper for our view so that we can see what page we are on, and two events that set the page query parameter.

But wait.
How do we know when the limit to pagination has been reached? This is where the tmeasday:publish-counts package is very useful. It uses a publisher's special function to count exactly how many documents are being published. Let's set up our publisher:

    # /products/server/products_pub.coffee
    Meteor.publish "products", (ops={}) ->
      limit = 10
      product_options =
        skip:ops.page * limit
        limit:limit
        sort:
          name:1
      if ops.tags and not _.isEmpty ops.tags
        @relations
          collection:Tags
          ...
            collection:ProductsTags
            ...
              collection:Products
              foreign_key:"product"
              options:product_options
              mappings:[
                ...
              ]
      else
        Counts.publish this,"products",
          Products.find()
          noReady:true
        @relations
          collection:Products
          options:product_options
          mappings:[
            ...
          ]
      if ops.order and not _.isEmpty ops.order
        ...
      @ready()

To publish our counts, we used the Counts.publish function. This function takes in a few parameters:

    Counts.publish <always this>,<name of count>, <collection to count>, <parameters>

Note that we used the noReady parameter to prevent the ready function from running prematurely. By doing this, we generate a counter that can be accessed on the client side by running Counts.get "products".

Now you might be thinking, why not use Products.find().count() instead? That may look like a reasonable idea in this particular scenario, but you have to use the Counts function to make the count reactive, so that if any dependencies change, they are accounted for.

Let's modify our view and helpers to reflect our counter:

    # /products/client/products.coffee
    ...
    Template.products.helpers
      pages:
        current: ->
          FlowRouter.getQueryParam("page") or 0
        is_last_page: ->
          current_page = Number(FlowRouter.getQueryParam("page")) or 0
          max_allowed = 10 + current_page * 10
          max_products = Counts.get "products"
          # >= so that an exactly full final page is treated as the last one
          max_allowed >= max_products

    //- /products/client/products.jade
    template(name="products")
      div#products.template
        ...
        section#featured_products
          div.container
            div.row
              br.visible-xs
              //- PAGINATION
              div.col-xs-4
                button.btn.btn-block.btn-primary.previous-page
                  i.fa.fa-chevron-left
              div.col-xs-4
                button.btn.btn-block.btn-info {{pages.current}}
              div.col-xs-4
                unless pages.is_last_page
                  button.btn.btn-block.btn-primary.next-page
                    i.fa.fa-chevron-right
              div.clearfix
              br
              //- PRODUCTS
              +momentum(plugin="fade-fast")
                ...

Great! Users can now copy and paste the URL to obtain the same results they had before. This is exactly what we need to make sure our customers can share links. If we had kept our page variable confined to a Session or a ReactiveVar, it would have been impossible to share the state of the web app.

Filtering

Filtering and searching, too, are critical aspects of any web app. Filtering works much like pagination: the publisher takes additional variables that control the filter. We want to make sure that this is stateful, so we need to integrate it into our routes and program our publishers to react to it. The filter also needs to be compatible with the pager. Let's start by modifying the publisher:

    # /products/server/products_pub.coffee
    Meteor.publish "products", (ops={}) ->
      limit = 10
      product_options =
        skip:ops.page * limit
        limit:limit
        sort:
          name:1
      filter = {}
      if ops.search and not _.isEmpty ops.search
        _.extend filter,
          name:
            $regex: ops.search
            $options:"i"
      if ops.tags and not _.isEmpty ops.tags
        @relations
          collection:Tags
          mappings:[
            ...
            collection:ProductsTags
            mappings:[
              collection:Products
              filter:filter
              ...
            ]
      else
        Counts.publish this,"products",
          Products.find filter
          noReady:true
        @relations
          collection:Products
          filter:filter
          ...
      if ops.order and not _.isEmpty ops.order
        ...
      @ready()

To build any filter, we have to make sure that the property that creates the filter exists, and then _.extend our filter object based on it. This makes our code easier to maintain. Notice that we can easily add the filter to every section that includes the Products collection.
With this, we have ensured that the filter is always used even if tags have filtered the data. By adding the filter to the Counts.publish function, we have ensured that the publisher is compatible with pagination as well. Let's build our controller:

    # /products/client/products.coffee
    Template.created "products", ->
      @autorun =>
        ops =
          page: Number(FlowRouter.getQueryParam("page")) or 0
          search: FlowRouter.getQueryParam "search"
        ...
        @subscribe "products", ops

    Template.products.helpers
      ...
      pages:
        search: ->
          FlowRouter.getQueryParam "search"
        ...

    Template.products.events
      ...
      "change .search": (event) ->
        search = $(event.currentTarget).val()
        if _.isEmpty search
          search = null
        FlowRouter.setQueryParams
          search:search
          page:null

First, we have renamed our filter object to ops to keep things consistent between the publisher and subscriber. Then we have attached a search key to the ops object that takes the value of the search query parameter. Notice that we can pass an undefined value for search and our subscriber will not fail, since the publisher already checks whether the value exists and extends the filter based on this. It is always better to verify variables on the server side to ensure that the client doesn't accidentally break things. The view also needs to know the value of that parameter, so we have created a new search helper under the pages helper. Finally, we have built an event for the search bar. Notice that we set query parameters to null whenever they do not apply. This makes sure that they do not appear in our URL if we do not need them.

To finish, we need to create the search bar:

    //- /products/client/products.jade
    template(name="products")
      div#products.template
        header#promoter
          ...
        div#content
          section#features
            ...
          section#featured_products
            div.container
              div.row
                //- SEARCH
                div.col-xs-12
                  div.form-group.has-feedback
                    input.input-lg.search.form-control(type="text" placeholder="Search products" autocapitalize="off" autocorrect="off" autocomplete="off" value="{{pages.search}}")
                    span(style="pointer-events:auto; cursor:pointer;").form-control-feedback.fa.fa-search.fa-2x
                ...

Notice that our search input is somewhat cluttered with special attributes. These attributes stop the input from doing things that we do not want it to do on iOS Safari, such as auto-capitalizing or auto-correcting the search text. It is important to keep up with nonstandard attributes such as these to ensure that the site is mobile-friendly. You can find an updated list of these attributes at https://developer.apple.com/library/safari/documentation/AppleApplications/Reference/SafariHTMLRef/Articles/Attributes.html.

Summary

This article covered how to control the amount of data that we publish. We also learned a pattern for building pagination that works together with filters, along with code examples.
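The arithmetic behind the pager and the filter is independent of Meteor, so it can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: page_window, is_last_page, and fetch_page are hypothetical names, an in-memory list of product names stands in for the Products collection, and a case-insensitive substring test stands in for the $regex filter. The sketch treats an exactly full final page as the last page.

```python
PAGE_SIZE = 10  # mirrors `limit = 10` in the publisher


def page_window(page, page_size=PAGE_SIZE):
    """Return (skip, limit) for a zero-based page; missing or negative pages map to 0."""
    page = max(int(page or 0), 0)
    return page * page_size, page_size


def is_last_page(page, total, page_size=PAGE_SIZE):
    """True when the current window already reaches the end of the collection."""
    skip, limit = page_window(page, page_size)
    return skip + limit >= total


def fetch_page(names, page, search=None, page_size=PAGE_SIZE):
    """Filter first, then sort by name and slice; also return the filtered total."""
    if search:
        names = [n for n in names if search.lower() in n.lower()]
    ordered = sorted(names)
    skip, limit = page_window(page, page_size)
    return ordered[skip:skip + limit], len(ordered)


products = ["product%02d" % i for i in range(25)]
rows, total = fetch_page(products, page=2)
print(rows, total, is_last_page(2, total))
```

Counting after the filter has been applied is the point made about Counts.publish above: the pager has to know the size of the filtered set, not of the whole collection, or the last-page check will be wrong whenever a search is active.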

Packt
25 Sep 2015
15 min read

Patterns of Traversing

In this article by Ryan Lemmer, author of the book Haskell Design Patterns, we will focus on two fundamental patterns of recursion: fold and map. The more primitive forms of these patterns are to be found in the Prelude, the "old part" of Haskell. With the introduction of Applicative came more powerful mapping (traversal), which opened the door to type-level folding and mapping in Haskell. First, we will look at how Prelude's list fold is generalized to all Foldable containers. Then, we will follow the generalization of list map to all Traversable containers. Our exploration of fold and map culminates with the Lens library, which raises Foldable and Traversable to an even higher level of abstraction and power.

In this article, we will cover the following:

• Traversable
• Modernizing Haskell
• Lenses

Traversable

As with Prelude.foldM, mapM fails us beyond lists. For example, we cannot mapM over the Tree from earlier:

    main = mapM doF aTree >>= print -- INVALID

The Traversable type-class is to map what Foldable is to fold:

    -- required: traverse or sequenceA
    class (Functor t, Foldable t) => Traversable (t :: * -> *) where
      -- APPLICATIVE form
      traverse :: Applicative f => (a -> f b) -> t a -> f (t b)
      sequenceA :: Applicative f => t (f a) -> f (t a)
      -- MONADIC form (redundant)
      mapM :: Monad m => (a -> m b) -> t a -> m (t b)
      sequence :: Monad m => t (m a) -> m (t a)

The traverse function generalizes our mapA function, which was written for lists, to all Traversable containers.
Similarly, Traversable.mapM is a more general version of Prelude.mapM for lists:

    mapM :: Monad m => (a -> m b) -> [a] -> m [b]
    mapM :: Monad m => (a -> m b) -> t a -> m (t b)

The Traversable type-class was introduced along with Applicative:

"we introduce the type class Traversable, capturing functorial data structures through which we can thread an applicative computation"
    Applicative Programming with Effects - McBride and Paterson

A Traversable Tree

Let's make our Tree Traversable. First, we'll do it the hard way; a Traversable must also be a Functor and Foldable:

    instance Functor Tree where
      fmap f (Leaf x) = Leaf (f x)
      fmap f (Node x lTree rTree) = Node (f x) (fmap f lTree) (fmap f rTree)

    instance Foldable Tree where
      foldMap f (Leaf x) = f x
      foldMap f (Node x lTree rTree) =
        (foldMap f lTree) `mappend` (f x) `mappend` (foldMap f rTree)

    --traverse :: Applicative ma => (a -> ma b) -> mt a -> ma (mt b)
    instance Traversable Tree where
      traverse g (Leaf x) = Leaf <$> (g x)
      traverse g (Node x ltree rtree) =
        Node <$> (g x) <*> (traverse g ltree) <*> (traverse g rtree)

    data Tree a = Node a (Tree a) (Tree a) | Leaf a
      deriving (Show)

    aTree = Node 2 (Leaf 3) (Node 5 (Leaf 7) (Leaf 11))

    -- import Data.Traversable
    main = traverse doF aTree
      where doF n = do print n; return (n * 2)

The easier way to do this is to auto-implement Functor, Foldable, and Traversable:

    {-# LANGUAGE DeriveFunctor #-}
    {-# LANGUAGE DeriveFoldable #-}
    {-# LANGUAGE DeriveTraversable #-}
    import Data.Traversable

    data Tree a = Node a (Tree a) (Tree a) | Leaf a
      deriving (Show, Functor, Foldable, Traversable)

    aTree = Node 2 (Leaf 3) (Node 5 (Leaf 7) (Leaf 11))

    main = traverse doF aTree
      where doF n = do print n; return (n * 2)

Traversal and the Iterator pattern

The Gang of Four Iterator pattern is concerned with providing a way

"...to access the elements of an aggregate object sequentially without exposing its underlying representation"
    "Gang of Four" Design Patterns, Gamma et al, 1995

In The Essence of the Iterator Pattern, Jeremy Gibbons shows precisely how the Applicative traversal captures the Iterator pattern. The Traversable.traverse function is the Applicative version of Traversable.mapM, which means it is more general than mapM (because Applicative is more general than Monad). Moreover, because mapM does not rely on the Monadic bind chain to communicate between iteration steps, Monad is a superfluous type for mapping with effects (Applicative is sufficient). In other words, Applicative traverse is superior to Monadic traversal (mapM):

"In addition to being parametrically polymorphic in the collection elements, the generic traverse operation is parametrised along two further dimensions: the datatype being traversed, and the applicative functor in which the traversal is interpreted"

"The improved compositionality of applicative functors over monads provides better glue for fusion of traversals, and hence better support for modular programming of iterations"
    The Essence of the Iterator Pattern - Jeremy Gibbons

Modernizing Haskell 98

The introduction of Applicative, along with Foldable and Traversable, had a big impact on Haskell. Foldable and Traversable lift Prelude fold and map to a much higher level of abstraction. Moreover, Foldable and Traversable also bring a clean separation between processes that preserve the shape of the structure being processed and processes that discard it. Traversable describes processes that preserve the shape of the data structure being traversed over. Foldable processes, in turn, discard or transform the shape of the structure being folded over. Since Traversable is a specialization of Foldable, we can say that shape preservation is a special case of shape transformation.
This line between shape preservation and transformation is clearly visible from the fact that functions that discard their results (for example, mapM_, forM_, sequence_, and so on) are in Foldable, while their shape-preserving counterparts are in Traversable.

Due to the relatively late introduction of Applicative, the benefits of Applicative, Foldable, and Traversable have not yet found their way into the core of the language. This is set to change with the Foldable Traversable In Prelude proposal (planned for inclusion in the core libraries from GHC 7.10). For more information, visit https://wiki.haskell.org/Foldable_Traversable_In_Prelude. This will involve replacing less generic functions in Prelude, Control.Monad, and Data.List with their more polymorphic counterparts in Foldable and Traversable.

There have been objections to the movement to modernize, the main concern being that more generic types are harder to understand, which may compromise Haskell as a learning language. These valid concerns will indeed have to be addressed, but it seems certain that the Haskell community will not resist climbing to new abstract heights.

Lenses

A Lens is a type that provides access to a particular part of a data structure. Lenses express a high-level pattern for composition. However, Lens is also deeply entwined with Traversable, and so we describe it in this article instead. Lenses relate to the getter and setter functions, which also describe access to parts of data structures. To find our way to the Lens abstraction (as per Edward Kmett's Lens library), we'll start by writing a getter and setter to access the root node of a Tree.
Deriving Lens

Returning to our Tree from earlier:

    data Tree a = Node a (Tree a) (Tree a) | Leaf a
      deriving (Show)

    intTree = Node 2 (Leaf 3) (Node 5 (Leaf 7) (Leaf 11))
    listTree = Node [1,1] (Leaf [2,1]) (Node [3,2] (Leaf [5,2]) (Leaf [7,4]))
    tupleTree = Node (1,1) (Leaf (2,1)) (Node (3,2) (Leaf (5,2)) (Leaf (7,4)))

Let's start by writing generic getter and setter functions:

    getRoot :: Tree a -> a
    getRoot (Leaf z) = z
    getRoot (Node z _ _) = z

    setRoot :: Tree a -> a -> Tree a
    setRoot (Leaf z) x = Leaf x
    setRoot (Node z l r) x = Node x l r

    main = do
      print $ getRoot intTree
      print $ setRoot intTree 11
      print $ getRoot (setRoot intTree 11)

If we want to pass in a setter function instead of setting a value, we use the following:

    fmapRoot :: (a -> a) -> Tree a -> Tree a
    fmapRoot f tree = setRoot tree newRoot
      where newRoot = f (getRoot tree)

We have to do a get, apply the function, and then set the result. This double work is akin to the double traversal we saw when writing traverse in terms of sequenceA. In that case, we resolved the issue by defining traverse first (and then sequenceA in terms of traverse). We can do the same thing here by writing fmapRoot' to work in a single step (and then rewriting setRoot' in terms of fmapRoot'):

    fmapRoot' :: (a -> a) -> Tree a -> Tree a
    fmapRoot' f (Leaf z) = Leaf (f z)
    fmapRoot' f (Node z l r) = Node (f z) l r

    setRoot' :: Tree a -> a -> Tree a
    setRoot' tree x = fmapRoot' (\_ -> x) tree

    main = do
      print $ setRoot' intTree 11
      print $ fmapRoot' (*2) intTree

The fmapRoot' function delivers a function to a particular part of the structure and returns the same structure:

    fmapRoot' :: (a -> a) -> Tree a -> Tree a

To allow for I/O, we need a new function:

    fmapRootIO :: (a -> IO a) -> Tree a -> IO (Tree a)

We can generalize this beyond I/O to all Monads:

    fmapM :: Monad m => (a -> m a) -> Tree a -> m (Tree a)

It turns out that if we relax the requirement for Monad, and generalize the container type f' to any Functor, then we get a simple van Laarhoven Lens!
    type Lens' s a = Functor f' => (a -> f' a) -> s -> f' s

The remarkable thing about a van Laarhoven Lens is that given the preceding function type, we also gain "get", "set", "fmap", "mapM", and many other functions and operators. The Lens function type signature is all it takes to make something a Lens that can be used with the Lens library. It is unusual to use a type signature as the "primary interface" for a library. The immediate benefit is that we can define a lens without referring to the Lens library. We'll explore more benefits and costs of this approach, but first let's write a few lenses for our Tree.

The derivation of the Lens abstraction used here is based on Jakub Arnold's Lens tutorial, which is available at http://blog.jakubarnold.cz/2014/07/14/lens-tutorial-introduction-part-1.html.

Writing a Lens

A Lens is said to provide focus on an element in a data structure. Our first lens will focus on the root node of a Tree. Using the lens type signature as our guide, we arrive at:

    lens' :: Functor f' => (a -> f' a) -> s -> f' s
    root :: Functor f' => (a -> f' a) -> Tree a -> f' (Tree a)

Still, this is not very tangible; fmapRootIO is easier to understand, with the Functor f' being IO:

    fmapRootIO :: (a -> IO a) -> Tree a -> IO (Tree a)
    fmapRootIO g (Leaf z) = (g z) >>= return . Leaf
    fmapRootIO g (Node z l r) = (g z) >>= return . (\x -> Node x l r)

    displayM x = print x >> return x

    main = fmapRootIO displayM intTree

If we drop down from Monad into Functor, we have a Lens for the root of a Tree:

    root :: Functor f' => (a -> f' a) -> Tree a -> f' (Tree a)
    root g (Node z l r) = fmap (\x -> Node x l r) (g z)
    root g (Leaf z) = fmap Leaf (g z)

As Monad is a Functor, this function also works with Monadic functions:

    main = root displayM intTree

As root is a lens, the Lens library gives us the following:

    -- import Control.Lens
    main = do
      -- GET
      print $ view root listTree
      print $ view root intTree
      -- SET
      print $ set root [42] listTree
      print $ set root 42 intTree
      -- FMAP
      print $ over root (+11) intTree

The over function is the lens way of fmap'ing a function into a Functor.

Composable getters and setters

Another Lens on Tree might focus on the rightmost leaf:

    rightMost :: Functor f' => (a -> f' a) -> Tree a -> f' (Tree a)
    rightMost g (Node z l r) = fmap (\r' -> Node z l r') (rightMost g r)
    rightMost g (Leaf z) = fmap (\x -> Leaf x) (g z)

The Lens library provides several lenses for Tuple (for example, _1, which brings focus to the first Tuple element). We can compose our rightMost lens with the Tuple lenses:

    main = do
      print $ view rightMost tupleTree
      print $ set rightMost (0,0) tupleTree
      -- Compose Getters and Setters
      print $ view (rightMost._1) tupleTree
      print $ set (rightMost._1) 0 tupleTree
      print $ over (rightMost._1) (*100) tupleTree

A Lens can serve as a getter, setter, or "function setter". We are composing lenses using regular function composition (.)! Note that the order of composition is reversed: in (rightMost._1), the rightMost lens is applied before the _1 lens.

Lens Traversal

A Lens focuses on one part of a data structure, not several. For example, a lens cannot focus on all the leaves of a Tree:

    set leaves 0 intTree
    over leaves (+1) intTree

To focus on more than one part of a structure, we need a Traversal (the Lens generalization of Traversable).
Whereas Lens relies on Functor, Traversal relies on Applicative. Other than this, the signatures are exactly the same:

    traversal :: Applicative f' => (a -> f' a) -> Tree a -> f' (Tree a)
    lens      :: Functor f'     => (a -> f' a) -> Tree a -> f' (Tree a)

A leaves Traversal delivers the setter function to all the leaves of the Tree:

    leaves :: Applicative f' => (a -> f' a) -> Tree a -> f' (Tree a)
    leaves g (Node z l r) = Node z <$> leaves g l <*> leaves g r
    leaves g (Leaf z) = Leaf <$> (g z)

We can use the set and over functions with our new Traversal:

    set leaves 0 intTree
    over leaves (+1) intTree

Traversals compose seamlessly with Lenses:

    main = do
      -- Compose Traversal + Lens
      print $ over (leaves._1) (*100) tupleTree
      -- Compose Traversal + Traversal
      print $ over (leaves.both) (*100) tupleTree
      -- map over each elem in target container (e.g. list)
      print $ over (leaves.mapped) (*(-1)) listTree
      -- Traversal with effects
      mapMOf leaves displayM tupleTree

(both is a Tuple Traversal that focuses on both elements.)

Lens.Fold

Lens.Traversal lifts Traversable into the realm of lenses, while Lens.Fold does the same for Foldable:

    main = do
      print $ sumOf leaves intTree
      print $ anyOf leaves (>0) intTree

The Lens Library

We have used only "simple" Lenses so far. A fully parametrized Lens allows for replacing parts of a data structure with different types:

    type Lens s t a b = forall f'. Functor f' => (a -> f' b) -> s -> f' t
    -- vs simple Lens
    type Lens' s a = Lens s s a a

Lens library function names do their best not to clash with existing names, for example, by postfixing idiomatic function names with "Of" (sumOf, mapMOf, and so on), or by using different verb forms such as "droppingWhile" instead of "dropWhile". While this creates a burden, as one has to learn new variations, it does have a big plus point: it allows for easy unqualified import of the Lens library.

By leaving the Lens function type transparent (and not obfuscating it with a new type), we get Traversals by simply swapping out Functor for Applicative. We also get to define lenses without having to reference the Lens library. On the downside, Lens type signatures can be bewildering at first sight. They form a language of their own that requires effort to get used to, for example:

    mapMOf :: Profunctor p => Over p (WrappedMonad m) s t a b -> p a (m b) -> s -> m t
    foldMapOf :: Profunctor p => Accessing p r s a -> p a r -> s -> r

On the surface, the Lens library gives us composable getters and setters, but there is much more to Lenses than that. By generalizing Foldable and Traversable into Lens abstractions, the Lens library lifts Getters, Setters, Lenses, and Traversals into a unified framework in which they all compose together.

Edward Kmett's Lens library is a sprawling masterpiece that is sure to leave a lasting impact on idiomatic Haskell.

Summary

We started with Lists (Haskell 98), then generalized to all Traversable containers (introduced in the mid-2000s). Following that, we saw how the Lens library (2012) places traversing in an even broader context. Lenses give us a unified vocabulary to navigate data structures, which explains why they have been described as a "query language for data structures".

Resources for Article:

Further resources on this subject:
Plotting in Haskell [article]
The Hunt for Data [article]
Getting started with Haskell [article]

Packt
22 Sep 2015
18 min read

Cassandra Design Patterns

In this article by Rajanarayanan Thottuvaikkatumana, author of the book Cassandra Design Patterns, Second Edition, we look at Apache Cassandra, one of the most popular NoSQL data stores. Cassandra is based on the research papers Dynamo: Amazon’s Highly Available Key-Value Store and Bigtable: A Distributed Storage System for Structured Data, and is implemented with the best features from both of them. In general, NoSQL data stores can be classified into the following groups:

Key-value data store
Column family data store
Document data store
Graph data store

Cassandra belongs to the column family data store group. Cassandra’s peer-to-peer architecture avoids single points of failure in the cluster of Cassandra nodes and gives the ability to distribute the nodes across racks or data centres. This makes Cassandra a linearly scalable data store. In other words, the more processing you need, the more Cassandra nodes you can add to your cluster. Cassandra’s multi data centre support makes it a good choice for replicating data stores across data centres for disaster recovery, high availability, separating transaction processing and analytical environments, and for building resiliency into the data store infrastructure.

Design patterns in Cassandra

The term “design patterns” is highly misinterpreted in the software development community. In an extremely general sense, it is a set of solutions for some known problems in quite a specific context. It is used in this book to describe a pattern of using certain features of Cassandra to solve some real-world problems. This book is a collection of such design patterns with real-world examples.

Coexistence patterns

Cassandra is one of the most successful NoSQL data stores, and one that closely resembles the traditional RDBMS.
Cassandra column families (also known as Cassandra tables) look, from a logical perspective, much like RDBMS tables to their users, even though the underlying structure of these tables is totally different. Because of this, Cassandra is well suited to being deployed alongside a traditional RDBMS to solve some of the problems that the RDBMS is not able to handle.

The caveat here is that, because RDBMS tables and Cassandra column families look so similar to end users, many users and data modelers try to use Cassandra in exactly the same way as an RDBMS schema is modeled and used, and get into serious deployment issues. How do you prevent such pitfalls? The key is to understand the differences from a theoretical as well as a practical perspective, and to follow the best practices prescribed by the creators of Cassandra.

Where do you start with Cassandra? The best place to look is at new application development requirements, and to take it from there. Look at the cases where there is a need to de-normalize the RDBMS tables and keep together all the data items that would have been distributed across tables if you were to design the same solution in an RDBMS. Instead of thinking purely from the data model perspective, start thinking from the application's perspective: how the data is generated by the application, what the read requirements are, what the write requirements are, what response time is expected for some of the use cases, and so on. Depending on these aspects, design the data model. In the big data world, the application becomes the first-class citizen and the data model leaves the driving seat in the application design. Design the data model to serve the needs of the applications.

In any organization, new reporting requirements come up all the time. The major challenge in generating reports is the underlying data store. In the RDBMS world, reporting is always a challenge.
You may have to join multiple tables to generate even simple reports. Even though RDBMS objects such as views, stored procedures, and indexes may be used to get the desired data for the reports, the query plan is going to be very complex most of the time when the report is generated. The consumption of processing power is another factor to consider when generating such reports on the fly. Because of these complexities, it is common to keep separate tables, containing data exported from the transactional tables, for reporting requirements. This is a great opportunity to start with NoSQL stores like Cassandra as a reporting data store.

Data aggregation and summarization are common requirements in any organization. They help to control data growth by storing only the summary statistics and moving the transactional data into archives. Many times, this aggregated and summarized data is used for statistical analysis. Making the summary accurate and easily accessible is a big challenge. Most of the time, data aggregation and reporting go hand in hand. The aggregated data is heavily used in reports, and the aggregation process speeds up the queries to a great extent. This is another place where you can start with NoSQL stores like Cassandra.

The coexistence of RDBMS and NoSQL data stores like Cassandra is very much possible, feasible, and sensible; and this is the only way to get started with the NoSQL movement, unless you embark on a totally new product development from scratch. In summary, this section of the book discusses some design patterns related to de-normalization, reporting, and aggregation of data using Cassandra as the preferred NoSQL data store.

RDBMS migration patterns

A big bang approach to any kind of technology migration is not advisable. A series of deliberations has to happen before the eventual and complete changeover. Migration from RDBMS to Cassandra is no different.
Any new technology replacing an old one must coexist with it harmoniously, at least for a short period of time. This gives the stakeholders a lot of confidence in the new technology. Many technology pundits offer various approaches to RDBMS-to-NoSQL migration. Many such guidelines are specific to particular NoSQL data stores and give attention to specific areas, and most of the time they end up focusing on the process rather than the technology.

The migration from RDBMS to Cassandra is not an easy task, mainly because RDBMS-based systems are time-tested and trustworthy in most organizations. So, migrating from such a robust RDBMS-based system to Cassandra is not going to be easy for anyone. One of the best approaches to achieve this goal is to exploit some of the new or unique features in Cassandra, which many traditional RDBMS don't have. This also prevents Cassandra from being used just like any other RDBMS. Cassandra is unique. Cassandra is not an RDBMS. The approach of banking on the unique features applies not only to RDBMS-to-Cassandra migration, but also to any migration from one paradigm to another. Some of the design patterns discussed in this section of the book revolve around very simple and important features of Cassandra, but have profound application potential when designing next-generation NoSQL data stores using Cassandra. A wise use of these unique features will give a head start on the eventual and complete migration from RDBMS.

Modeling collection objects in an RDBMS is a real pain, because multiple tables have to be defined and a join is required to access the data. Many RDBMS address this by providing the capability to define user-defined data types, but there is absolutely no standardization in this space. Collection objects are very commonly seen in real-world applications.
A list of actions, a tuple of related values, a set of objects, dictionaries, and similar structures come up quite often in applications. Cassandra has elegant ways to model these because they are data types in column families.

Counting is very commonly required in many business processes and applications. In an RDBMS, counts have to be modeled as integers or long numbers, and applications often make serious mistakes by using them in the wrong ways. Cassandra has a counter data type in the column family that alleviates this problem.

Getting rid of unwanted records from an RDBMS table is not an automatic process. When some application events occur, the records have to be removed by application programs or through some other means. But in many situations, data items have a preallocated time to live and should go away without the intervention of any external events. Cassandra has a way to assign a time-to-live (TTL) attribute to data items. With TTL, data items get removed without any external event's intervention.

All the design patterns covered in this section of the book revolve around some of the new features of Cassandra that will make the migration from RDBMS to Cassandra an easier task.

Cache migration pattern

Database access, whether from an RDBMS or from a highly distributed NoSQL data store, is always an input/output (I/O) intensive operation. It makes perfect sense to cache frequently used but reasonably static data for fast access by the applications consuming it. In such situations, an in-memory cache is preferred to repeated database access for each request. Using a cache is not always a pleasant experience. Really weird problems such as data loss, data getting out of sync with its source, and other data integrity problems are very common.

It is very common to see the wrong components coming into the enterprise solution stack all the time for various reasons.
Overlooking some of the features and adopting a technology without much background work is a very common pitfall. Many times, a cache comes into the solution stack to reduce the latency of responses. Once the initial results are favorable, more and more data gets tossed into the cache. Slowly, putting more and more data into the cache becomes standard practice. This is when problems start popping up one by one. Pure in-memory cache solutions are favored by everybody for their ability to serve data quickly, until you start losing data because of faults in the system, along with application and node crashes.

A cache serves data much faster than other data stores. But if the caching solution in use is causing data integrity problems, it is better to migrate to a NoSQL data store like Cassandra. Is Cassandra faster than the in-memory caching solutions? The obvious answer is no. But it is not as bad as many think. Cassandra can be configured to serve fast reads, and the bonus comes in the form of high data integrity with strong replication capabilities.

A cache is good as long as it serves its purpose without any data loss or other data integrity issues. This section of the book emphasizes the use case of the key/value type cache and discusses various methods of cache-to-NoSQL migration. Cassandra cannot be used as a replacement for a cache in terms of the speed of data access. But when it comes to data integrity, Cassandra shines with its tuneable consistency feature. With continual tuning, and with clean and well-written application code manipulating the data, data access can be improved to a great extent, and it will be much better than with many other data stores. The design pattern covered in this section of the book gives some guidance on migrating from caching solutions to Cassandra, if this is a must.
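The tuneable consistency mentioned above can be made concrete with a little arithmetic. The following is a minimal sketch in plain Python with no Cassandra driver involved. It assumes the commonly cited rule that a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF). The function names are hypothetical and purely illustrative; ONE, QUORUM, and ALL are used only as labels for the replica counts.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


def tolerated_failures(replicas_required: int, replication_factor: int) -> int:
    """How many replicas can be down while the operation can still succeed."""
    return replication_factor - replicas_required


RF = 3  # assumed replication factor for this illustration
for label, r, w in [("read ONE / write ONE", 1, 1),
                    ("read QUORUM / write QUORUM", quorum(RF), quorum(RF)),
                    ("read ONE / write ALL", 1, RF)]:
    print(label,
          "| overlapping reads and writes:", is_strongly_consistent(r, w, RF),
          "| replicas a write can lose:", tolerated_failures(w, RF))
```

Under these assumptions, raising the number of replicas an operation must reach buys consistency at the price of tolerating fewer unavailable nodes, which is the trade-off that the consistency level setting controls.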
CAP patterns

When it comes to large-scale Internet applications or web services, popularly known as Internet of Things (IoT) applications, the number of components is huge and the way they are distributed is beyond imagination. There will be hundreds of application servers, hundreds of data store nodes, and many other components in the whole ecosystem. In such a scenario, performing an atomic transaction by getting an agreement from all the components involved is, for all practical purposes, impossible. Consistency, availability, and partition tolerance are three important guarantees, popularly known as the CAP guarantees, that any distributed computing system should offer, even though all three are not possible simultaneously.

In IoT applications, the distribution of the application nodes is unavoidable. This means that network partitions are a real possibility. So, it is mandatory to give the P guarantee. Now, the question is whether to forfeit the C guarantee or the A guarantee. At this stage, the situation is not as grave as portrayed in the CAP theorem conjectured by Eric Brewer. Not every use case in a given IoT application needs a 100% C guarantee and a 100% A guarantee. So, depending on the level of A guarantee needed, the C guarantee can be tuned. This is called tunable consistency.

Depending on the way data is ingested into Cassandra, and the way it is consumed from Cassandra, tuning is possible to give the best results for the read and write requirements of the applications. In some applications, the speed at which the data is written will be very high. In other words, the velocity of data ingestion into Cassandra is very high. These are write-heavy applications. In some applications, the need to read data quickly is an important requirement. This is mainly the case in applications where a lot of data processing is required.
Data analytics applications, batch processing applications, and so on fall under this category. These are read-heavy applications. Now, there is a third category of applications where fast writes and fast reads are equally important. These are the kind of applications where there is a constant inflow of data and, at the same time, clients need to read the data for various purposes. These are read-write balanced applications.

The consistency level requirements for these three types of applications are totally different. There is no one way to tune that is optimal for all three. The consistency levels have to be tuned differently from use case to use case.

In this section of the book, various design patterns related to applications that need fast writes, fast reads, and moderate writes and reads are discussed. All these design patterns revolve around using the tuneable consistency parameters of Cassandra. Whether for writes or for reads, if the consistency levels are set high, the availability levels will be low, and vice versa. So, by making use of the consistency level knob, the Cassandra data store can be used for various types of writing and reading use cases.

Temporal patterns

In any application, data that varies over time is called temporal data, and it is very important. Temporal data is needed wherever there is a need to maintain chronology. There are many applications in which there is a huge need for storage, retrieval, and processing of data that is tied to time. The biggest challenge in dealing with temporal data in a data store is that it is heavily used for analytical purposes and retrieved in various sort orders based on time. So, the data stores that are used to capture temporal data should be capable of storing the data strictly adhering to the chronology.
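As an illustration of storing data in strict chronological order, here is a small in-memory sketch in Python. It is a toy stand-in and not a Cassandra API: the class name, the choice of one partition per source per day, and the sample readings are all assumptions made for this example. The idea it mimics is that rows sharing a partition key are kept sorted by a time-based ordering column, so reads come back in time order without a separate sort.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class TimeSeriesStore:
    """Toy in-memory stand-in for a time-series table (not a Cassandra API)."""

    def __init__(self):
        # partition key -> list of (timestamp, value), kept sorted by timestamp
        self._partitions = defaultdict(list)

    @staticmethod
    def _partition_key(source, ts):
        # Assumption: one partition per source per day, so that no single
        # partition grows without bound.
        return (source, ts.date())

    def insert(self, source, ts, value):
        rows = self._partitions[self._partition_key(source, ts)]
        bisect.insort(rows, (ts, value))  # keeps the partition in time order

    def read_day(self, source, day):
        """Return one day's readings for a source, oldest first."""
        return list(self._partitions.get((source, day), []))


store = TimeSeriesStore()
store.insert("sensor-1", datetime(2015, 9, 22, 10, 5), 21.5)
store.insert("sensor-1", datetime(2015, 9, 22, 9, 0), 20.1)  # arrives out of order
store.insert("sensor-1", datetime(2015, 9, 23, 8, 0), 19.8)  # lands in the next day's partition

for ts, value in store.read_day("sensor-1", datetime(2015, 9, 22).date()):
    print(ts.isoformat(), value)
```

Bucketing by day is only one possible choice; the right bucket size depends on how much data each source produces and how it is queried.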
There are many usage patterns seen in the real world that show temporal behavior. For classification purposes in this book, they are bucketed into three. The first one is the general time series category. The second one is the log category, such as an audit log, a transaction log, and so on. The third one is the conversation category, such as the conversation messages of a chat application. This classification is relevant because these categories are commonly seen across many applications. In many applications, these are really cross-cutting concerns that designers underestimate, and many applications end up with different data stores capturing this temporal data. There is a need for a common strategy for dealing with temporal data that falls into these three commonly seen categories in an enterprise-wide solution architecture. In other words, there should be a uniform way of capturing temporal data, a uniform way of processing temporal data, and a commonly used set of tools and libraries to manage the temporal data.

Out of the three design patterns discussed in this section of the book, the first, the Time Series pattern, is a general design pattern that covers the most general behavior of any kind of temporal data. The next two design patterns, namely the Log pattern and the Conversation pattern, are two special cases of the first design pattern.

This section of the book covers the general nature of temporal data, some specific instances of such data items in real-world applications, and why Cassandra is the best fit as a NoSQL data store to persist the temporal data. Temporal data comes up quite often in the use cases of many applications. Data modeling of temporal data is very important from the Cassandra perspective for optimal storage and quick access of the data.
Some common design patterns to model temporal data have been covered in this section of the book. By focusing on a few aspects, such as the partition key, primary key, clustering column, and the number of records that get stored in a wide row of Cassandra, very effective and high-performing temporal data models can be built.

Analytical patterns

The 3Vs of big data, namely Volume, Variety, and Velocity, pose another big challenge, which is the analysis of the data stored in NoSQL data stores such as Cassandra. What are the analytics use cases? How can the distributed data be processed? What are the data transformations that are typically seen in the applications? These are the topics covered in this section of the book. Unlike other sections of this book, the focus shifts from Cassandra to other technologies like Apache Hadoop, Hadoop MapReduce, and Apache Spark to introduce the big data analytics tool space. Design patterns such as the Map/Reduce Pattern and the Transformation Pattern are very commonly seen in the data analytics world. Cassandra and Apache Spark are compatible with each other, and together form a good tool set for data analysis use cases.

This section of the book covers some data analysis aspects and mainly discusses data processing. Data transformation is one of the major activities in data processing. Out of the many data processing patterns, the Map/Reduce Pattern deserves a special mention, because it is used in so many batch processing and analysis use cases dealing with big data. Spark has been chosen as the tool to explain the data processing activities. This section explains how a Map/Reduce kind of data processing task can be done using Cassandra. Spark, which is very powerful for performing online data analysis, is also discussed. This section of the book also covers some of the commonly seen data transformations that are used in data processing applications.
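To show the shape of a Map/Reduce style computation, here is a minimal sketch in plain Python. It does not use Spark, Hadoop, or Cassandra; the sample rows and the three function names are hypothetical, and the grouping step only imitates what a framework would do between the map and reduce phases.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample rows, standing in for data read from a data store.
orders = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 30.0},
]


def map_phase(rows):
    """Map: emit one (key, value) pair per input row."""
    return [(row["region"], row["amount"]) for row in rows]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: fold each key's values into a single total."""
    return {key: reduce(lambda acc, v: acc + v, values, 0.0)
            for key, values in grouped.items()}


totals = reduce_phase(shuffle(map_phase(orders)))
print(totals)
```

In a distributed setting the map and reduce steps would run in parallel on different nodes; the single-process version here is only meant to show how the two phases fit together.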
Summary

Many Cassandra design patterns have been covered in this book. If design patterns are not used in any real-world application, they have only theoretical value. To give a practical approach to the applicability of these design patterns, an end-to-end application is taken as a case in point and described in the last chapter of the book, which is used as a vehicle to explain the applicability of the Cassandra design patterns discussed in the earlier sections of the book.

Users love Cassandra because of its SQL-like interface, CQL. Also, its features are very closely related to those of an RDBMS, even though the paradigm is totally new. Application developers love Cassandra because of the plethora of drivers available in the market, so that they can write applications in their preferred programming language. Architects love Cassandra because they can store structured, semi-structured, and unstructured data in it. Database administrators love Cassandra because it comes with almost no maintenance overhead. Service managers love Cassandra because of the wonderful monitoring tools available in the market. CIOs love Cassandra because it gives value for their money. And Cassandra works!

An application based on Cassandra will be perfect only if its features are used in the right way, and this book is an attempt to guide the Cassandra community in this direction.

Resources for Article:

Further resources on this subject:
Cassandra Architecture [article]
Getting Up and Running with Cassandra [article]
Getting Started with Apache Cassandra [article]

Packt
14 Sep 2015
6 min read

Getting Started with Meteor

In this article, based on Marcelo Reyna's book Meteor Design Patterns, we will see that when you want to develop an application of any kind, you want to develop it fast. Why? Because the faster you develop, the better your return on investment will be (your investment is time, and the real cost is the money you could have produced with that time). There are two key ingredients of rapid web development: compilers and patterns. Compilers will help you so that you don’t have to type much, while patterns will increase the pace at which you solve common programming issues.

Here, we will quick-start compilers and explain how they relate to Meteor, a vast but simple topic. The compiler we will be looking at is CoffeeScript.

(For more resources related to this topic, see here.)

CoffeeScript for Meteor

CoffeeScript effectively replaces JavaScript. It is much faster to develop in CoffeeScript, because it simplifies the way you write functions, objects, arrays, logical statements, binding, and much more. All CoffeeScript files are saved with a .coffee extension. We will cover functions, objects, logical statements, and binding, since this is what we will use the most.

Objects and arrays

CoffeeScript gets rid of curly braces ({}), semicolons (;), and commas (,). This alone saves your fingers from repeating unnecessary strokes on the keyboard. CoffeeScript instead emphasizes the proper use of tabbing. Tabbing will not only make your code more readable (you are probably doing it already), but also be a key factor in making it work. Let’s look at some examples:

    #COFFEESCRIPT
    toolbox =
      hammer:true
      flashlight:false

Here, we are creating an object named toolbox that contains two keys: hammer and flashlight. The equivalent in JavaScript would be this:

    //JAVASCRIPT - OUTPUT
    var toolbox = {
      hammer:true,
      flashlight:false
    };

Much easier! As you can see, we have to tab to express that both the hammer and the flashlight properties are a part of toolbox.
The word var is not allowed in CoffeeScript because CoffeeScript automatically applies it for you. Let’s take a look at how we would create an array:

    #COFFEESCRIPT
    drill_bits = [
      "1/16 in"
      "5/64 in"
      "3/32 in"
      "7/64 in"
    ]

    //JAVASCRIPT - OUTPUT
    var drill_bits;
    drill_bits = ["1/16 in","5/64 in","3/32 in","7/64 in"];

Here, we can see we don’t need any commas, but we do need brackets to determine that this is an array.

Logical statements and operators

CoffeeScript also removes a lot of parentheses (()) in logical statements and functions. This makes the logic of the code much easier to understand at first glance. Let’s look at an example:

    #COFFEESCRIPT
    rating = "excellent" if five_star_rating

    //JAVASCRIPT - OUTPUT
    var rating;
    if(five_star_rating){
      rating = "excellent";
    }

In this example, we can clearly see that CoffeeScript is easier to read and write. It effectively removes all implied parentheses in any logical statement. Operators such as &&, ||, and !== are replaced by words. Here is a list of the operators that you will be using the most:

    CoffeeScript      JavaScript
    is                ===
    isnt              !==
    not               !
    and               &&
    or                ||
    true, yes, on     true
    false, no, off    false
    @, this           this

Let's look at a slightly more complex logical statement and see how it compiles:

    #COFFEESCRIPT
    # Suppose that "this" is an object that represents a person and their physical properties
    if @eye_color is "green"
      retina_scan = "passed"
    else
      retina_scan = "failed"

    //JAVASCRIPT - OUTPUT
    if(this.eye_color === "green"){
      retina_scan = "passed";
    } else {
      retina_scan = "failed";
    }

When using @eye_color to substitute for this.eye_color, notice that we do not need the dot (.).

Functions

JavaScript has a couple of ways of creating functions.
They look like this:

    //JAVASCRIPT
    //Save an anonymous function onto a variable
    var hello_world = function(){
      console.log("Hello World!");
    }
    //Declare a function
    function hello_world(){
      console.log("Hello World!");
    }

CoffeeScript uses -> instead of the function() keyword. The following example outputs a hello_world function:

    #COFFEESCRIPT
    #Create a function
    hello_world = ->
      console.log "Hello World!"

    //JAVASCRIPT - OUTPUT
    var hello_world;
    hello_world = function(){
      return console.log("Hello World!");
    }

Once again, we use a tab to express the content of the function, so there is no need for curly braces ({}). This means that you have to make sure you have all of the logic of the function tabbed under its namespace.

But what about our parameters? We can use (p1,p2) -> instead, where p1 and p2 are parameters. Let’s make our hello_world function say our name:

    #COFFEESCRIPT
    hello_world = (name) ->
      console.log "Hello #{name}"

    //JAVASCRIPT - OUTPUT
    var hello_world;
    hello_world = function(name) {
      return console.log("Hello " + name);
    }

In this example, we see how the special word function disappears, and how string interpolation works. CoffeeScript allows the programmer to easily add logic to a string by escaping the string with #{}. Unlike JavaScript, you can also add returns and reshape the way a string looks without breaking the code.

Binding

In Meteor, we will often find ourselves using the properties of binding within nested functions and callbacks. Function binding is very useful for these types of cases and helps avoid having to save data in additional variables. Let’s look at an example:

    #COFFEESCRIPT
    # Let's make the context of this equal to our toolbox object
    # this =
    #   hammer:true
    #   flashlight:false
    # Run a method with a callback
    Meteor.call "use_hammer", ->
      console.log this

In this case, the this object will refer to a top-level object, such as the browser window. That's not useful at all.
Let’s bind it now: #COFFEESCRIPT # Let’s make the context of this equal to our toolbox object # this = # hammer:true # flashlight:false # Run a method with a callback Meteor.call “use_hammer”, => console.log this The key difference is the use of =>instead of the expected ->sign fordefining the function. This will ensure that the callback'sthis object contains the context of the executing function. The resulting compiled script is as follows: //JAVASCRIPT Meteor.call(“use_hammer”, (function(_this) { return function() { returnConsole.log(_this); }; })(this)); CoffeeScript will improve your code and help you write codefaster. Still, itis not flawless. When you start combining functions with nested arrays, things can get complex and difficult to read, especially when functions are constructed with multiple parameters. Let’s look at an ugly query: #COFFEESCRIPT People.update sibling: $in:[“bob”,”bill”] , limit:1 -> console.log “success!” There are a few ways ofexpressing the difference between two different parameters of a function, but by far the easiest to understand. We place a comma one indentation before the next object. Go to coffeescript.org and play around with the language by clicking on the try coffeescript link. Summary We can now program faster because we have tools such as CoffeeScript, Jade, and Stylus to help us. We also seehow to use templates, helpers, and events to make our frontend work with Meteor. Resources for Article: Further resources on this subject: Why Meteor Rocks! [article] Function passing [article] Meteor.js JavaScript Framework: Why Meteor Rocks! [article]
Packt
31 Mar 2015
16 min read

Dealing with Legacy Code

In this article by Arun Ravindran, author of the book Django Best Practices and Design Patterns, we will discuss the following topics: Reading a Django code base Discovering relevant documentation Incremental changes versus full rewrites Writing tests before changing code Legacy database integration (For more resources related to this topic, see here.) It sounds exciting when you are asked to join a project. Powerful new tools and cutting-edge technologies might await you. However, quite often, you are asked to work with an existing, possibly ancient, codebase. To be fair, Django has not been around for that long. However, projects written for older versions of Django are sufficiently different to cause concern. Sometimes, having the entire source code and documentation might not be enough. If you are asked to recreate the environment, then you might need to fumble with the OS configuration, database settings, and running services locally or on the network. There are so many pieces to this puzzle that you might wonder how and where to start. Understanding the Django version used in the code is a key piece of information. As Django evolved, everything from the default project structure to the recommended best practices have changed. Therefore, identifying which version of Django was used is a vital piece in understanding it. Change of Guards Sitting patiently on the ridiculously short beanbags in the training room, the SuperBook team waited for Hart. He had convened an emergency go-live meeting. Nobody understood the "emergency" part since go live was at least 3 months away. Madam O rushed in holding a large designer coffee mug in one hand and a bunch of printouts of what looked like project timelines in the other. Without looking up she said, "We are late so I will get straight to the point. In the light of last week's attacks, the board has decided to summarily expedite the SuperBook project and has set the deadline to end of next month. Any questions?" 
"Yeah," said Brad, "Where is Hart?" Madam O hesitated and replied, "Well, he resigned. Being the head of IT security, he took moral responsibility of the perimeter breach." Steve, evidently shocked, was shaking his head. "I am sorry," she continued, "But I have been assigned to head SuperBook and ensure that we have no roadblocks to meet the new deadline." There was a collective groan. Undeterred, Madam O took one of the sheets and began, "It says here that the Remote Archive module is the most high-priority item in the incomplete status. I believe Evan is working on this." "That's correct," said Evan from the far end of the room. "Nearly there," he smiled at others, as they shifted focus to him. Madam O peered above the rim of her glasses and smiled almost too politely. "Considering that we already have an extremely well-tested and working Archiver in our Sentinel code base, I would recommend that you leverage that instead of creating another redundant system." "But," Steve interrupted, "it is hardly redundant. We can improve over a legacy archiver, can't we?" "If it isn't broken, then don't fix it", replied Madam O tersely. He said, "He is working on it," said Brad almost shouting, "What about all that work he has already finished?" "Evan, how much of the work have you completed so far?" asked O, rather impatiently. "About 12 percent," he replied looking defensive. Everyone looked at him incredulously. "What? That was the hardest 12 percent" he added. O continued the rest of the meeting in the same pattern. Everybody's work was reprioritized and shoe-horned to fit the new deadline. As she picked up her papers, readying to leave she paused and removed her glasses. "I know what all of you are thinking... literally. But you need to know that we had no choice about the deadline. All I can tell you now is that the world is counting on you to meet that date, somehow or other." Putting her glasses back on, she left the room. 
"I am definitely going to bring my tinfoil hat," said Evan loudly to himself. Finding the Django version Ideally, every project will have a requirements.txt or setup.py file at the root directory, and it will have the exact version of Django used for that project. Let's look for a line similar to this: Django==1.5.9 Note that the version number is exactly mentioned (rather than Django>=1.5.9), which is called pinning. Pinning every package is considered a good practice since it reduces surprises and makes your build more deterministic. Unfortunately, there are real-world codebases where the requirements.txt file was not updated or even completely missing. In such cases, you will need to probe for various tell-tale signs to find out the exact version. Activating the virtual environment In most cases, a Django project would be deployed within a virtual environment. Once you locate the virtual environment for the project, you can activate it by jumping to that directory and running the activated script for your OS. For Linux, the command is as follows: $ source venv_path/bin/activate Once the virtual environment is active, start a Python shell and query the Django version as follows: $ python >>> import django >>> print(django.get_version()) 1.5.9 The Django version used in this case is Version 1.5.9. Alternatively, you can run the manage.py script in the project to get a similar output: $ python manage.py --version 1.5.9 However, this option would not be available if the legacy project source snapshot was sent to you in an undeployed form. If the virtual environment (and packages) was also included, then you can easily locate the version number (in the form of a tuple) in the __init__.py file of the Django directory. For example: $ cd envs/foo_env/lib/python2.7/site-packages/django $ cat __init__.py VERSION = (1, 5, 9, 'final', 0) ... 
If all these methods fail, then you will need to go through the release notes of the past Django versions to determine the identifiable changes (for example, the AUTH_PROFILE_MODULE setting has been deprecated since Version 1.5) and match them to your legacy code. Once you pinpoint the correct Django version, you can move on to analyzing the code.

Where are the files? This is not PHP

One of the most difficult ideas to get used to, especially if you are from the PHP or ASP.NET world, is that the source files are not located in your web server's document root directory, which is usually named wwwroot or public_html. Additionally, there is no direct relationship between the code's directory structure and the website's URL structure.

In fact, you will find that your Django website's source code is stored in an obscure path such as /opt/webapps/my-django-app. Why is this? Among many good reasons, it is often more secure to move your confidential data outside your public webroot. This way, a web crawler would not be able to accidentally stumble into your source code directory.

Starting with urls.py

Even if you have access to the entire source code of a Django site, figuring out how it works across various apps can be daunting. It is often best to start from the root urls.py URLconf file since it is literally a map that ties every request to the respective views.

With normal Python programs, I often start reading from the start of its execution, say, from the top-level main module or wherever the __main__ check idiom starts. In the case of Django applications, I usually start with urls.py since it is easier to follow the flow of execution based on the various URL patterns a site has.

In Linux, you can use the following find command to locate the settings.py file and the corresponding line specifying the root urls.py:

    $ find . -iname settings.py -exec grep -H 'ROOT_URLCONF' {} \;
    ./projectname/settings.py:ROOT_URLCONF = 'projectname.urls'

    $ ls projectname/urls.py
    projectname/urls.py

Jumping around the code

Reading code sometimes feels like browsing the web without the hyperlinks. When you encounter a function or variable defined elsewhere, you will need to jump to the file that contains that definition. Some IDEs can do this automatically for you as long as you tell them which files to track as part of the project.

If you use Emacs or Vim instead, then you can create a TAGS file to quickly navigate between files. Go to the project root and run a tool called Exuberant Ctags as follows:

    find . -iname "*.py" -print | etags -

This creates a file called TAGS that contains the location information, that is, where every syntactic unit such as a class or function is defined. In Emacs, you can find the definition of the tag where your cursor (or point, as it is called in Emacs) is at using the M-. command.

While using a tag file is extremely fast for large code bases, it is quite basic and is not aware of a virtual environment (where most definitions might be located). An excellent alternative is to use the elpy package in Emacs. It can be configured to detect a virtual environment. Jumping to the definition of a syntactic element uses the same M-. command. However, the search is not restricted to the tag file. So, you can even jump to a class definition within the Django source code seamlessly.

Understanding the code base

It is quite rare to find legacy code with good documentation. Even if you do, the documentation might be out of sync with the code in subtle ways that can lead to further issues. Often, the best guide to understand the application's functionality is the executable test cases and the code itself.

The official Django documentation has been organized by versions at https://docs.djangoproject.com.
On any page, you can quickly switch to the corresponding page in the previous versions of Django with a selector on the bottom right-hand section of the page: In the same way, documentation for any Django package hosted on readthedocs.org can also be traced back to its previous versions. For example, you can select the documentation of django-braces all the way back to v1.0.0 by clicking on the selector on the bottom left-hand section of the page: Creating the big picture Most people find it easier to understand an application if you show them a high-level diagram. While this is ideally created by someone who understands the workings of the application, there are tools that can create very helpful high-level depiction of a Django application. A graphical overview of all models in your apps can be generated by the graph_models management command, which is provided by the django-command-extensions package. As shown in the following diagram, the model classes and their relationships can be understood at a glance: Model classes used in the SuperBook project connected by arrows indicating their relationships This visualization is actually created using PyGraphviz. This can get really large for projects of even medium complexity. Hence, it might be easier if the applications are logically grouped and visualized separately. PyGraphviz Installation and Usage If you find the installation of PyGraphviz challenging, then don't worry, you are not alone. Recently, I faced numerous issues while installing on Ubuntu, starting from Python 3 incompatibility to incomplete documentation. To save your time, I have listed the steps that worked for me to reach a working setup. 
On Ubuntu, you will need the following packages installed to install PyGraphviz: $ sudo apt-get install python3.4-dev graphviz libgraphviz-dev pkg-config Now activate your virtual environment and run pip to install the development version of PyGraphviz directly from GitHub, which supports Python 3: $ pip install git+http://github.com/pygraphviz/pygraphviz.git#egg=pygraphviz Next, install django-extensions and add it to your INSTALLED_APPS. Now, you are all set. Here is a sample usage to create a GraphViz dot file for just two apps and to convert it to a PNG image for viewing: $ python manage.py graph_models app1 app2 > models.dot $ dot -Tpng models.dot -o models.png Incremental change or a full rewrite? Often, you would be handed over legacy code by the application owners in the earnest hope that most of it can be used right away or after a couple of minor tweaks. However, reading and understanding a huge and often outdated code base is not an easy job. Unsurprisingly, most programmers prefer to work on greenfield development. In the best case, the legacy code ought to be easily testable, well documented, and flexible to work in modern environments so that you can start making incremental changes in no time. In the worst case, you might recommend discarding the existing code and go for a full rewrite. Or, as it is commonly decided, the short-term approach would be to keep making incremental changes, and a parallel long-term effort might be underway for a complete reimplementation. A general rule of thumb to follow while taking such decisions is—if the cost of rewriting the application and maintaining the application is lower than the cost of maintaining the old application over time, then it is recommended to go for a rewrite. Care must be taken to account for all the factors, such as time taken to get new programmers up to speed, the cost of maintaining outdated hardware, and so on. 
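The rule of thumb above boils down to a single comparison over a chosen time horizon. Here is a minimal sketch of that arithmetic; the function name and all the figures are hypothetical and chosen only to illustrate how the horizon changes the answer. Real estimates would also need to include the other factors mentioned, such as onboarding time and hardware costs.

```python
def rewrite_is_cheaper(rewrite_cost, new_yearly_maintenance,
                       old_yearly_maintenance, years):
    """Compare the total cost of a rewrite against keeping the legacy code.

    All costs are in the same (arbitrary) unit, for example person-days.
    """
    cost_of_rewrite = rewrite_cost + new_yearly_maintenance * years
    cost_of_keeping = old_yearly_maintenance * years
    return cost_of_rewrite < cost_of_keeping


# Hypothetical figures: a 200-unit rewrite that lowers yearly
# maintenance from 70 units to 20 units.
print(rewrite_is_cheaper(200, 20, 70, years=3))  # 260 vs 210
print(rewrite_is_cheaper(200, 20, 70, years=5))  # 300 vs 350
```

With these made-up numbers the rewrite does not pay off within three years but does within five, which is why the phrase "over time" in the rule matters.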
Sometimes, the complexity of the application domain becomes a huge barrier against a rewrite, since a lot of knowledge learnt in the process of building the older code gets lost. Often, this dependency on the legacy code is a sign of poor design in the application like failing to externalize the business rules from the application logic. The worst form of a rewrite you can probably undertake is a conversion, or a mechanical translation from one language to another without taking any advantage of the existing best practices. In other words, you lost the opportunity to modernize the code base by removing years of cruft. Code should be seen as a liability not an asset. As counter-intuitive as it might sound, if you can achieve your business goals with a lesser amount of code, you have dramatically increased your productivity. Having less code to test, debug, and maintain can not only reduce ongoing costs but also make your organization more agile and flexible to change. Code is a liability not an asset. Less code is more maintainable. Irrespective of whether you are adding features or trimming your code, you must not touch your working legacy code without tests in place. Write tests before making any changes In the book Working Effectively with Legacy Code, Michael Feathers defines legacy code as, simply, code without tests. He elaborates that with tests one can easily modify the behavior of the code quickly and verifiably. In the absence of tests, it is impossible to gauge if the change made the code better or worse. Often, we do not know enough about legacy code to confidently write a test. Michael recommends writing tests that preserve and document the existing behavior, which are called characterization tests. Unlike the usual approach of writing tests, while writing a characterization test, you will first write a failing test with a dummy output, say X, because you don't know what to expect. 
When the test harness fails with an error, such as "Expected output X but got Y", then you will change your test to expect Y. So, now the test will pass, and it becomes a record of the code's existing behavior. Note that we might record buggy behavior as well. After all, this is unfamiliar code. Nevertheless, writing such tests are necessary before we start changing the code. Later, when we know the specifications and code better, we can fix these bugs and update our tests (not necessarily in that order). Step-by-step process to writing tests Writing tests before changing the code is similar to erecting scaffoldings before the restoration of an old building. It provides a structural framework that helps you confidently undertake repairs. You might want to approach this process in a stepwise manner as follows: Identify the area you need to make changes to. Write characterization tests focusing on this area until you have satisfactorily captured its behavior. Look at the changes you need to make and write specific test cases for those. Prefer smaller unit tests to larger and slower integration tests. Introduce incremental changes and test in lockstep. If tests break, then try to analyze whether it was expected. Don't be afraid to break even the characterization tests if that behavior is something that was intended to change. If you have a good set of tests around your code, then you can quickly find the effect of changing your code. On the other hand, if you decide to rewrite by discarding your code but not your data, then Django can help you considerably. Legacy databases There is an entire section on legacy databases in Django documentation and rightly so, as you will run into them many times. Data is more important than code, and databases are the repositories of data in most enterprises. You can modernize a legacy application written in other languages or frameworks by importing their database structure into Django. 
As an immediate advantage, you can use the Django admin interface to view and change your legacy data. Django makes this easy with the inspectdb management command, which looks as follows: $ python manage.py inspectdb > models.py This command, if run while your settings are configured to use the legacy database, can automatically generate the Python code that would go into your models file. Here are some best practices if you are using this approach to integrate to a legacy database: Know the limitations of Django ORM beforehand. Currently, multicolumn (composite) primary keys and NoSQL databases are not supported. Don't forget to manually clean up the generated models, for example, remove the redundant 'ID' fields since Django creates them automatically. Foreign Key relationships may have to be manually defined. In some databases, the auto-generated models will have them as integer fields (suffixed with _id). Organize your models into separate apps. Later, it will be easier to add the views, forms, and tests in the appropriate folders. Remember that running the migrations will create Django's administrative tables (django_* and auth_*) in the legacy database. In an ideal world, your auto-generated models would immediately start working, but in practice, it takes a lot of trial and error. Sometimes, the data type that Django inferred might not match your expectations. In other cases, you might want to add additional meta information such as unique_together to your model. Eventually, you should be able to see all the data that was locked inside that aging PHP application in your familiar Django admin interface. I am sure this will bring a smile to your face. Summary In this article, we looked at various techniques to understand legacy code. Reading code is often an underrated skill. But rather than reinventing the wheel, we need to judiciously reuse good working code whenever possible. Resources for Article: Further resources on this subject: So, what is Django? 
[article] Adding a developer with Django forms [article] Introduction to Custom Template Filters and Tags [article]
Packt
05 Feb 2015
12 min read

The Chain of Responsibility Pattern

In this article by Sakis Kasampalis, author of the book Mastering Python Design Patterns, we will see a detailed description of the Chain of Responsibility design pattern with the help of a real-life example as well as a software example. Also, its use cases and implementation are discussed. (For more resources related to this topic, see here.) When developing an application, most of the time we know which method should satisfy a particular request in advance. However, this is not always the case. For example, we can think of any broadcast computer network, such as the original Ethernet implementation [j.mp/wikishared]. In broadcast computer networks, all requests are sent to all nodes (broadcast domains are excluded for simplicity), but only the nodes that are interested in a sent request process it. All computers that participate in a broadcast network are connected to each other using a common medium such as the cable that connects the three nodes in the following figure: If a node is not interested or does not know how to handle a request, it can perform the following actions: Ignore the request and do nothing Forward the request to the next node The way in which the node reacts to a request is an implementation detail. However, we can use the analogy of a broadcast computer network to understand what the chain of responsibility pattern is all about. The Chain of Responsibility pattern is used when we want to give a chance to multiple objects to satisfy a single request, or when we don't know which object (from a chain of objects) should process a specific request in advance. The principle is the same as the following: There is a chain (linked list, tree, or any other convenient data structure) of objects. We start by sending a request to the first object in the chain. The object decides whether it should satisfy the request or not. The object forwards the request to the next object. This procedure is repeated until we reach the end of the chain. 
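The steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation discussed later in the article: the Handler, EvenHandler, and TextHandler names are hypothetical, the chain is a singly linked list, and forwarding stops as soon as one object satisfies the request (one common variant of the pattern).

```python
class Handler:
    """One link in the chain; it knows only about its successor."""

    def __init__(self, successor=None):
        self.successor = successor

    def can_handle(self, request):
        # Subclasses override this to claim the requests they understand.
        return False

    def handle(self, request):
        if self.can_handle(request):
            return '{} handled {!r}'.format(type(self).__name__, request)
        if self.successor is not None:
            # Not ours: forward the request to the next object in the chain.
            return self.successor.handle(request)
        return None  # end of the chain reached and nobody handled it


class EvenHandler(Handler):
    def can_handle(self, request):
        return isinstance(request, int) and request % 2 == 0


class TextHandler(Handler):
    def can_handle(self, request):
        return isinstance(request, str)


# The client only holds a reference to the head of the chain.
chain = EvenHandler(TextHandler())
print(chain.handle(4))
print(chain.handle('hi'))
print(chain.handle(3))
```

The client never needs to know that TextHandler exists; it sends every request to the head and lets the chain decide.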
At the application level, instead of talking about cables and network nodes, we can focus on objects and the flow of a request. The following figure, courtesy of www.sourcemaking.com [j.mp/smchain], shows how the client code sends a request to all processing elements (also known as nodes or handlers) of an application:

Note that the client code only knows about the first processing element, instead of having references to all of them, and each processing element only knows about its immediate next neighbor (called the successor), not about every other processing element. This is usually a one-way relationship, which in programming terms means a singly linked list in contrast to a doubly linked list; a singly linked list does not allow navigation in both directions, while a doubly linked list does. This chain organization is used for a good reason. It achieves decoupling between the sender (client) and the receivers (processing elements) [GOF95, page 254].

A real-life example

ATMs and, in general, any kind of machine that accepts/returns banknotes or coins (for example, a snack vending machine) use the chain of responsibility pattern. There is always a single slot for all banknotes, as shown in the following figure, courtesy of www.sourcemaking.com:

When a banknote is dropped, it is routed to the appropriate receptacle. When it is returned, it is taken from the appropriate receptacle [j.mp/smchain], [j.mp/c2chain]. We can think of the single slot as the shared communication medium and the different receptacles as the processing elements. The result contains cash from one or more receptacles. For example, in the preceding figure, we see what happens when we request $175 from the ATM.

A software example

I tried to find some good examples of Python applications that use the Chain of Responsibility pattern but I couldn't, most likely because Python programmers don't use this name.
So, my apologies, but I will use other programming languages as a reference. The servlet filters of Java are pieces of code that are executed before an HTTP request arrives at a target. When using servlet filters, there is a chain of filters. Each filter performs a different action (user authentication, logging, data compression, and so forth), and either forwards the request to the next filter until the chain is exhausted, or it breaks the flow if there is an error (for example, the authentication failed three consecutive times) [j.mp/soservl]. Apple's Cocoa and Cocoa Touch frameworks use Chain of Responsibility to handle events. When a view receives an event that it doesn't know how to handle, it forwards the event to its superview. This goes on until a view is capable of handling the event or the chain of views is exhausted [j.mp/chaincocoa]. Use cases By using the Chain of Responsibility pattern, we give a chance to a number of different objects to satisfy a specific request. This is useful when we don't know which object should satisfy a request in advance. An example is a purchase system. In purchase systems, there are many approval authorities. One approval authority might be able to approve orders up to a certain value, let's say $100. If the order is more than $100, the order is sent to the next approval authority in the chain that can approve orders up to $200, and so forth. Another case where Chain of Responsibility is useful is when we know that more than one object might need to process a single request. This is what happens in an event-based programming. A single event such as a left mouse click can be caught by more than one listener. It is important to note that the Chain of Responsibility pattern is not very useful if all the requests can be taken care of by a single processing element, unless we really don't know which element that is. The value of this pattern is the decoupling that it offers. 
Instead of having a many-to-many relationship between a client and all processing elements (and the same is true regarding the relationship between a processing element and all other processing elements), a client only needs to know how to communicate with the start (head) of the chain. The following figure demonstrates the difference between tight and loose coupling. The idea behind loosely coupled systems is to simplify maintenance and make it easier for us to understand how they function [j.mp/loosecoup]: Implementation There are many ways to implement Chain of Responsibility in Python, but my favorite implementation is the one by Vespe Savikko [j.mp/savviko]. Vespe's implementation uses dynamic dispatching in a Pythonic style to handle requests [j.mp/ddispatch]. Let's implement a simple event-based system using Vespe's implementation as a guide. The following is the UML class diagram of the system: The Event class describes an event. We'll keep it simple, so in our case an event has only name: class Event: def __init__(self, name): self.name = name def __str__(self): return self.name The Widget class is the core class of the application. The parent aggregation shown in the UML diagram indicates that each widget can have a reference to a parent object, which by convention, we assume is a Widget instance. Note, however, that according to the rules of inheritance, an instance of any of the subclasses of Widget (for example, an instance of MsgText) is also an instance of Widget. The default value of parent is None: class Widget: def __init__(self, parent=None): self.parent = parent The handle() method uses dynamic dispatching through hasattr() and getattr() to decide who is the handler of a specific request (event). If the widget that is asked to handle an event does not support it, there are two fallback mechanisms. If the widget has parent, then the handle() method of parent is executed. 
If the widget has no parent but a handle_default() method, handle_default() is executed:

    def handle(self, event):
        handler = 'handle_{}'.format(event)
        if hasattr(self, handler):
            method = getattr(self, handler)
            method(event)
        elif self.parent:
            self.parent.handle(event)
        elif hasattr(self, 'handle_default'):
            self.handle_default(event)

At this point, you might have realized why the Widget and Event classes are only associated (no aggregation or composition relationships) in the UML class diagram. The association is used to show that the Widget class "knows" about the Event class but does not have any strict references to it, since an event needs to be passed only as a parameter to handle().

MainWindow, MsgText, and SendDialog are all widgets with different behaviors. Not all three widgets are expected to be able to handle the same events, and even if they can handle the same event, they might behave differently. MainWindow can handle only the close and default events:

    class MainWindow(Widget):
        def handle_close(self, event):
            print('MainWindow: {}'.format(event))

        def handle_default(self, event):
            print('MainWindow Default: {}'.format(event))

SendDialog can handle only the paint event:

    class SendDialog(Widget):
        def handle_paint(self, event):
            print('SendDialog: {}'.format(event))

Finally, MsgText can handle only the down event:

    class MsgText(Widget):
        def handle_down(self, event):
            print('MsgText: {}'.format(event))

The main() function shows how we can create a few widgets and events, and how the widgets react to those events. All events are sent to all the widgets. Note the parent relationship of each widget. The sd object (an instance of SendDialog) has as its parent the mw object (an instance of MainWindow). However, not all objects need to have a parent that is an instance of MainWindow. For example, the msg object (an instance of MsgText) has the sd object as a parent:

    def main():
        mw = MainWindow()
        sd = SendDialog(mw)
        msg = MsgText(sd)

        for e in ('down', 'paint', 'unhandled', 'close'):
            evt = Event(e)
            print('\nSending event -{}- to MainWindow'.format(evt))
            mw.handle(evt)
            print('Sending event -{}- to SendDialog'.format(evt))
            sd.handle(evt)
            print('Sending event -{}- to MsgText'.format(evt))
            msg.handle(evt)

The following is the full code of the example (chain.py):

    class Event:
        def __init__(self, name):
            self.name = name

        def __str__(self):
            return self.name


    class Widget:
        def __init__(self, parent=None):
            self.parent = parent

        def handle(self, event):
            handler = 'handle_{}'.format(event)
            if hasattr(self, handler):
                method = getattr(self, handler)
                method(event)
            elif self.parent:
                self.parent.handle(event)
            elif hasattr(self, 'handle_default'):
                self.handle_default(event)


    class MainWindow(Widget):
        def handle_close(self, event):
            print('MainWindow: {}'.format(event))

        def handle_default(self, event):
            print('MainWindow Default: {}'.format(event))


    class SendDialog(Widget):
        def handle_paint(self, event):
            print('SendDialog: {}'.format(event))


    class MsgText(Widget):
        def handle_down(self, event):
            print('MsgText: {}'.format(event))


    def main():
        mw = MainWindow()
        sd = SendDialog(mw)
        msg = MsgText(sd)

        for e in ('down', 'paint', 'unhandled', 'close'):
            evt = Event(e)
            print('\nSending event -{}- to MainWindow'.format(evt))
            mw.handle(evt)
            print('Sending event -{}- to SendDialog'.format(evt))
            sd.handle(evt)
            print('Sending event -{}- to MsgText'.format(evt))
            msg.handle(evt)

    if __name__ == '__main__':
        main()

Executing chain.py gives us the following results:

    $ python3 chain.py

    Sending event -down- to MainWindow
    MainWindow Default: down
    Sending event -down- to SendDialog
    MainWindow Default: down
    Sending event -down- to MsgText
    MsgText: down

    Sending event -paint- to MainWindow
    MainWindow Default: paint
    Sending event -paint- to SendDialog
    SendDialog: paint
    Sending event -paint- to MsgText
    SendDialog: paint

    Sending event -unhandled- to MainWindow
    MainWindow Default: unhandled
    Sending event -unhandled- to SendDialog
    MainWindow Default: unhandled
    Sending event -unhandled- to MsgText
    MainWindow Default: unhandled

    Sending event -close- to MainWindow
    MainWindow: close
    Sending event -close- to SendDialog
    MainWindow: close
    Sending event -close- to MsgText
    MainWindow: close

There are some interesting things that we can see in the output. For instance, sending a down event to MainWindow ends up being handled by the default MainWindow handler. Another nice case is that although a close event cannot be handled directly by SendDialog and MsgText, all the close events end up being handled properly by MainWindow. That's the beauty of using the parent relationship as a fallback mechanism.

If you want to spend some more creative time on the event example, you can replace the simple print statements and add some actual behavior to the listed events. Of course, you are not limited to the listed events. Just add your favorite event and make it do something useful! Another exercise is to add a MsgText instance during runtime that has MainWindow as the parent. Is this hard? Do the same for an event (add a new event to an existing widget). Which is harder?

Summary

In this article, we covered the Chain of Responsibility design pattern. This pattern is useful for modeling requests or handling events when the number and type of handlers aren't known in advance. Examples of systems that fit well with Chain of Responsibility are event-based systems, purchase systems, and shipping systems.

In the Chain of Responsibility pattern, the sender has direct access to the first node of a chain. If the request cannot be satisfied by the first node, it is forwarded to the next node. This continues until either the request is satisfied by a node or the whole chain is traversed. This design is used to achieve loose coupling between the sender and the receiver(s).
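The forwarding behavior summarized above can be sketched in a few lines of Python. This is only an illustration, not code from the book: the Handler class, its accepted and successor attributes, and the request names are assumptions made for the example.

```python
class Handler:
    """One node of the chain (hypothetical class, for illustration only)."""

    def __init__(self, name, accepted, successor=None):
        self.name = name
        self.accepted = set(accepted)   # request types this node can satisfy
        self.successor = successor      # next node in the chain, or None

    def handle(self, request):
        # Satisfy the request here if possible, otherwise forward it.
        if request in self.accepted:
            return '{} handled {}'.format(self.name, request)
        if self.successor is not None:
            return self.successor.handle(request)
        # The whole chain was traversed without finding a handler.
        return None


# The sender only ever talks to the head of the chain.
tail = Handler('tail', ['close'])
middle = Handler('middle', ['paint'], successor=tail)
head = Handler('head', ['down'], successor=middle)

results = [head.handle(r) for r in ('down', 'paint', 'close', 'resize')]
for line in results:
    print(line)
```

The sender never learns which node answered, which is the loose coupling the pattern aims for; the last request falls off the end of the chain and yields None.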
ATMs are an example of Chain of Responsibility. The single slot that is used for all banknotes can be considered the head of the chain. From here, depending on the transaction, one or more receptacles are used to process the transaction. The receptacles can be considered the processing elements of the chain. Java's servlet filters use the Chain of Responsibility pattern to perform different actions (for example, compression and authentication) on an HTTP request. Apple's Cocoa frameworks use the same pattern to handle events such as button presses and finger gestures.

Resources for Article:

Further resources on this subject:
• Exploring Model View Controller [Article]
• Analyzing a Complex Dataset [Article]
• Automating Your System Administration and Deployment Tasks Over SSH [Article]

Packt
30 Dec 2014
13 min read

Middleware

In this article, Mario Casciaro, the author of the book Node.js Design Patterns, describes the importance of using the middleware pattern.

One of the most distinctive patterns in Node.js is definitely middleware. Unfortunately, it's also one of the most confusing for the inexperienced, especially for developers coming from the enterprise programming world. The reason for the disorientation is probably connected with the meaning of the term middleware, which in enterprise architecture jargon represents the various software suites that help to abstract lower-level mechanisms such as OS APIs, network communications, memory management, and so on, allowing the developer to focus only on the business case of the application. In this context, the term middleware recalls topics such as CORBA, Enterprise Service Bus, Spring, and JBoss, but in its more generic meaning it can also define any kind of software layer that acts as glue between lower-level services and the application (literally, the software in the middle).

Middleware in Express

Express (http://expressjs.com) popularized the term middleware in the Node.js world, binding it to a very specific design pattern. In Express, in fact, a middleware represents a set of services, typically functions, that are organized in a pipeline and are responsible for processing incoming HTTP requests and the related responses. An Express middleware has the following signature:

    function(req, res, next) { ... }

Here, req is the incoming HTTP request, res is the response, and next is the callback to be invoked when the current middleware has completed its tasks, which in turn triggers the next middleware in the pipeline.
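To make the role of the next callback concrete, here is a minimal Python analogue of that signature. It is a sketch, not Express code: run_pipeline, the three middleware functions, and the dictionary-based request and response are all hypothetical stand-ins, and everything runs synchronously for brevity.

```python
def run_pipeline(middlewares, req, res):
    """Call each middleware in order; a middleware continues by calling next_()."""
    def dispatch(index):
        if index == len(middlewares):
            return

        # next_ is handed to the middleware as its third argument.
        def next_():
            dispatch(index + 1)

        middlewares[index](req, res, next_)

    dispatch(0)


def access_log(req, res, next_):
    # Record the request, then hand over to the next middleware.
    res.setdefault('log', []).append('{} {}'.format(req['method'], req['url']))
    next_()


def parse_body(req, res, next_):
    # Turn the raw payload into something the handler can use directly.
    req['body'] = req.get('raw', '').strip()
    next_()


def handler(req, res, next_):
    res['body'] = 'Hello, {}'.format(req['body'])
    # Not calling next_() ends the pipeline here.


request = {'method': 'POST', 'url': '/greet', 'raw': ' world \n'}
response = {}
run_pipeline([access_log, parse_body, handler], request, response)
print(response)
```

The handler at the end sees an already parsed body and never has to know that logging happened, which is the separation the signature is designed to give.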
Examples of the tasks carried out by an Express middleware are as follows:

• Parsing the body of the request
• Compressing/decompressing requests and responses
• Producing access logs
• Managing sessions
• Providing Cross-Site Request Forgery (CSRF) protection

If we think about it, these are all tasks that are not strictly related to the main functionality of an application; rather, they are accessories, components that provide support to the rest of the application and allow the actual request handlers to focus only on their main business logic. Essentially, those tasks are software in the middle.

Middleware as a pattern

The technique used to implement middleware in Express is not new; in fact, it can be considered the Node.js incarnation of the Intercepting Filter pattern and the Chain of Responsibility pattern. In more generic terms, it also represents a processing pipeline, which reminds us of streams. Today, in Node.js, the word middleware is used well beyond the boundaries of the Express framework, and indicates a particular pattern whereby a set of processing units, filters, and handlers, in the form of functions, are connected to form an asynchronous sequence in order to perform preprocessing and postprocessing of any kind of data. The main advantage of this pattern is flexibility; in fact, it allows us to obtain a plugin infrastructure with incredibly little effort, providing an unobtrusive way of extending a system with new filters and handlers.

If you want to know more about the Intercepting Filter pattern, the following article is a good starting point: http://www.oracle.com/technetwork/java/interceptingfilter-142169.html. A nice overview of the Chain of Responsibility pattern is available at this URL: http://java.dzone.com/articles/design-patterns-uncovered-chain-of-responsibility.
The essential component of the pattern is the Middleware Manager, which is responsible for organizing and executing the middleware functions. The most important implementation details of the pattern are as follows:

• New middleware can be registered by invoking the use() function (the name of this function is a common convention in many implementations of this pattern, but we can choose any name). Usually, new middleware can only be appended at the end of the pipeline, but this is not a strict rule.
• When new data to process is received, the registered middleware is invoked in an asynchronous sequential execution flow. Each unit in the pipeline receives as input the result of the execution of the previous unit.
• Each middleware can decide to stop further processing of the data by simply not invoking its callback or by passing an error to the callback. An error situation usually triggers the execution of another sequence of middleware that is specifically dedicated to handling errors.

There is no strict rule on how the data is processed and propagated in the pipeline. The strategies include:

• Augmenting the data with additional properties or functions
• Replacing the data with the result of some kind of processing
• Maintaining the immutability of the data and always returning fresh copies as the result of the processing

The right approach depends on the way the Middleware Manager is implemented and on the type of processing carried out by the middleware itself.

Creating a middleware framework for ØMQ

Let's now demonstrate the pattern by building a middleware framework around the ØMQ (http://zeromq.org) messaging library.
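Before the ØMQ walkthrough continues, the implementation details listed above can be sketched generically in Python. This is an illustration under stated assumptions rather than the article's implementation: MiddlewareManager, use_error(), and the sample middleware functions are hypothetical names, the data is augmented in place (the first of the strategies above), and execution is synchronous for brevity.

```python
class MiddlewareManager:
    """Hypothetical manager: a main pipeline plus an error pipeline."""

    def __init__(self):
        self.middleware = []
        self.error_middleware = []

    def use(self, fn):
        # New middleware is appended at the end of the pipeline.
        self.middleware.append(fn)

    def use_error(self, fn):
        # Assumed helper for registering error-handling middleware.
        self.error_middleware.append(fn)

    def run(self, data):
        def step(index, err=None):
            if err is not None:
                # An error switches execution to the error pipeline.
                for handler in self.error_middleware:
                    handler(err, data)
                return
            if index < len(self.middleware):
                self.middleware[index](
                    data, lambda err=None: step(index + 1, err))

        step(0)
        return data


def tag(data, next_):
    data['tags'] = ['seen']
    next_()


def validate(data, next_):
    if 'user' not in data:
        next_(ValueError('missing user'))   # stop by passing an error
    else:
        next_()


def greet(data, next_):
    data['greeting'] = 'hi ' + data['user']
    next_()


def record_error(err, data):
    data['error'] = str(err)


manager = MiddlewareManager()
manager.use(tag)
manager.use(validate)
manager.use(greet)
manager.use_error(record_error)

ok = manager.run({'user': 'ann'})
bad = manager.run({})
print(ok)
print(bad)
```

In the second run, greet is never called: validate passes an error to its callback, so the remaining middleware is skipped and only the error handler runs.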
ØMQ (also known as ZMQ, or ZeroMQ) provides a simple interface for exchanging atomic messages across the network using a variety of protocols; it shines for its performance, and its basic set of abstractions is specifically built to facilitate the implementation of custom messaging architectures. For this reason, ØMQ is often chosen to build complex distributed systems.

The interface of ØMQ is pretty low-level; it only allows us to use strings and binary buffers for messages, so any encoding or custom formatting of data has to be implemented by the users of the library.

In the next example, we are going to build a middleware infrastructure to abstract the preprocessing and postprocessing of the data passing through a ØMQ socket, so that we can transparently work with JSON objects but also seamlessly compress the messages traveling over the wire.

Before continuing with the example, please make sure to install the ØMQ native libraries following the instructions at this URL: http://zeromq.org/intro:get-the-software. Any version in the 4.0 branch should be enough for working on this example.

The Middleware Manager

The first step to build a middleware infrastructure around ØMQ is to create a component that is responsible for executing the middleware pipeline when a new message is received or sent. For this purpose, let's create a new module called zmqMiddlewareManager.js and start defining it:

    function ZmqMiddlewareManager(socket) {
      this.socket = socket;
      this.inboundMiddleware = [];    //[1]
      this.outboundMiddleware = [];
      var self = this;
      socket.on('message', function(message) {    //[2]
        self.executeMiddleware(self.inboundMiddleware, {
          data: message
        });
      });
    }
    module.exports = ZmqMiddlewareManager;

This first code fragment defines a new constructor for our new component. It accepts a ØMQ socket as an argument and:

1. Creates two empty lists that will contain our middleware functions, one for the inbound messages and another one for the outbound messages.
2. Immediately starts listening for the new messages coming from the socket by attaching a new listener to the message event. In the listener, we process the inbound message by executing the inboundMiddleware pipeline.

The next method of the ZmqMiddlewareManager prototype is responsible for executing the middleware when a new message is sent through the socket:

    ZmqMiddlewareManager.prototype.send = function(data) {
      var self = this;
      var message = { data: data };
      self.executeMiddleware(self.outboundMiddleware, message, function() {
        self.socket.send(message.data);
      });
    };

This time the message is processed using the filters in the outboundMiddleware list and then passed to socket.send() for the actual network transmission.

Now, we need a small method to append new middleware functions to our pipelines; we already mentioned that such a method is conventionally called use():

    ZmqMiddlewareManager.prototype.use = function(middleware) {
      if(middleware.inbound) {
        this.inboundMiddleware.push(middleware.inbound);
      }
      if(middleware.outbound) {
        this.outboundMiddleware.unshift(middleware.outbound);
      }
    };

Each middleware comes in pairs; in our implementation it's an object that contains two properties, inbound and outbound, that contain the middleware functions to be added to the respective list.

It's important to observe here that the inbound middleware is pushed to the end of the inboundMiddleware list, while the outbound middleware is inserted at the beginning of the outboundMiddleware list. This is because complementary inbound/outbound middleware functions usually need to be executed in an inverted order. For example, if we want to decompress and then deserialize an inbound message using JSON, it means that for the outbound, we should instead first serialize and then compress.

It's important to understand that this convention for organizing the middleware in pairs is not strictly part of the general pattern, but only an implementation detail of our specific example.
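The effect of push() versus unshift() on execution order can be checked with a small Python analogue. The class and helper names below (PairedMiddlewareManager, tracing_pair) are hypothetical, and the middleware functions only record when they run instead of transforming a message.

```python
class PairedMiddlewareManager:
    """Hypothetical Python analogue of the use() method described above."""

    def __init__(self):
        self.inbound = []
        self.outbound = []

    def use(self, middleware):
        # Like push(): inbound functions run in registration order.
        if 'inbound' in middleware:
            self.inbound.append(middleware['inbound'])
        # Like unshift(): outbound functions run in reverse registration order.
        if 'outbound' in middleware:
            self.outbound.insert(0, middleware['outbound'])


def tracing_pair(name):
    """Build an inbound/outbound pair that only records when it runs."""
    return {
        'inbound': lambda trace: trace.append(name + ':in'),
        'outbound': lambda trace: trace.append(name + ':out'),
    }


manager = PairedMiddlewareManager()
manager.use(tracing_pair('zlib'))
manager.use(tracing_pair('json'))

inbound_trace, outbound_trace = [], []
for fn in manager.inbound:
    fn(inbound_trace)
for fn in manager.outbound:
    fn(outbound_trace)

print(inbound_trace)
print(outbound_trace)
```

Registering zlib before json gives decompress-then-parse on the way in and serialize-then-compress on the way out, without the caller having to register the two directions separately.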
Now, it's time to define the core of our component, the function that is responsible for executing the middleware:

    ZmqMiddlewareManager.prototype.executeMiddleware = function(middleware, arg, finish) {
      var self = this;
      (function iterator(index) {
        if(index === middleware.length) {
          return finish && finish();
        }
        middleware[index].call(self, arg, function(err) {
          if(err) {
            console.log('There was an error: ' + err.message);
          }
          iterator(++index);
        });
      })(0);
    };

The preceding code should look very familiar; in fact, it is a simple implementation of the asynchronous sequential iteration pattern. Each function in the middleware array received as input is executed one after the other, and the same arg object is provided as an argument to each middleware function; this is the trick that makes it possible to propagate the data from one middleware to the next. At the end of the iteration, the finish() callback is invoked.

Please note that for brevity we are not supporting an error middleware pipeline. Normally, when a middleware function propagates an error, another set of middleware specifically dedicated to handling errors is executed. This can be easily implemented using the same technique that we are demonstrating here.

A middleware to support JSON messages

Now that we have implemented our Middleware Manager, we can create a pair of middleware functions to demonstrate how to process inbound and outbound messages. As we said, one of the goals of our middleware infrastructure is having a filter that serializes and deserializes JSON messages, so let's create a new middleware to take care of this.
In a new module called middleware.js, let's include the following code:

    module.exports.json = function() {
      return {
        inbound: function(message, next) {
          message.data = JSON.parse(message.data.toString());
          next();
        },
        outbound: function(message, next) {
          message.data = new Buffer(JSON.stringify(message.data));
          next();
        }
      };
    };

The json middleware that we just created is very simple:

• The inbound middleware deserializes the message received as input and assigns the result back to the data property of message, so that it can be further processed along the pipeline
• The outbound middleware serializes any data found in message.data

Please note how the middleware supported by our framework is quite different from the one used in Express; this is totally normal and a perfect demonstration of how we can adapt this pattern to fit our specific need.

Using the ØMQ middleware framework

We are now ready to use the middleware infrastructure that we just created. To do that, we are going to build a very simple application, with a client sending a ping to a server at regular intervals and the server echoing back the message received.

From an implementation perspective, we are going to rely on a request/reply messaging pattern using the req/rep socket pair provided by ØMQ (http://zguide.zeromq.org/page:all#Ask-and-Ye-Shall-Receive). We will then wrap the sockets with our zmqMiddlewareManager to get all the advantages of the middleware infrastructure that we built, including the middleware for serializing/deserializing JSON messages.

The server

Let's start by creating the server side (server.js).
In the first part of the module, we initialize our components:

    var zmq = require('zmq');
    var ZmqMiddlewareManager = require('./zmqMiddlewareManager');
    var middleware = require('./middleware');
    var reply = zmq.socket('rep');
    reply.bind('tcp://127.0.0.1:5000');

In the preceding code, we loaded the required dependencies and bound a ØMQ 'rep' (reply) socket to a local port. Next, we initialize our middleware:

    var zmqm = new ZmqMiddlewareManager(reply);
    zmqm.use(middleware.zlib());
    zmqm.use(middleware.json());

We created a new ZmqMiddlewareManager object and then added two middlewares, one for compressing/decompressing the messages and another one for parsing/serializing JSON messages. For brevity, we did not show the implementation of the zlib middleware.

Now we are ready to handle a request coming from the client. We will do this by simply adding another middleware, this time using it as a request handler:

    zmqm.use({
      inbound: function(message, next) {
        console.log('Received: ', message.data);
        if(message.data.action === 'ping') {
          this.send({action: 'pong', echo: message.data.echo});
        }
        next();
      }
    });

Since this last middleware is defined after the zlib and json middlewares, we can transparently use the decompressed and deserialized message that is available in the message.data variable. On the other hand, any data passed to send() will be processed by the outbound middleware, which in our case will serialize and then compress the data.
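Since the zlib middleware is not shown, the following Python sketch illustrates what such a pair could look like and how it combines with a JSON pair. It is not the book's implementation: zlib_middleware, json_middleware, and run are hypothetical names, and only the standard library zlib and json modules are used.

```python
import json
import zlib


def zlib_middleware():
    """Hypothetical compression pair: decompress inbound, compress outbound."""
    def inbound(message, next_):
        message['data'] = zlib.decompress(message['data'])
        next_()

    def outbound(message, next_):
        message['data'] = zlib.compress(message['data'])
        next_()

    return {'inbound': inbound, 'outbound': outbound}


def json_middleware():
    """Hypothetical JSON pair: parse inbound bytes, serialize outbound objects."""
    def inbound(message, next_):
        message['data'] = json.loads(message['data'].decode('utf-8'))
        next_()

    def outbound(message, next_):
        message['data'] = json.dumps(message['data']).encode('utf-8')
        next_()

    return {'inbound': inbound, 'outbound': outbound}


def run(functions, message):
    """Run the functions one after the other, each continuing via next_()."""
    def step(index):
        if index < len(functions):
            functions[index](message, lambda: step(index + 1))

    step(0)
    return message


pairs = [zlib_middleware(), json_middleware()]        # registration order
inbound = [p['inbound'] for p in pairs]               # decompress, then parse
outbound = [p['outbound'] for p in reversed(pairs)]   # serialize, then compress

wire = run(outbound, {'data': {'action': 'ping', 'echo': 42}})['data']
received = run(inbound, {'data': wire})['data']
print(type(wire).__name__, received)
```

What travels over the wire is a compressed byte string, while the code on either side of the pipeline only ever sees plain objects.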
The client

On the client side of our little application, client.js, we will first have to initiate a new ØMQ req (request) socket connected to port 5000, the one used by our server:

    var zmq = require('zmq');
    var ZmqMiddlewareManager = require('./zmqMiddlewareManager');
    var middleware = require('./middleware');
    var request = zmq.socket('req');
    request.connect('tcp://127.0.0.1:5000');

Then, we need to set up our middleware framework in the same way that we did for the server:

    var zmqm = new ZmqMiddlewareManager(request);
    zmqm.use(middleware.zlib());
    zmqm.use(middleware.json());

Next, we create an inbound middleware to handle the responses coming from the server:

    zmqm.use({
      inbound: function(message, next) {
        console.log('Echoed back: ', message.data);
        next();
      }
    });

In the preceding code, we simply intercept any inbound response and print it to the console. Finally, we set up a timer to send some ping requests at regular intervals, always using the zmqMiddlewareManager to get all the advantages of our middleware:

    setInterval(function() {
      zmqm.send({action: 'ping', echo: Date.now()});
    }, 1000);

We can now try our application by first starting the server:

    node server

We can then start the client with the following command:

    node client

At this point, we should see the client sending messages and the server echoing them back. Our middleware framework did its job; it allowed us to decompress/compress and deserialize/serialize our messages transparently, leaving the handlers free to focus on their business logic!

Summary

In this article, we learned about the middleware pattern and its various facets, and we also saw how to create a middleware framework and how to use it.

Resources for Article:

Further resources on this subject:
• Selecting and initializing the database [article]
• Exploring streams [article]
• So, what is Node.js? [article]

Packt
19 Dec 2014
30 min read

Performance Optimization

In this article, written by Mark Kerzner and Sujee Maniyam, the authors of HBase Design Patterns, we will talk about how to write high-performance and scalable HBase applications. In particular, we will take a look at the following topics:

• The bulk loading of data into HBase
• Profiling HBase applications
• Tips to get good performance on writes
• Tips to get good performance on reads

Loading bulk data into HBase

When deploying HBase for the first time, we usually need to import a significant amount of data. This is called initial loading or bootstrapping. There are three methods that can be used to import data into HBase:

• Using the Java API to insert data into HBase. This can be done in a single client, using single or multiple threads.
• Using MapReduce to insert data in parallel (this approach also uses the Java API).
• Using MapReduce to generate HBase store files in parallel in bulk and then import them into HBase directly. (This approach does not require the use of the API; it does not require code and is very efficient.)

On comparing the three methods speed-wise, we have the following order:

    Java client < MapReduce insert < HBase file import

The Java client and MapReduce use HBase APIs to insert data. MapReduce runs on multiple machines and can exploit parallelism. However, both of these methods go through the write path in HBase. Importing HBase files directly, however, skips the usual write path. HBase files already have data in the correct format that HBase understands. That's why importing them is much faster than using MapReduce and the Java client.

We covered the Java API earlier. Let's start with how to insert data using MapReduce.

Importing data into HBase using MapReduce

MapReduce is the distributed processing engine of Hadoop. Usually, programs read/write data from HDFS. Luckily, HBase supports MapReduce.
HBase can be the source and the sink for MapReduce programs. A source means MapReduce programs can read from HBase, and sink means results from MapReduce can be sent to HBase. The various sources and sinks for MapReduce can be summarized as follows:

Scenario | Source | Sink | Description
1 | HDFS | HDFS | This is a typical MapReduce method that reads data from HDFS and also sends the results to HDFS.
2 | HDFS | HBase | This imports the data from HDFS into HBase. It's a very common method that is used to import data into HBase for the first time.
3 | HBase | HBase | Data is read from HBase and written to it. It is most likely that these will be two separate HBase clusters. It's usually used for backups and mirroring.

Importing data from HDFS into HBase

Let's say we have lots of data in HDFS and want to import it into HBase. We are going to write a MapReduce program that reads from HDFS and inserts data into HBase. This is depicted in the second scenario in the table we just saw.

Now, we'll be setting up the environment for the following discussion. In addition, you can find the code and the data for this discussion in our GitHub repository at https://github.com/elephantscale/hbase-book.

The dataset we will use is the sensor data. Our (imaginary) sensor data is stored in HDFS as CSV (comma-separated values) text files. This is how their format looks:

    Sensor_id, max temperature, min temperature

Here is some sample data:

    sensor11,90,70
    sensor22,80,70
    sensor31,85,72
    sensor33,75,72

We have two sample files (sensor-data1.csv and sensor-data2.csv) in our repository under the /data directory. Feel free to inspect them.

The first thing we have to do is copy these files into HDFS.
Create a directory in HDFS as follows:

    $ hdfs dfs -mkdir hbase-import

Now, copy the files into HDFS:

    $ hdfs dfs -put sensor-data* hbase-import/

Verify that the files exist as follows:

    $ hdfs dfs -ls hbase-import

We are ready to insert this data into HBase. Note that we are designing the table to match the CSV files we are loading for ease of use. Our row key is sensor_id. We have one column family and we call it f (short for family). We will store two columns, max temperature and min temperature, in this column family.

Pig for MapReduce

Pig allows you to write MapReduce programs at a very high level, and inserting data into HBase is just as easy. Here's a Pig script that reads the sensor data from HDFS and writes it in HBase:

    -- ## hdfs-to-hbase.pig
    data = LOAD 'hbase-import/' using PigStorage(',') as (sensor_id:chararray, max:int, min:int);
    -- describe data;
    -- dump data;

Now, store the data in hbase://sensors using the following line of code:

    STORE data INTO 'hbase://sensors' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:max,f:min');

In the first command, we load data from the hbase-import directory in HDFS. The schema for the data is defined as follows:

    Sensor_id : chararray (string)
    max : int
    min : int

The describe and dump statements can be used to inspect the data; in Pig, describe will give you the structure of the data object you have, and dump will output all the data to the terminal.

The final STORE command is the one that inserts the data into HBase. Let's analyze how it is structured:

• INTO 'hbase://sensors': This tells Pig to connect to the sensors HBase table.
• org.apache.pig.backend.hadoop.hbase.HBaseStorage: This is the Pig class that will be used to write in HBase. Pig has adapters for multiple data stores.
• The first field in the tuple, sensor_id, will be used as a row key.
• We are specifying the column names for the max and min fields (f:max and f:min, respectively).
Note that we have to specify the column family (f:) to qualify the columns.

Before running this script, we need to create an HBase table called sensors. We can do this from the HBase shell, as follows:

    $ hbase shell
    hbase> create 'sensors', 'f'
    hbase> quit

Then, run the Pig script as follows:

    $ pig hdfs-to-hbase.pig

Now watch the console output. Pig will execute the script as a MapReduce job. Even though we are only importing two small files here, we can insert a fairly large amount of data by exploiting the parallelism of MapReduce.

At the end of the run, Pig will print out some statistics:

    Input(s):
    Successfully read 7 records (591 bytes) from: "hdfs://quickstart.cloudera:8020/user/cloudera/hbase-import"
    Output(s):
    Successfully stored 7 records in: "hbase://sensors"

Looks good! We should have seven rows in our HBase sensors table. We can inspect the table from the HBase shell with the following commands:

    $ hbase shell
    hbase> scan 'sensors'

This is how your output might look:

    ROW          COLUMN+CELL
    sensor11     column=f:max, timestamp=1412373703149, value=90
    sensor11     column=f:min, timestamp=1412373703149, value=70
    sensor22     column=f:max, timestamp=1412373703177, value=80
    sensor22     column=f:min, timestamp=1412373703177, value=70
    sensor31     column=f:max, timestamp=1412373703177, value=85
    sensor31     column=f:min, timestamp=1412373703177, value=72
    sensor33     column=f:max, timestamp=1412373703177, value=75
    sensor33     column=f:min, timestamp=1412373703177, value=72
    sensor44     column=f:max, timestamp=1412373703184, value=55
    sensor44     column=f:min, timestamp=1412373703184, value=42
    sensor45     column=f:max, timestamp=1412373703184, value=57
    sensor45     column=f:min, timestamp=1412373703184, value=47
    sensor55     column=f:max, timestamp=1412373703184, value=55
    sensor55     column=f:min, timestamp=1412373703184, value=42
    7 row(s) in 0.0820 seconds

There you go; you can see that seven rows have been inserted! With Pig, it was very easy. It took us just two lines of Pig script to do the import.

Java MapReduce

We have just demonstrated MapReduce using Pig, and you now know that Pig is a concise and high-level way to write MapReduce programs. This is demonstrated by our previous script, which is essentially two lines of Pig code. However, there are situations where you do want to use the Java API, and it would make more sense to use it than a Pig script. This can happen when you need Java to access Java libraries or do some other detailed tasks for which Pig is not a good match. For that, we have provided the Java version of the MapReduce code in our GitHub repository.

Using HBase's bulk loader utility

HBase is shipped with a bulk loader tool called ImportTsv that can import files from HDFS into HBase tables directly. It is very easy to use, and as a bonus, it uses MapReduce internally to process files in parallel. Perform the following steps to use ImportTsv:

1. Stage data files into HDFS (remember that the files are processed using MapReduce).
2. Create a table in HBase if required.
3. Run the import.

Staging data files into HDFS

The first step, staging data files into HDFS, has already been outlined in the previous section. The following sections explain the next two steps.

Creating an HBase table

We will do this from the HBase shell. A note on regions is in order here. Regions are shards created automatically by HBase. It is the regions that are responsible for the distributed nature of HBase. However, you need to pay some attention to them in order to ensure performance. If you put all the data in one region, you will cause what is called region hotspotting. What is especially nice about a bulk loader is that when creating a table, it lets you presplit the table into multiple regions.
Precreating regions will allow faster imports (because the insert requests will go out to multiple region servers). Here, we are creating a single column family:

    $ hbase shell
    hbase> create 'sensors', {NAME => 'f'}, {SPLITS => ['sensor20', 'sensor40', 'sensor60']}
    0 row(s) in 1.3940 seconds
    => Hbase::Table - sensors
    hbase> describe 'sensors'
    DESCRIPTION                                                   ENABLED
    'sensors', {NAME => 'f', DATA_BLOCK_ENCODING => 'NONE',       true
    BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS =>
    '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL =>
    'FOREVER', KEEP_DELETED_CELLS => 'false', BLOCKSIZE =>
    '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
    1 row(s) in 0.1140 seconds

We are creating regions here. There are exactly four of them because the three split keys divide the row-key space into four ranges. On inspecting the table in the HBase Master UI, we can see these regions, along with the Start Key and End Key values that we specified.

Run the import

OK, now it's time to insert data into HBase. To see the usage of ImportTsv, do the following:

    $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv

This will print the usage. The import itself is run as follows:

    $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min sensors hbase-import/

The following table explains what the parameters mean:

Parameter | Description
-Dimporttsv.separator | Here, our separator is a comma (,). The default value is tab (\t).
-Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min | This is where we map our input files into HBase tables. The first field, sensor_id, is our key, and we use HBASE_ROW_KEY to denote that. The rest we are inserting into column family f. The second field, max temp, maps to f:max. The last field, min temp, maps to f:min.
sensors | This is the table name.
hbase-import | This is the HDFS directory where the data files are located.

When we run this command, we will see that a MapReduce job is being kicked off.
This is how an import is parallelized. Also, from the console output, we can see that MapReduce is importing two files as follows:

    [main] mapreduce.JobSubmitter: number of splits:2

While the job is running, we can inspect the progress from YARN (or the JobTracker UI). One thing that we can note is that the MapReduce job only consists of mappers. This is because we are reading a bunch of files and inserting them into HBase directly. There is nothing to aggregate. So, there is no need for reducers.

After the job is done, inspect the counters and we can see this:

    Map-Reduce Framework
    Map input records=7
    Map output records=7

This tells us that mappers read seven records from the files and inserted seven records into HBase. Let's also verify the data in HBase:

    $ hbase shell
    hbase> scan 'sensors'
    ROW          COLUMN+CELL
    sensor11     column=f:max, timestamp=1409087465345, value=90
    sensor11     column=f:min, timestamp=1409087465345, value=70
    sensor22     column=f:max, timestamp=1409087465345, value=80
    sensor22     column=f:min, timestamp=1409087465345, value=70
    sensor31     column=f:max, timestamp=1409087465345, value=85
    sensor31     column=f:min, timestamp=1409087465345, value=72
    sensor33     column=f:max, timestamp=1409087465345, value=75
    sensor33     column=f:min, timestamp=1409087465345, value=72
    sensor44     column=f:max, timestamp=1409087465345, value=55
    sensor44     column=f:min, timestamp=1409087465345, value=42
    sensor45     column=f:max, timestamp=1409087465345, value=57
    sensor45     column=f:min, timestamp=1409087465345, value=47
    sensor55     column=f:max, timestamp=1409087465345, value=55
    sensor55     column=f:min, timestamp=1409087465345, value=42
    7 row(s) in 2.1180 seconds

Your output might vary slightly. We can see that seven rows are inserted, confirming the MapReduce counters!
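Why three split keys produce four regions, and which region each of the seven row keys falls into, can be worked out with a short Python sketch. This is only a model of the key ranges, not HBase code: region_index and region_range are hypothetical helpers, and the sketch assumes the usual HBase convention that a region covers its start key inclusively and its end key exclusively, with an empty key marking an open end.

```python
from bisect import bisect_right

# The split keys passed to SPLITS when the table was created.
split_keys = ['sensor20', 'sensor40', 'sensor60']


def region_index(row_key):
    # A key equal to a split key belongs to the region starting at that key,
    # so we count how many split keys are less than or equal to the row key.
    return bisect_right(split_keys, row_key)


def region_range(index):
    # An empty string stands for an open start or end of the key space.
    start = split_keys[index - 1] if index > 0 else ''
    end = split_keys[index] if index < len(split_keys) else ''
    return start, end


row_keys = ['sensor11', 'sensor22', 'sensor31', 'sensor33',
            'sensor44', 'sensor45', 'sensor55']
placement = {key: region_index(key) for key in row_keys}
region_count = len(split_keys) + 1

for key in row_keys:
    print(key, '-> region', placement[key], region_range(placement[key]))
```

With this sample data, the first three regions each receive rows and the last one (from sensor60 onward) stays empty, which shows why split keys should be chosen to match the real distribution of row keys.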
Let's take another quick look at the HBase UI, which is shown here:

As you can see, the inserts go to different regions. So, on an HBase cluster with many region servers, the load will be spread across the cluster. This is because we have presplit the table into regions.

Here are some questions to test your understanding. Run the same ImportTsv command again and see how many records are in the table. Do you get duplicates? Try to find the answer and explain why that is the correct answer, then check these in the GitHub repository (https://github.com/elephantscale/hbase-book).

Bulk import scenarios

Here are a few bulk import scenarios:

    Scenario: The data is already in HDFS and needs to be imported into HBase.
    Methods: The two methods that can be used to do this are as follows:
        - If the ImportTsv tool can work for you, then use it as it will save time in writing custom MapReduce code.
        - Sometimes, you might have to write a custom MapReduce job to import (for example, complex time series data, doing data mapping, and so on).
    Notes: It is probably a good idea to presplit the table before a bulk import. This spreads the insert requests across the cluster and results in a higher insert rate. If you are writing a custom MapReduce job, consider using a high-level MapReduce platform such as Pig or Hive. They are much more concise to write than the Java code.

    Scenario: The data is in another database (RDBMS/NoSQL) and you need to import it into HBase.
    Methods: Use a utility such as Sqoop to bring the data into HDFS and then use the tools outlined in the first scenario.
    Notes: Avoid writing MapReduce code that directly queries databases. Most databases cannot handle many simultaneous connections. It is best to bring the data into Hadoop (HDFS) first and then use MapReduce.

Profiling HBase applications

Just like any software development process, once we have our HBase application working correctly, we would want to make it faster.
At times, developers get too carried away and start optimizing before the application is finalized. There is a well-known rule, usually attributed to Donald Knuth, that premature optimization is the root of all evil. One of the sources for this rule is Scott Meyers' Effective C++.

We can perform some ad hoc profiling in our code by timing various function calls. Also, we can use profiling tools to pinpoint the trouble spots. Using profiling tools is highly encouraged for the following reasons:

- Profiling takes out the guesswork (and a good majority of developers' guesses are wrong).
- There is no need to modify the code. Manual profiling means that we have to insert instrumentation code all over the application, whereas profilers work by inspecting the runtime behavior.
- Most profilers have a nice and intuitive UI to visualize the program flow and time flow.

The authors use JProfiler. It is a pretty effective profiler. However, it is neither free nor open source. So, for the purpose of this article, we are going to show you some simple manual profiling, as follows:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserInsert {

        static String tableName = "users";
        static String familyName = "info";

        public static void main(String[] args) throws Exception {
            Configuration config = HBaseConfiguration.create();
            // change the following to connect to remote clusters
            // config.set("hbase.zookeeper.quorum", "localhost");

            // time how long it takes to connect to the table
            long t1a = System.currentTimeMillis();
            HTable htable = new HTable(config, tableName);
            long t1b = System.currentTimeMillis();
            System.out.println("Connected to HTable in : " + (t1b - t1a) + " ms");

            // time how long it takes to insert the sample users, one put at a time
            int total = 100;
            long t2a = System.currentTimeMillis();
            for (int i = 0; i < total; i++) {
                int userid = i;
                String email = "user-" + i + "@foo.com";
                String phone = "555-1234";

                byte[] key = Bytes.toBytes(userid);
                Put put = new Put(key);

                put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
                put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));
                htable.put(put);
            }
            long t2b = System.currentTimeMillis();
            System.out.println("inserted " + total + " users in " + (t2b - t2a) + " ms");
            htable.close();
        }
    }

The code we just saw inserts some sample user data into HBase. We are profiling two operations, that is, connection time and actual insert time. A sample run of the Java application yields the following:

    Connected to HTable in : 1139 ms
    inserted 100 users in 350 ms

We spent a lot of time connecting to HBase. This makes sense. The connection process has to go to ZooKeeper first and then to HBase. So, it is an expensive operation. How can we minimize the connection cost? The answer is by using connection pooling. Luckily for us, HBase comes with a connection pool manager. The Java class for this is HConnectionManager. It is very simple to use. Let's update our class to use HConnectionManager:

File name: hbase_dp.ch8.UserInsert2.java

    package hbase_dp.ch8;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserInsert2 {

        static String tableName = "users";
        static String familyName = "info";

        public static void main(String[] args) throws Exception {
            Configuration config = HBaseConfiguration.create();
            // change the following to connect to remote clusters
            // config.set("hbase.zookeeper.quorum", "localhost");

            // time the creation of the shared connection
            long t1a = System.currentTimeMillis();
            HConnection hConnection = HConnectionManager.createConnection(config);
            long t1b = System.currentTimeMillis();
            System.out.println("Connection manager in : " + (t1b - t1a) + " ms");

            // simulate the first 'connection'
            long t2a = System.currentTimeMillis();
            HTableInterface htable = hConnection.getTable(tableName);
            long t2b = System.currentTimeMillis();
            System.out.println("first connection in : " + (t2b - t2a) + " ms");

            // second connection
            long t3a = System.currentTimeMillis();
            HTableInterface htable2 = hConnection.getTable(tableName);
            long t3b = System.currentTimeMillis();
            System.out.println("second connection : " + (t3b - t3a) + " ms");

            int total = 100;
            long t4a = System.currentTimeMillis();
            for (int i = 0; i < total; i++) {
                int userid = i;
                String email = "user-" + i + "@foo.com";
                String phone = "555-1234";

                byte[] key = Bytes.toBytes(userid);
                Put put = new Put(key);

                put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
                put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));
                htable.put(put);
            }
            long t4b = System.currentTimeMillis();
            System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");

            // release the table handles before closing the shared connection
            htable.close();
            htable2.close();
            hConnection.close();
        }
    }

A sample run yields the following timings:

    Connection manager in : 98 ms
    first connection in : 808 ms
    second connection : 0 ms
    inserted 100 users in 393 ms

The first connection takes a long time, but then take a look at the time of the second connection. It is almost instant! If you are connecting to HBase from web applications (or interactive applications), use connection pooling.

More tips for high-performing HBase writes

Here we will discuss some techniques and best practices to improve writes in HBase.
Batch writes

Currently, in our code, each time we call htable.put(one_put), we make an RPC call to an HBase region server. This round-trip delay can be minimized if we call htable.put() with a bunch of put records. Then, with one round trip, we can insert a bunch of records into HBase. This is called batch puts. Here is an example of batch puts. Only the relevant section is shown for clarity. For the full code, see hbase_dp.ch8.UserInsert3.java:

    int total = 100;
    long t4a = System.currentTimeMillis();
    List<Put> puts = new ArrayList<>();
    for (int i = 0; i < total; i++) {
        int userid = i;
        String email = "user-" + i + "@foo.com";
        String phone = "555-1234";

        byte[] key = Bytes.toBytes(userid);
        Put put = new Put(key);

        put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
        put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));

        puts.add(put); // just add to the list
    }
    htable.put(puts); // do a batch put
    long t4b = System.currentTimeMillis();
    System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");

A sample run with a batch put is as follows:

    inserted 100 users in 48 ms

The same code with individual puts took around 350 milliseconds! Use batch writes when you can to minimize latency. Note that the HTableUtil class that comes with HBase implements some smart batching options for your use and enjoyment.

Setting memory buffers

We can control when the puts are flushed by setting the client write buffer option. Puts accumulate in this client-side buffer, and once the buffered data exceeds this setting, it is flushed, that is, sent to the region servers. The default setting is 2 MB. Its purpose is to limit how much data is held in the buffer before it is written out. There are two ways of setting this:

In hbase-site.xml (this setting will be cluster-wide):

    <property>
      <name>hbase.client.write.buffer</name>
      <value>8388608</value>   <!-- 8 MB -->
    </property>

In the application (only applies to that application):

    htable.setWriteBufferSize(1024*1024*10); // 10 MB

Keep in mind that a bigger buffer takes more memory on both the client side and the server side. As a practical guideline, estimate how much memory you can dedicate to the client and put the rest of the load on the cluster.

Turning off autoflush

If autoflush is enabled, each htable.put() call incurs a round-trip RPC call to HRegionServer. Turning autoflush off can reduce the number of round trips and decrease latency. To turn it off, use this code:

    htable.setAutoFlush(false);

The risk of turning off autoflush is that if the client crashes before the buffered data is sent to HBase, that data is lost. So, when would you want to do it? The answer is: when the danger of data loss is not important and speed is paramount. Also, see the batch write recommendations we saw previously.

Turning off WAL

Before we discuss this, we need to emphasize that the write-ahead log (WAL) is there to prevent data loss in the case of server crashes. By turning it off, we are bypassing this protection. Be very careful when choosing this. Bulk loading is one of the cases where turning off WAL might make sense. To turn off WAL, set it for each put:

    put.setDurability(Durability.SKIP_WAL);

More tips for high-performing HBase reads

So far, we looked at tips to write data into HBase. Now, let's take a look at some tips to read data faster.

The scan cache

When reading a large number of rows, it is better to set scan caching to a high number (in the hundreds or thousands). Otherwise, each row that is scanned will result in a trip to HRegionServer. This is especially encouraged for MapReduce jobs as they will likely consume a lot of rows sequentially.
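As a rough back-of-the-envelope illustration of why this matters, the number of round trips for a scan can be approximated as the row count divided by the caching value, rounded up. The Python sketch below is a simplified model under that assumption only; it ignores other limits (such as caps on result size) and region boundaries, and the function name scan_round_trips is made up for illustration.

```python
import math


def scan_round_trips(total_rows, caching):
    """Approximate number of fetches needed to scan total_rows rows."""
    if caching < 1:
        raise ValueError("caching must be at least 1")
    return math.ceil(total_rows / caching)


# Compare a few caching values for a one-million-row scan
for caching in (1, 100, 1000):
    print("caching =", caching, "->", scan_round_trips(1_000_000, caching), "round trips")
```

Under this model, going from a caching value of 1 to 1,000 cuts the round trips for a one-million-row scan from one million to one thousand, at the cost of holding up to 1,000 rows in client memory at a time.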
To set scan caching, use the following code:

    Scan scan = new Scan();
    scan.setCaching(1000);

Only read the families or columns needed

When fetching a row, by default, HBase returns all the families and all the columns. If you only care about one family or a few attributes, specifying them will save needless I/O. To specify a family, use this:

    scan.addFamily(Bytes.toBytes("family1"));

To specify columns, use this:

    scan.addColumn(Bytes.toBytes("family1"), Bytes.toBytes("col1"));

The block cache

When scanning a large number of rows sequentially (say in MapReduce), it is recommended that you turn off the block cache (for a scan, this is done with scan.setCacheBlocks(false)). Turning off the cache might seem completely counter-intuitive. However, caches are only effective when we repeatedly access the same rows. During a sequential scan, each row is read only once, so nothing is reused, and leaving the block cache on will introduce a lot of churn in the cache (new data is constantly brought into the cache and old data is evicted to make room for the new data). So, we have the following points to consider:

- Turn off the block cache for sequential scans
- Turn on the block cache for random/repeated access

Benchmarking or load testing HBase

Benchmarking is a good way to verify HBase's setup and performance. There are a few good benchmarks available:

- HBase's built-in benchmark
- The Yahoo Cloud Serving Benchmark (YCSB)
- JMeter for custom workloads

HBase's built-in benchmark

HBase's built-in benchmark is PerformanceEvaluation. To find its usage, use this:

    $ hbase org.apache.hadoop.hbase.PerformanceEvaluation

To perform a write benchmark, use this:

    $ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 5

Here we are using five threads and no MapReduce. To accurately measure the throughput, we need to presplit the table that the benchmark writes to, which is named TestTable:

    $ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --presplit=3 randomWrite 5

Here, the table is split in three ways. It is good practice to split the table into as many regions as the number of region servers. There are also read benchmarks along with a whole host of scan options.

YCSB

The YCSB is a comprehensive benchmark suite that works with many systems such as Cassandra, Accumulo, and HBase. Download it from GitHub, as follows:

    $ git clone git://github.com/brianfrankcooper/YCSB.git

Build it like this:

    $ mvn -DskipTests package

Create an HBase table to test against:

    $ hbase shell
    hbase> create 'ycsb', 'f1'

Now, copy hbase-site.xml for your cluster into the hbase/src/main/conf/ directory and run the benchmark:

    $ bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1 -p table=ycsb

YCSB offers lots of workloads and options. Please refer to its wiki page at https://github.com/brianfrankcooper/YCSB/wiki.

JMeter for custom workloads

The standard benchmarks will give you an idea of your HBase cluster's performance. However, nothing can substitute measuring your own workload. We want to measure at least the insert speed or the query speed. We also want to run a stress test so that we can measure the ceiling on how much load our HBase cluster can support. We can do some simple instrumentation as we did earlier too. However, there are tools such as JMeter that can help us with load testing. Please refer to the JMeter website and check out the Hadoop or HBase plugins for JMeter.

Monitoring HBase

Running any distributed system involves decent monitoring. HBase is no exception. Luckily, HBase has the following capabilities:

- HBase exposes a lot of metrics
- These metrics can be directly consumed by monitoring systems such as Ganglia
- We can also obtain these metrics in the JSON format via the REST interface and JMX

Monitoring is a big subject and we consider it part of HBase administration. So, in this section, we will give pointers to tools and utilities that allow you to monitor HBase.

Ganglia

Ganglia is a generic system monitor that can monitor hosts (such as CPU, disk usage, and so on).
The Hadoop stack has had a pretty good integration with Ganglia for some time now. HBase and Ganglia integration is set up by modern installers from Cloudera and Hortonworks. To enable Ganglia metrics, update the hadoop-metrics.properties file in the HBase configuration directory. Here's a sample file:

    hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    hbase.period=10
    hbase.servers=ganglia-server:PORT
    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    jvm.period=10
    jvm.servers=ganglia-server:PORT
    rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    rpc.period=10
    rpc.servers=ganglia-server:PORT

This file has to be copied to all the HBase servers (master servers as well as region servers). Here are some sample graphs from Ganglia (these are Wikimedia statistics, for example):

These graphs show cluster-wide resource utilization.

OpenTSDB

OpenTSDB is a scalable time series database. It can collect and visualize metrics on a large scale. To collect metrics, OpenTSDB uses collectors, which are lightweight agents that send metrics to the OpenTSDB server, and there is a collector library that can collect metrics from HBase. You can see all the collectors at http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html. An interesting factoid is that OpenTSDB is built on Hadoop/HBase.

Collecting metrics via the JMX interface

HBase exposes a lot of metrics via JMX. This page can be accessed from the web dashboard at http://<hbase master>:60010/jmx. For example, for an HBase instance that is running locally, it will be http://localhost:60010/jmx. Here is a sample screenshot of the JMX metrics via the web UI:

Here's a quick example of how to retrieve these metrics from the command line using curl:

    $ curl 'localhost:60010/jmx'

Since this is a web service, we can write a script/application in any language (Java, Python, or Ruby) to retrieve and inspect the metrics.
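As a minimal sketch of such a script, the Python code below parses a JMX JSON document and looks up one bean by name. It assumes the page returns a JSON object with a top-level "beans" array, which is worth verifying against your own cluster's output. The function names fetch_jmx and find_bean are made up for illustration, and the sample payload is a trimmed, invented stand-in so that the example runs without a cluster.

```python
import json
from urllib.request import urlopen


def fetch_jmx(url="http://localhost:60010/jmx"):
    """Download the JMX page and return the parsed JSON document."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def find_bean(jmx, name):
    """Return the first bean whose "name" equals name, or None if absent."""
    for bean in jmx.get("beans", []):
        if bean.get("name") == name:
            return bean
    return None


# Invented sample in the assumed shape; real output contains many more beans.
SAMPLE = json.loads("""
{"beans": [
  {"name": "java.lang:type=Memory",
   "HeapMemoryUsage": {"used": 52428800, "max": 1073741824}}
]}
""")

memory = find_bean(SAMPLE, "java.lang:type=Memory")
heap = memory["HeapMemoryUsage"]
print("heap used: %.1f%%" % (100.0 * heap["used"] / heap["max"]))
```

Against a live instance, replacing SAMPLE with the result of fetch_jmx() would feed the same lookup; a monitoring script could then compare the values against thresholds.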
Summary

In this article, you learned how to improve the performance of HBase applications. We looked at how to effectively load a large amount of data into HBase. You also learned about benchmarking and monitoring HBase and saw tips on how to do high-performing reads and writes.

Resources for Article:

Further resources on this subject:
- The HBase's Data Storage [article]
- Hadoop and HDInsight in a Heartbeat [article]
- Understanding the HBase Ecosystem [article]
Packt
19 Nov 2014
6 min read

Function passing

In this article by Simon Timms, the author of the book, Mastering JavaScript Design Patterns, we will cover function passing. In functional programming languages, functions are first-class citizens. Functions can be assigned to variables and passed around just like you would with any other variable. This is not entirely a foreign concept. Even languages such as C had function pointers that could be treated just like other variables. C# has delegates and, in more recent versions, lambdas. The latest release of Java has also added support for lambdas, as they have proven to be so useful.

JavaScript allows for functions to be treated as variables and even as objects and strings. In this way, JavaScript is functional in nature. Because of JavaScript's single-threaded nature, callbacks are a common convention and you can find them pretty much everywhere. Consider calling a function at a later time on a web page. This is done by setting a timeout on the window object as follows:

    setTimeout(function(){alert("Hello from the past")}, 5 * 1000);

The arguments for the setTimeout function are a function to call and a time to delay in milliseconds.

No matter the JavaScript environment in which you're working, it is almost impossible to avoid functions in the shape of callbacks. The asynchronous processing model of Node.js is highly dependent on being able to call a function and pass in something to be completed later. Making calls to external resources in a browser is also dependent on a callback to notify the caller that some asynchronous operation has completed.
In basic JavaScript, this looks like the following code:

    var xmlhttp = new XMLHttpRequest();
    xmlhttp.onreadystatechange = function () {
        if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
            //process returned data
        }
    };
    xmlhttp.open("GET", "http://some.external.resource", true);
    xmlhttp.send();

You may notice that we assign onreadystatechange before we even send the request. This is because assigning it later may result in a race condition in which the server responds before the function is attached to the ready state change. In this case, we've used an inline function to process the returned data. Because functions are first-class citizens, we can change this to look like the following code:

    var xmlhttp;

    function requestData() {
        xmlhttp = new XMLHttpRequest();
        xmlhttp.onreadystatechange = processData;
        xmlhttp.open("GET", "http://some.external.resource", true);
        xmlhttp.send();
    }

    function processData() {
        if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
            //process returned data
        }
    }

This is typically a cleaner approach and avoids performing complex processing in line with another function. However, you might be more familiar with the jQuery version of this, which looks something like this:

    $.getJSON('http://some.external.resource', function (json) {
        //process returned data
    });

In this case, the boilerplate of dealing with ready state changes is handled for you. There is even a convenient way to handle a failed request for data, as in the following code:

    $.ajax('http://some.external.resource', {
        success: function (json) {
            //process returned data
        },
        error: function () {
            //process failure
        },
        dataType: "json"
    });

In this case, we've passed an object into the ajax call, which defines a number of properties. Amongst these properties are function callbacks for success and failure. This method of passing numerous functions into another suggests a great way of providing expansion points for classes. Likely, you've seen this pattern in use before without even realizing it.
Passing functions into constructors as part of an options object is a commonly used approach to providing extension hooks in JavaScript libraries.

Implementation

In Westeros, the tourism industry is almost nonexistent. There are great difficulties with bandits killing tourists and tourists becoming entangled in regional conflicts. Nonetheless, some enterprising folks have started to advertise a grand tour of Westeros in which they will take those with the means on a tour of all the major attractions. From King's Landing to Eyrie, to the great mountains of Dorne, the tour will cover it all. In fact, a rather mathematically inclined member of the tourism board has taken to calling it a Hamiltonian tour, as it visits everywhere once.

The HamiltonianTour class accepts an options object in its constructor. This object contains the various places to which a callback can be attached. In our case, the interface for it would look something like the following TypeScript code:

    export class HamiltonianTourOptions {
        onTourStart: Function;
        onEntryToAttraction: Function;
        onExitFromAttraction: Function;
        onTourCompletion: Function;
    }

The full HamiltonianTour class looks like the following code:

    var HamiltonianTour = (function () {
        function HamiltonianTour(options) {
            this.options = options;
        }
        HamiltonianTour.prototype.StartTour = function () {
            if (this.options.onTourStart && typeof (this.options.onTourStart) === "function")
                this.options.onTourStart();
            this.VisitAttraction("King's Landing");
            this.VisitAttraction("Winterfell");
            this.VisitAttraction("Mountains of Dorne");
            this.VisitAttraction("Eyrie");
            if (this.options.onTourCompletion && typeof (this.options.onTourCompletion) === "function")
                this.options.onTourCompletion();
        };
        HamiltonianTour.prototype.VisitAttraction = function (AttractionName) {
            if (this.options.onEntryToAttraction && typeof (this.options.onEntryToAttraction) === "function")
                this.options.onEntryToAttraction(AttractionName);
            //do whatever one does in an attraction
            if (this.options.onExitFromAttraction && typeof (this.options.onExitFromAttraction) === "function")
                this.options.onExitFromAttraction(AttractionName);
        };
        return HamiltonianTour;
    })();

You can see in the preceding code how we check each option and then execute the callback as needed. The class can be used as follows:

    var tour = new HamiltonianTour({
        onEntryToAttraction: function (cityname) {
            console.log("I'm delighted to be in " + cityname);
        }
    });
    tour.StartTour();

The output of the preceding code will be:

    I'm delighted to be in King's Landing
    I'm delighted to be in Winterfell
    I'm delighted to be in Mountains of Dorne
    I'm delighted to be in Eyrie

Summary

In this article, we have learned about function passing. Passing functions is a great approach to solving a number of problems in JavaScript and tends to be used extensively by libraries such as jQuery and frameworks such as Express. It is so commonly adopted that using it provides no added barriers to your code's readability.

Resources for Article:

Further resources on this subject:
- Creating Java EE Applications [article]
- Meteor.js JavaScript Framework: Why Meteor Rocks! [article]
- Dart with JavaScript [article]
Packt
18 Sep 2014
8 min read

Redis in Autosuggest

In this article by Arun Chinnachamy, the author of Redis Applied Design Patterns, we are going to see how to use Redis to build a basic autocomplete or autosuggest server. Also, we will see how to build a faceting engine using Redis. To build such a system, we will use sorted sets and operations involving ranges and intersections. To summarize, we will focus on the following topics in this article:

- Autocompletion for words
- Multiword autosuggestion using a sorted set
- Faceted search using sets and operations such as union and intersection

Autosuggest systems

These days autosuggest is seen in virtually all e-commerce stores in addition to a host of others. Almost all websites are utilizing this functionality in one way or another, from a basic website search to programming IDEs. The ease of use afforded by autosuggest has led every major website, from Google and Amazon to Wikipedia, to use this feature to make it easier for users to navigate to where they want to go. The primary metric for any autosuggest system is how fast we can respond with suggestions to a user's query. Usability research studies have found that the response time should be under a second to ensure that a user's attention and flow of thought are preserved. Redis is ideally suited for this task as it is one of the fastest data stores in the market right now.

Let's see how to design such a structure and use Redis to build an autosuggest engine. We can tweak Redis to suit individual use case scenarios, ranging from the simple to the complex. For instance, if we want only to autocomplete a word, we can enable this functionality by using a sorted set. Let's see how to perform single word completion and then we will move on to more complex scenarios, such as phrase completion.

Word completion in Redis

In this section, we want to provide a simple word completion feature through Redis. We will use a sorted set for this exercise.
The reason behind using a sorted set is that insertions and rank lookups are O(log(N)) operations. While it is commonly known that in a sorted set, elements are arranged based on the score, what is not widely acknowledged is that elements with the same score are arranged lexicographically. This is going to form the basis for our word completion feature. Let's look at a scenario in which we have the following words to autocomplete: jack, smith, scott, jacob, and jackeline.

In order to complete a word, we need to use n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech; here, every word is written out as its leading n-grams, that is, its prefixes. To find out more, check http://en.wikipedia.org/wiki/N-gram. For example, the n-grams of jack are as follows:

    j
    ja
    jac
    jack$

In order to signify the completed word, we can use a delimiter such as * or $. To add the word into a sorted set, we will be using ZADD in the following way:

    > zadd autocomplete 0 j
    > zadd autocomplete 0 ja
    > zadd autocomplete 0 jac
    > zadd autocomplete 0 jack$

Likewise, we need to add all the words we want to index for autocompletion. Once we are done, our sorted set will look as follows:

    > zrange autocomplete 0 -1
    1) "j"
    2) "ja"
    3) "jac"
    4) "jack$"
    5) "jacke"
    6) "jackel"
    7) "jackeli"
    8) "jackelin"
    9) "jackeline$"
    10) "jaco"
    11) "jacob$"
    12) "s"
    13) "sc"
    14) "sco"
    15) "scot"
    16) "scott$"
    17) "sm"
    18) "smi"
    19) "smit"
    20) "smith$"

Now, we will use ZRANK and ZRANGE operations over the sorted set to achieve our desired functionality.
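The indexing scheme above, and the lookup that the ZRANK and ZRANGE commands perform, can be modeled without Redis in a short Python sketch. This is an illustration under assumptions: a sorted Python list stands in for a sorted set whose members all have the same score (Python's string ordering matches byte-wise ordering for these ASCII words), and the names prefixes, build_index, and complete are made up. Note that applying the scheme to jackeline also stores the bare prefix jack.

```python
from bisect import bisect_left


def prefixes(word, terminator="$"):
    """All proper prefixes of word, then the full word with the terminator."""
    return [word[:i] for i in range(1, len(word))] + [word + terminator]


def build_index(words):
    """Sorted, duplicate-free list standing in for an equal-score sorted set."""
    entries = set()
    for word in words:
        entries.update(prefixes(word))
    return sorted(entries)


def complete(index, query, window=50, terminator="$"):
    """Locate the query (like ZRANK), read a window (like ZRANGE), keep whole words."""
    start = bisect_left(index, query)          # position of the query in the ordering
    results = []
    for entry in index[start:start + window]:  # at most `window` entries from there
        if not entry.startswith(query):
            break                              # past the block of matching entries
        if entry.endswith(terminator):
            results.append(entry[:-len(terminator)])
    return results


index = build_index(["jack", "smith", "scott", "jacob", "jackeline"])
print(prefixes("jack"))
print(complete(index, "jac"))
print(complete(index, "smi"))
```

Because all entries sharing a prefix sit next to each other in the ordering, the loop can stop at the first entry that no longer starts with the query instead of filtering the whole window.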
To autocomplete for jac, we have to execute the following commands:

    > zrank autocomplete jac
    2
    > zrange autocomplete 3 50
    1) "jack$"
    2) "jacke"
    3) "jackel"
    4) "jackeli"
    5) "jackelin"
    6) "jackeline$"
    7) "jaco"
    8) "jacob$"
    9) "s"
    10) "sc"
    11) "sco"
    12) "scot"
    13) "scott$"
    14) "sm"
    15) "smi"
    16) "smit"
    17) "smith$"

Another example on completing smi is as follows:

    > zrank autocomplete smi
    17
    > zrange autocomplete 18 50
    1) "smit"
    2) "smith$"

Now, in our program, we have to do the following tasks:

- Iterate through the results set.
- Check if the word starts with the query and only use the words with $ as the last character.

Though it looks like a lot of operations are performed, ZRANK is an O(log(N)) operation and ZRANGE is O(log(N) + M), where M is the number of elements returned. Therefore, there should be virtually no problem in handling a huge list of words. When it comes to memory usage, with the scheme shown here we will have at most n elements for every word (n - 1 prefixes plus the terminated word), where n is the number of characters in the word. For M words, we will have at most M × avg(n) records, where avg(n) is the average number of characters in a word. The more prefixes the words in our universe share, the lower the memory usage. In order to conserve memory, we can use the EXPIRE command to expire unused long tail autocomplete terms. Note that EXPIRE applies to whole keys, so this works when terms are stored under separate keys, as in the multiword scheme that follows.

Multiword phrase completion

In the previous section, we have seen how to use autocomplete for a single word. However, in most real world scenarios, we will have to deal with multiword phrases. This is much more difficult to achieve as there are a few inherent challenges involved:

- Suggesting a phrase for all matching words. For instance, the same manufacturer has a lot of models available. We have to ensure that we list all models if a user decides to search for a manufacturer by name.
- Ordering the results based on overall popularity and relevance of the match instead of ordering lexicographically.

The following screenshot shows the typical autosuggest box, which you find in popular e-commerce portals. This feature improves the user experience and also reduces spelling errors:

For this case, we will use a sorted set along with hashes. We will use sorted sets to store the n-grams of the indexed data, and then get the complete title from a hash. Instead of storing all the n-grams in the same sorted set, we will store each n-gram in its own sorted set. Let's look at the following scenario in which we have model names of mobile phones along with their popularity:

For this set, we will create multiple sorted sets. Let's take Apple iPhone 5S:

    ZADD a 9 apple_iphone_5s
    ZADD ap 9 apple_iphone_5s
    ZADD app 9 apple_iphone_5s
    ZADD appl 9 apple_iphone_5s
    ZADD apple 9 apple_iphone_5s
    ZADD i 9 apple_iphone_5s
    ZADD ip 9 apple_iphone_5s
    ZADD iph 9 apple_iphone_5s
    ZADD ipho 9 apple_iphone_5s
    ZADD iphon 9 apple_iphone_5s
    ZADD iphone 9 apple_iphone_5s
    ZADD 5 9 apple_iphone_5s
    ZADD 5s 9 apple_iphone_5s
    HSET titles apple_iphone_5s "Apple iPhone 5S"

In the preceding scenario, we have used every n-gram value as the key of a sorted set and created a hash that holds the original title. Likewise, we have to add all the titles into our index.

Searching in the index

Now that we have indexed the titles, we are ready to perform a search. Consider a situation where a user is querying with the term apple. We want to show the user the five best suggestions based on the popularity of the product. Here's how we can achieve this:

    > zrevrange apple 0 4 withscores
    1) "apple_iphone_5s"
    2) 9.0
    3) "apple_iphone_5c"
    4) 6.0

As the elements inside the sorted set are ordered by the element score, we get the matches ordered by the popularity which we inserted. To get the original title, type the following command:

    > hmget titles apple_iphone_5s
    1) "Apple iPhone 5S"

In the preceding case, the query was a single word. Now imagine that the user types multiple words such as Samsung nex, and we have to suggest the autocomplete as Samsung Galaxy Nexus.
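As a plain-Python model of this index (no Redis involved), the sketch below stores one dictionary per prefix, mapping item IDs to popularity, and answers a multiword query by intersecting the per-prefix sets. That intersection is the operation the Redis command introduced next performs on the server. The helper names add_title and search are made up for illustration; the three products and their scores are the ones that appear in the command output in this article.

```python
from collections import defaultdict


def add_title(index, titles, item_id, title, popularity):
    """Index every prefix of every word of the title (models the ZADD and HSET calls)."""
    titles[item_id] = title
    for word in title.lower().split():
        for i in range(1, len(word) + 1):
            index[word[:i]][item_id] = popularity


def search(index, titles, query, limit=5):
    """Intersect the per-prefix sets of all query words, best score first."""
    words = query.lower().split()
    if not words:
        return []
    common = set(index.get(words[0], {}))
    for word in words[1:]:
        common &= set(index.get(word, {}))
    # Like an aggregate of MAX: an item's score is its highest score across the inputs.
    scored = [(max(index[w][item] for w in words), item) for item in common]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(titles[item], score) for score, item in scored[:limit]]


index = defaultdict(dict)  # prefix -> {item_id: popularity}
titles = {}                # item_id -> display title
add_title(index, titles, "apple_iphone_5s", "Apple iPhone 5S", 9)
add_title(index, titles, "apple_iphone_5c", "Apple iPhone 5C", 6)
add_title(index, titles, "samsung_galaxy_nexus", "Samsung Galaxy Nexus", 7)

print(search(index, titles, "apple"))
print(search(index, titles, "Samsung nex"))
```

Keeping the display titles in a separate mapping mirrors the hash used in the Redis version: the per-prefix sets stay small because they hold only IDs and scores.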
To achieve this, we will use ZINTERSTORE as follows:

    > zinterstore samsung_nex 2 samsung nex aggregate max

The general syntax of the command is as follows:

    ZINTERSTORE destination numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE SUM|MIN|MAX]

This computes the intersection of the sorted sets given by the specified keys and stores the result in a destination. It is mandatory to provide the number of input keys before passing the input keys and the other (optional) arguments. For more information about ZINTERSTORE, visit http://redis.io/commands/ZINTERSTORE.

The previous command, zinterstore samsung_nex 2 samsung nex aggregate max, computes the intersection of two sorted sets, samsung and nex, and stores it in another sorted set, samsung_nex. To see the result, type the following commands:

    > zrevrange samsung_nex 0 4 withscores
    1) samsung_galaxy_nexus
    2) 7
    > hmget titles samsung_galaxy_nexus
    1) Samsung Galaxy Nexus

If you want to cache the result for multiword queries and remove it automatically, use the EXPIRE command to set an expiry on the temporary keys.

Summary

In this article, we have seen how to perform autosuggest and faceted searches using Redis. We have also seen how sorted sets and sets work, and how Redis can be used as the backend of a simple faceting and autosuggest system to make it very fast.

Resources used for creating the article: Credit for the featured tiger image: Big Cat Facts - Tiger
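The indexing and lookup steps described in this article can be tried out without a Redis server. The following is a minimal sketch in plain Python, not the article's implementation: dictionaries stand in for the sorted sets and the titles hash, the helper names index_title and suggest are hypothetical, and the three phones and their scores are the ones that appear in the command output above.

```python
from collections import defaultdict

# Stand-ins for Redis: each "sorted set" is a dict of member -> score,
# and `titles` plays the role of the `titles` hash.
sorted_sets = defaultdict(dict)
titles = {}


def index_title(title, popularity):
    """Mimic one ZADD per word prefix, plus HSET titles <member> <title>."""
    member = title.lower().replace(" ", "_")
    titles[member] = title
    for word in title.lower().split():
        for end in range(1, len(word) + 1):
            sorted_sets[word[:end]][member] = popularity


def suggest(query, limit=5):
    """Mimic ZINTERSTORE ... AGGREGATE MAX followed by ZREVRANGE 0 limit-1."""
    terms = query.lower().split()
    if not terms:
        return []
    # Members present in every queried prefix set (the intersection).
    common = set(sorted_sets.get(terms[0], {}))
    for term in terms[1:]:
        common &= set(sorted_sets.get(term, {}))
    # AGGREGATE MAX keeps the highest score seen across the input sets.
    scored = {m: max(sorted_sets[t][m] for t in terms) for m in common}
    # Descending score, ties in descending member order, as ZREVRANGE does.
    ranked = sorted(scored, key=lambda m: (scored[m], m), reverse=True)
    return [titles[m] for m in ranked[:limit]]


index_title("Apple iPhone 5S", 9)
index_title("Apple iPhone 5C", 6)
index_title("Samsung Galaxy Nexus", 7)

print(suggest("apple"))
print(suggest("Samsung nex"))
```

With a real server, each dictionary update would be a ZADD and suggest would issue ZINTERSTORE into a temporary key; the sketch only illustrates the data flow.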

Packt
17 Sep 2014
13 min read

Index, Item Sharding, and Projection in DynamoDB

Understanding the secondary index and projections should go hand in hand, because a secondary index cannot be used efficiently without specifying a projection. In this article by Uchit Vyas and Prabhakaran Kuppusamy, authors of DynamoDB Applied Design Patterns, we will take a look at local and global secondary indexes, and at projection and its usage with indexes.

The use of projection in DynamoDB is similar to that in traditional databases. However, here are a few things to watch out for:

  • Whenever a DynamoDB table is created, it is mandatory to create a primary key, which can be of a simple type (hash key only) or of a complex type (hash and range key).
  • For the specified primary key, an index will be created (we call this index the primary index).
  • Along with this primary key index, the user is allowed to create up to five secondary indexes per table.
  • There are two kinds of secondary index. The first is a local secondary index (in which the hash key of the index must be the same as that of the table) and the second is the global secondary index (in which the hash key can be any field).
  • In both of these secondary index types, the range key can be any field that the user needs to create an index for.

Secondary indexes

A quick question: when writing a query in any database, why does including the primary key field in the query (especially in the where condition) return results much faster than leaving it out? This is because, in most databases, an index is created automatically for the primary key field. This is the case with DynamoDB also. This index is called the primary index of the table. No customization is possible with the primary index, so it is seldom discussed. In order to make retrieval faster, the frequently retrieved attributes need to be made part of an index.
However, a DynamoDB table can have only one primary index, and that index can have a maximum of two attributes (hash and range key). So, for faster retrieval, the user should be able to create user-defined indexes. An index created by the user is called a secondary index. Similar to the table key schema, the secondary index also has a key schema. Based on the key schema attributes, the secondary index can be either a local or a global secondary index. Once a secondary index exists, the items in the index are rearranged on every item insertion into the table, provided the item contains both the index's hash and range key attributes.

Projection

Once we have an understanding of the secondary index, we are all set to learn about projection. While creating the secondary index, it is mandatory to specify the hash and range attributes on which the index is based. If a query on a local secondary index asks for one or more attributes apart from these (assuming that none of them are projected into the index), DynamoDB has to fetch them from the base table. This consumes additional throughput capacity and has comparatively higher latency. The following is the table (with some data) that is used to store book information:

Here are a few more details about the table:

  • The BookTitle attribute is the hash key of the table and of the local secondary index
  • The Edition attribute is the range key of the table
  • The PubDate attribute is the range key of the index (let's call this index Idx_PubDate)

Local secondary index

While creating the secondary index, the hash and range keys of the table and of the index will be inserted into the index; optionally, the user can specify what other attributes need to be added.
There are three kinds of projection possible in DynamoDB:

  • KEYS_ONLY: Using this, the index consists of the hash and range key values of the table and the index
  • INCLUDE: Using this, the index consists of the attributes in KEYS_ONLY plus other non-key attributes that we specify
  • ALL: Using this, the index consists of all of the attributes from the source table

The following code shows the creation of a local secondary index named Idx_PubDate with BookTitle as the hash key (which is a must in the case of a local secondary index), PubDate as the range key, and the KEYS_ONLY projection:

    import java.util.ArrayList;
    import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
    import com.amazonaws.services.dynamodbv2.model.KeyType;
    import com.amazonaws.services.dynamodbv2.model.LocalSecondaryIndex;
    import com.amazonaws.services.dynamodbv2.model.Projection;

    private static LocalSecondaryIndex getLocalSecondaryIndex() {
        // Key schema of the index: same hash key as the table, PubDate as range key
        ArrayList<KeySchemaElement> indexKeySchema =
            new ArrayList<KeySchemaElement>();
        indexKeySchema.add(new KeySchemaElement()
            .withAttributeName("BookTitle")
            .withKeyType(KeyType.HASH));
        indexKeySchema.add(new KeySchemaElement()
            .withAttributeName("PubDate")
            .withKeyType(KeyType.RANGE));

        // Project only the table and index key attributes into the index
        LocalSecondaryIndex lsi = new LocalSecondaryIndex()
            .withIndexName("Idx_PubDate")
            .withKeySchema(indexKeySchema)
            .withProjection(new Projection()
                .withProjectionType("KEYS_ONLY"));
        return lsi;
    }

Using the KEYS_ONLY projection type creates the smallest possible index, and using ALL creates the biggest possible index. We will discuss the trade-offs between these projection types a little later. Going back to our example, let us assume that we are using the KEYS_ONLY projection type, so none of the attributes (other than the three key attributes mentioned previously) are projected into the index. So the index will look as follows:

You may notice that the row order of the index is almost the same as that of the table (except for the second and third rows). Here, you can observe one point: the index records are grouped primarily by the hash key, and the records that have the same hash key are then ordered by the range key of the index.
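The three projection types and the ordering rule can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the book rows below are made-up sample data (the article's table image is not reproduced here), the function names project and build_index are hypothetical, and the code models the behavior described above rather than calling DynamoDB.

```python
TABLE_KEYS = ("BookTitle", "Edition")   # table hash key and range key
INDEX_KEYS = ("BookTitle", "PubDate")   # index hash key and range key


def project(item, projection_type, non_key_attributes=()):
    """Return the attributes of `item` that would be copied into the index."""
    # dict.fromkeys keeps order and removes the duplicated BookTitle.
    key_attributes = list(dict.fromkeys(TABLE_KEYS + INDEX_KEYS))
    if projection_type == "KEYS_ONLY":
        wanted = key_attributes
    elif projection_type == "INCLUDE":
        wanted = key_attributes + list(non_key_attributes)
    elif projection_type == "ALL":
        wanted = list(item)
    else:
        raise ValueError("unknown projection type: " + projection_type)
    return {name: item[name] for name in wanted if name in item}


def build_index(items, projection_type, non_key_attributes=()):
    """Project every item that has both index keys, ordered by the index keys."""
    entries = [
        project(item, projection_type, non_key_attributes)
        for item in items
        if all(key in item for key in INDEX_KEYS)
    ]
    return sorted(entries, key=lambda e: (e["BookTitle"], e["PubDate"]))


# Hypothetical sample rows; dates use ISO format so they sort correctly as text.
books = [
    {"BookTitle": "Book A", "Edition": "Anniversary", "PubDate": "2015-03-01",
     "Publisher": "Pub X", "Language": "English"},
    {"BookTitle": "Book A", "Edition": "Original", "PubDate": "2008-12-28",
     "Publisher": "Pub X", "Language": "English"},
    {"BookTitle": "Book B", "Edition": "Original", "PubDate": "2011-07-15",
     "Publisher": "Pub Y", "Language": "German"},
]

for entry in build_index(books, "KEYS_ONLY"):
    print(entry)
```

In the table, the two Book A rows are ordered by Edition; in the index, the same rows are ordered by PubDate, so their relative order changes.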
In the case of the index, even though the table's range key is one of the index attributes, it does not play any role in the ordering (only the index's hash and range keys take part in the ordering).

There is a drawback to this approach. If the user writes a query using this index to fetch BookTitle and Publisher with PubDate as 28-Dec-2008, then what happens? Will DynamoDB complain that the Publisher attribute is not projected into the index? The answer is no. Even though Publisher is not projected into the index, we can still retrieve it using the local secondary index. However, retrieving a nonprojected attribute forces DynamoDB to fetch it from the base table. So if we are sure that certain attributes need to be fetched frequently, then we must project them into the index; otherwise, the query will consume a large number of capacity units and retrieval will be much slower as well.

One more question about the same query: will DynamoDB complain that the PubDate attribute is not part of the primary key, and hence that queries are not allowed on nonprimary key attributes? The answer is again no. As a rule of thumb, we can write queries on the secondary index attributes. It is possible to include nonprimary key attributes as part of the query, but these attributes must at least be key attributes of the index.

The following code shows how to add non-key attributes to the secondary index's projection:

    // Also requires com.amazonaws.services.dynamodbv2.model.ProjectionType
    private static Projection getProjectionWithNonKeyAttr() {
        Projection projection = new Projection()
            .withProjectionType(ProjectionType.INCLUDE);
        // Non-key attributes to copy into the index alongside the keys
        ArrayList<String> nonKeyAttributes = new ArrayList<String>();
        nonKeyAttributes.add("Language");
        nonKeyAttributes.add("Author2");
        projection.setNonKeyAttributes(nonKeyAttributes);
        return projection;
    }

There is a slight limitation with the local secondary index.
If we write a query on a non-key (both table and index) attribute, then internally DynamoDB might need to scan the entire table; this is inefficient. For example, consider a situation in which we need to retrieve the number of editions of the books in each and every language. Since this query does not use the table's hash key (BookTitle), even if we create a local secondary index on either of the attributes (Edition or Language), the query will still result in a scan operation on the entire table.

Global secondary index

A problem arises here: is there any way in which we can create a secondary index in which both index keys are different from the table's primary keys? The answer is the global secondary index. The following code shows how to create the global secondary index for this scenario:

    // Also requires GlobalSecondaryIndex and ProvisionedThroughput from
    // com.amazonaws.services.dynamodbv2.model
    private static GlobalSecondaryIndex getGlobalSecondaryIndex() {
        GlobalSecondaryIndex gsi = new GlobalSecondaryIndex()
            .withIndexName("Idx_Pub_Edtn")
            .withProvisionedThroughput(new ProvisionedThroughput()
                .withReadCapacityUnits(1L)
                .withWriteCapacityUnits(1L))
            .withProjection(new Projection()
                .withProjectionType("KEYS_ONLY"));

        // Neither index key has to match the table's keys
        ArrayList<KeySchemaElement> indexKeySchema1 =
            new ArrayList<KeySchemaElement>();
        indexKeySchema1.add(new KeySchemaElement()
            .withAttributeName("Language")
            .withKeyType(KeyType.HASH));
        indexKeySchema1.add(new KeySchemaElement()
            .withAttributeName("Edition")
            .withKeyType(KeyType.RANGE));

        gsi.setKeySchema(indexKeySchema1);
        return gsi;
    }

While deciding which attributes to project into a global secondary index, there are trade-offs we must consider between provisioned throughput and storage costs. A few of these are listed as follows:

  • If our application doesn't need to query a table very often, but performs frequent writes or updates against the data in the table, then we must consider projecting the KEYS_ONLY attributes. The global secondary index will be of minimum size, but it will still be available when required for query activity. The smaller the index, the cheaper it is to store, and our write costs will be cheaper too.
  • If we need to access only a few attributes with the lowest possible latency, then we must project only those few attributes into a global secondary index.
  • If we need to access almost all of the non-key attributes of the DynamoDB table on a frequent basis, we can project these attributes (even the entire table) into the global secondary index. This gives us maximum flexibility, with the trade-off that our storage cost would increase, or even double if we project all of the table's attributes into the index. The additional cost of storing the global secondary index may be offset by the cost saved on frequent table scans.
  • If our application will frequently retrieve some non-key attributes, we must consider projecting these non-key attributes into the global secondary index.

Item sharding

Sharding, also called horizontal partitioning, is a technique in which rows are distributed among database servers so that queries run faster. In the case of sharding, a hash operation is performed on the table rows (mostly on one of the columns) and, based on the output of the hash operation, the rows are grouped and sent to the proper database server. Take a look at the following diagram:

As shown in the previous diagram, if all the table data (only four rows and one column are shown for illustration purposes) is stored in a single database server, the read and write operations will become slower, and a server that holds frequently accessed table data will work harder than a server that holds table data that is not accessed frequently.
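The hash-then-route step described above can be sketched in a few lines. This is an illustration only: the MD5-based hash, the function names shard_for and distribute, and every row except Austria are assumptions made for the example, and the hash function a real database uses is internal to that database.

```python
import hashlib


def shard_for(hash_key, num_shards):
    """Map a key to a shard number by hashing it (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(hash_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(rows, key_column, num_shards):
    """Group rows by the shard that their key column hashes to."""
    shards = {number: [] for number in range(num_shards)}
    for row in rows:
        shards[shard_for(row[key_column], num_shards)].append(row)
    return shards


# Hypothetical rows; only the first column matters for placement.
places = [
    {"Place": "Austria"},
    {"Place": "Brazil"},
    {"Place": "Canada"},
    {"Place": "Denmark"},
]

for number, rows in distribute(places, "Place", 2).items():
    print(number, [row["Place"] for row in rows])
```

Because placement depends only on the key, a lookup for a given key can go straight to one server instead of asking all of them.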
The following diagram shows the advantage of sharding over a multitable, multiserver database environment:

In the previous diagram, two tables (Tbl_Places and Tbl_Sports) are shown on the left-hand side with four sample rows of data (Austria.. means that only the first column of the first item is illustrated and all other fields are represented by ..). We are going to perform a hash operation on the first column only. In DynamoDB, this hashing is performed automatically. Once the hashing is done, rows with similar hashes are saved automatically on different servers (if necessary) to satisfy the specified provisioned throughput capacity.

Have you ever wondered about the importance of the hash key, which is mandatory when creating a table? Of course, we all know the importance of the range key and what it does: it simply sorts items based on the range key value. So far, we might have been thinking that the range key is more important than the hash key. That view holds only as long as the table never needs to be partitioned to sustain its throughput. As long as the table data is small, the importance of the hash key is realized only while writing a query operation. However, once the table grows, in order to satisfy the same provisioned throughput, DynamoDB needs to partition our table data based on this hash key (as shown in the previous diagram).

This partitioning of table items based on the hash key attribute is called sharding. It means that the partitions are created by splitting items, not attributes. This is the reason why a query that has the hash key (of the table or the index) retrieves items much faster. Although the number of partitions is managed automatically by DynamoDB, we cannot just hope for things to work fine. We also need to keep certain things in mind; for example, the hash key attribute should have many distinct values.
To simplify, it is not advisable to put binary values (such as Yes or No, Present or Past or Future, and so on) into the hash key attribute, as this restricts the number of partitions. If our hash key attribute has either Yes or No values in all the items, then DynamoDB can create a maximum of only two partitions; therefore, the specified provisioned throughput cannot be achieved.

Just consider that we have created a table called Tbl_Sports with a provisioned throughput capacity of 10, and then we put 10 items into the table. Assuming that only a single partition is created, we are able to retrieve 10 items per second. After some time, we put 10 more items into the table. DynamoDB will create another partition (by hashing over the hash key), thereby satisfying the provisioned throughput capacity. There is a formula taken from the AWS site:

    Total provisioned throughput / partitions = throughput per partition

or

    No. of partitions = Total provisioned throughput / throughput per partition

In order to satisfy the throughput capacity, the other parameters will be automatically managed by DynamoDB.

Summary

In this article, we saw what the local and global secondary indexes are. We walked through projection and its usage with indexes.
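The formula and the warning about low-cardinality hash keys can be combined in a short worked example. The numbers are hypothetical, the function names are made up for illustration, and rounding up to a whole partition is an assumption of this sketch; the actual per-partition limits and partitioning decisions are managed by DynamoDB.

```python
import math


def partitions_needed(total_throughput, throughput_per_partition):
    """No. of partitions = total provisioned throughput / throughput per partition."""
    # Rounding up is an assumption: a fraction of a partition cannot exist.
    return math.ceil(total_throughput / throughput_per_partition)


def usable_partitions(hash_key_values, partitions):
    """Items that share a hash key value stay together, so distinct values cap the spread."""
    return min(partitions, len(set(hash_key_values)))


# Hypothetical figures: 3000 units requested, 1000 units served per partition.
needed = partitions_needed(3000, 1000)
print(needed)

# A Yes/No hash key can never spread over more than two partitions.
print(usable_partitions(["Yes", "No", "Yes", "No", "Yes"], needed))
```

With these made-up figures, three partitions are needed but a Yes/No hash key can occupy only two of them, which is why the provisioned throughput cannot be reached.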

Packt
21 Jul 2014
5 min read

Design patterns

Design patterns are ways to solve a problem and to get your intended result in the best possible manner. So, design patterns are not only ways to create a large and robust system; they also provide great architectures in a friendly manner. In software engineering, a design pattern is a general, repeatable, and optimized solution to a commonly occurring problem within a given context in software design. It is a description or template for how to solve a problem, and the solution can be used in different instances.

The following are some of the benefits of using design patterns:

  • Maintenance
  • Documentation
  • Readability
  • Ease in finding appropriate objects
  • Ease in determining object granularity
  • Ease in specifying object interfaces
  • Ease in implementing even for large software projects
  • Implements the code reusability concept

If you are not familiar with design patterns, the best way to begin understanding them is to observe the solutions we use for commonly occurring, everyday problems. Let's take a look at the following image:

Many different types of power plugs exist in the world. So, we need a solution that is reusable, optimized, and cheaper than buying a new device for each power plug type. In simple words, we need an adapter. Have a look at the following image of an adapter:

In this case, an adapter is the best solution: it is reusable, optimized, and cheap. But an adapter does not help us when our car's wheel blows out. In object-oriented languages, we programmers use objects to get the outcome we desire. Hence, we have many types of objects, situations, and problems. That means we need more than one approach to solve different kinds of problems.
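The power plug analogy maps directly onto code. The following is a minimal sketch of the Adapter pattern in Python; all class and method names are invented for the illustration and do not come from any library.

```python
class EuropeanPlug:
    """An existing device that we cannot change; it only fits round sockets."""

    def insert_into_round_socket(self):
        return "power via round pins"


class PlugAdapter:
    """Wraps a round-pin plug and exposes the interface a square socket expects."""

    def __init__(self, plug):
        self._plug = plug

    def insert_into_square_socket(self):
        # Translate the call the client makes into the call the plug understands.
        return self._plug.insert_into_round_socket()


def charge_from_square_socket(plug):
    """Client code: it only knows how to talk to square-socket plugs."""
    return plug.insert_into_square_socket()


device = EuropeanPlug()
print(charge_from_square_socket(PlugAdapter(device)))
```

The device and the client stay untouched; only the small adapter knows about both interfaces, which is what makes it reusable for any other round-pin device.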
Elements of design patterns

The following are the elements of design patterns:

  • Name: This is a handle we can use to describe the problem
  • Problem: This describes when to apply the pattern
  • Solution: This describes the elements, relationships, responsibilities, and collaborations that we follow to solve a problem
  • Consequences: This details the results and trade-offs of applying the pattern

Classification of design patterns

Design patterns are generally divided into three fundamental groups:

  • Creational patterns
  • Structural patterns
  • Behavioral patterns

Let's examine these in the following subsections.

Creational patterns

Creational patterns are a subset of design patterns in the field of software development; they serve to create objects. They decouple the design of an object from its representation. Object creation is encapsulated and outsourced (for example, to a factory) to keep the context of object creation independent of the concrete implementation. This is in accordance with the rule: "Program to an interface, not an implementation."

Some of the features of creational patterns are as follows:

  • Generic instantiation: This allows objects to be created in a system without having to identify a specific class type in code (Abstract Factory and Factory pattern)
  • Simplicity: Some of the patterns make object creation easier, so callers will not have to write large, complex code to instantiate an object (Builder (Manager) and Prototype pattern)
  • Creation constraints: Creational patterns can put bounds on who can create objects, how they are created, and when they are created

The following patterns are called creational patterns:

  • The Abstract Factory pattern
  • The Factory pattern
  • The Builder (Manager) pattern
  • The Prototype pattern
  • The Singleton pattern

Structural patterns

In software engineering, structural design patterns provide easy ways for various entities to communicate with each other.
Some examples of structural patterns are as follows:

  • Composite: This composes objects into a tree structure (whole-part hierarchies). Composite allows clients to treat individual objects and compositions of objects uniformly.
  • Decorator: This dynamically adds options to an object. A Decorator is a flexible alternative to subclassing for extending functionality.
  • Flyweight: This shares small objects (objects without state of their own) to prevent the creation of too many objects.
  • Adapter: This converts the interface of a class into another interface that the clients expect. Adapter lets classes work together that would normally not be able to because of their different interfaces.
  • Facade: This provides a unified interface to the various interfaces of a subsystem. Facade defines a higher-level interface to the subsystem, which is easier to use.
  • Proxy: This implements a replacement (surrogate) for another object and controls access to the original object.
  • Bridge: This separates an abstraction from its implementation, so that the two can be altered independently.

Behavioral patterns

Behavioral patterns are the patterns most specifically concerned with communication between objects.
The following is a list of the behavioral patterns:

  • Chain of Responsibility pattern
  • Command pattern
  • Interpreter pattern
  • Iterator pattern
  • Mediator pattern
  • Memento pattern
  • Observer pattern
  • State pattern
  • Strategy pattern
  • Template pattern
  • Visitor pattern

If you want to check out the usage of some patterns in the Laravel core, have a look at the following list:

  • The Builder (Manager) pattern: Illuminate\Auth\AuthManager and Illuminate\Session\SessionManager
  • The Factory pattern: Illuminate\Database\DatabaseManager and Illuminate\Validation\Factory
  • The Repository pattern: Illuminate\Config\Repository and Illuminate\Cache\Repository
  • The Strategy pattern: Illuminate\Cache\StoreInterface and Illuminate\Config\LoaderInterface
  • The Provider pattern: Illuminate\Auth\AuthServiceProvider and Illuminate\Hash\HashServiceProvider

Summary

In this article, we have explained the fundamentals of design patterns. We've also introduced some design patterns that are used in the Laravel Framework.
Packt
21 Jan 2014
5 min read

An overview of architecture and modeling in Cassandra

Cassandra uses a peer-to-peer architecture, unlike a master-slave architecture, which is prone to single point of failure (SPOF) problems. Cassandra is deployed on multiple machines, with each machine acting as a node in a cluster. Data is autosharded, that is, automatically distributed across nodes using key-based sharding, which means that the keys are used to distribute the data across the cluster. Each key-value data element in Cassandra is replicated across the cluster on other nodes (the default replication is 3) for high availability and fault tolerance. If a node goes down, the data can be served from another node that has a copy of the original data.

Sharding is an old concept used for distributing data across different systems. Sharding can be horizontal or vertical. In horizontal sharding, in the case of an RDBMS, data is distributed on the basis of rows, with some rows residing on one machine and the other rows residing on other machines. Vertical sharding is similar to columnar storage, where columns can be stored separately in different locations.

The Hadoop Distributed File System (HDFS) uses data-volume-based sharding, where a single big file is sharded and distributed across multiple machines using the block size. So, as an example, if the block size is 64 MB, a 640 MB file will be split into 10 chunks and placed on multiple machines.

The same autosharding capability is used when new nodes are added to Cassandra, where the new node becomes responsible for a specific key range of data. The details of which node holds which key ranges are coordinated and shared across the cluster using the gossip protocol. So, whenever a client wants to access a specific key, any node can locate the key and its associated data quickly, within a few milliseconds. When the client writes data to the cluster, the data will be written to the nodes responsible for that key range.
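Key-based sharding with replication can be pictured as a ring of tokens. The sketch below is a simplified model, not Cassandra's code: the Ring class, the MD5-derived tokens, the node names, and the rule of placing replicas on the next nodes around the ring are all assumptions chosen to keep the example small.

```python
import bisect
import hashlib


def token(value):
    """Derive a position on the ring from a string (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [position for position, _ in self._ring]

    def replicas_for(self, key):
        """The owner is the first node at or after the key's token; replicas follow it."""
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + step) % len(self._ring)][1] for step in range(count)]

    def read_from(self, key, down=()):
        """Serve the key from the first replica that is not marked as down."""
        for node in self.replicas_for(key):
            if node not in down:
                return node
        return None


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes)
owners = ring.replicas_for("user:42")
print(owners)
print(ring.read_from("user:42", down={owners[0]}))

# A new node takes over one key range; all other keys keep their owner.
bigger = Ring(nodes + ["node-e"])
moved = [k for k in map(str, range(1000))
         if ring.replicas_for(k)[0] != bigger.replicas_for(k)[0]]
print(len(moved), "of 1000 keys changed owner")
```

Every key that changes owner in the last step moves to the newly added node, which mirrors the statement above that a new node becomes responsible for a specific key range.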
However, if the node responsible for that key range is down or not reachable, Cassandra uses a clever solution called Hinted Handoff, which allows the data to be managed by another node in the cluster and to be written back to the responsible node once that node is back in the cluster.

The replication of data raises the concern of data inconsistency, as the replicas might have different states for the same data. Cassandra uses mechanisms such as anti-entropy and read repair to solve this problem and synchronize data across the replicas. Anti-entropy is used at the time of compaction, where compaction is a concept borrowed from Google BigTable. Compaction in Cassandra refers to the merging of SSTables; it helps in optimizing data storage and increasing read performance by reducing the number of seeks across SSTables.

Another problem that compaction solves is handling deletion in Cassandra. Unlike in a traditional RDBMS, all deletes in Cassandra are soft deletes, which means that the records still exist in the underlying data store but are marked with a special flag so that these deleted records do not appear in query results. The records marked as deleted are called tombstone records. Major compactions handle these soft deletes or tombstones by removing them from the SSTables in the underlying file stores.

Cassandra, like Dynamo, uses a Merkle tree data structure to represent the data state at a column family level in a node. This Merkle tree representation is used during major compactions to find the differences in the data states across nodes and reconcile them. The Merkle tree, or hash tree, is a data structure in the form of a tree where every non-leaf node is labeled with the hash of its children nodes, allowing the efficient and secure verification of the contents of a large data structure.

Cassandra, like Dynamo, falls under the AP part of the CAP theorem and offers a tunable consistency level.
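The way a Merkle tree narrows down differences between replicas can be shown with a small sketch. It is a simplified illustration: the function names are hypothetical, the leaves are plain key=value strings rather than the data ranges a real cluster would hash, and SHA-256 is an arbitrary choice.

```python
import hashlib


def _hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_merkle(leaves):
    """Return the tree as a list of levels: leaf hashes first, root hash last."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_hash(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # pair an odd last node with itself
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(tree_a, tree_b):
    """Walk both trees from the root, descending only where the hashes differ."""
    if [len(level) for level in tree_a] != [len(level) for level in tree_b]:
        raise ValueError("trees must cover the same number of leaves")
    suspects = [0]
    for depth in range(len(tree_a) - 1, 0, -1):
        mismatched = [i for i in suspects if tree_a[depth][i] != tree_b[depth][i]]
        below = len(tree_a[depth - 1])
        suspects = [child for i in mismatched
                    for child in (2 * i, 2 * i + 1) if child < below]
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]


replica_a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4", "k5=v5"]
replica_b = ["k1=v1", "k2=v2", "k3=v3", "k4=stale", "k5=v5"]

tree_a, tree_b = build_merkle(replica_a), build_merkle(replica_b)
print(tree_a[-1][0] == tree_b[-1][0])
print(differing_leaves(tree_a, tree_b))
```

Two replicas with equal root hashes need no further comparison; when the roots differ, only the subtrees whose hashes disagree are visited, so the amount of data exchanged stays small.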
Cassandra provides multiple consistency levels, as illustrated in the following table:

Operation | ZERO | ANY | ONE | QUORUM | ALL
Read | Not supported | Not supported | Reads from one node | Read from a majority of nodes with replicas | Read from all the nodes with replicas
Write | Asynchronous write | Writes on one node including hints | Writes on one node with commit log and Memtable | Writes on a majority of nodes with replicas | Writes on all the nodes with replicas

A summary of the features in Cassandra

The following table summarizes the key features of Cassandra with respect to its origins in Google BigTable and Amazon Dynamo:

Feature | Cassandra implementation | Google BigTable | Amazon Dynamo
Architecture | Peer-to-peer architecture, ring-based deployment architecture | No | Yes
Data model | Multidimensional map (row, column, timestamp) -> bytes | Yes | No
CAP theorem | AP with tunable consistency | No | Yes
Storage architecture | SSTable, Memtables | Yes | No
Storage layer | Local filesystem storage | No | No
Fast reads and efficient storage | Bloom filters, compactions | Yes | No
Programming language | Java | No | Yes
Client programming language | Multiple languages supported: Java, PHP, Python, REST, C++, .NET, and so on | Not known | Not known
Scalability model | Horizontal scalability; multiple-node deployment rather than a single-machine deployment | Yes | Yes
Version conflicts | Timestamp field (not a vector clock as usually assumed) | No | No
Hard deletes/updates | Data is always appended using the timestamp field; deletes/updates are soft appends and are cleaned asynchronously as part of major compactions | Yes | No

Summary

Cassandra packs the best features of two technologies proven at scale: Google BigTable and Amazon Dynamo. However, today Cassandra has evolved beyond these origins with new, unique, and enterprise-ready features such as Cassandra Query Language (CQL), support for collection columns, lightweight transactions, and triggers.

Packt
21 Jan 2014
9 min read

Organizing Backbone Applications - Structure, Optimize, and Deploy

Creating application architecture

"The essential premise at the heart of Backbone has always been to try and discover the minimal set of data-structuring (Models and Collections) and user interface (Views and URLs) primitives that are useful when building web applications with JavaScript."
Jeremy Ashkenas, creator of Backbone.js, Underscore.js, and CoffeeScript

As Jeremy mentioned, Backbone.js has no intention, at least in the near future, of raising its bar to provide application architecture. Backbone will continue to be a lightweight tool that provides the minimal features required for web development. So, should we blame Backbone.js for not including such functionality even though there is a huge demand for it in the developer community? Certainly not! Backbone.js only provides the set of components that are necessary to create the backbone of an application and gives us complete freedom to build the app architecture in whichever way we want.

"If working on a significantly large JavaScript application, remember to dedicate sufficient time to planning the underlying architecture that makes the most sense. It's often more complex than you may initially imagine."
Addy Osmani, author of Patterns For Large-Scale JavaScript Application Architecture

So, as we start digging into more detail on creating an application architecture, we are not going to talk about trivial applications or something similar to a to-do-list app. Rather, we will investigate how to structure a medium- or large-scale application. After discussions with a number of developers, we found that the main issue they face is that online blog posts and tutorials offer several different methodologies for structuring an application. While most of these tutorials talk about good practices, it becomes difficult to choose exactly one of them.
Keeping that in mind, we will explore a number of steps that you should follow to make your app robust and maintainable in the long run.

Managing a project directory

This is the first step towards creating a solid app architecture. We have already discussed this in detail in the previous sections. If you are comfortable using another directory layout, go ahead with it. The directory structure will not matter much if the rest of your application is organized properly.

Organizing code with AMD

We will use RequireJS for our project. As discussed earlier, it comes with a number of benefits, such as the following:

  • Adding a lot of script tags to one HTML file and managing all of the dependencies on your own may work for a medium-sized project, but will gradually fail for a large project. Such a project may have thousands of lines of code; managing a code base of that size requires small modules to be defined in individual files. With RequireJS, you do not need to worry about how many files you have; you just know that if the standard is followed properly, it is bound to work.
  • The global namespace is never touched, so you can freely give each module the name that suits it best.
  • Debugging RequireJS modules is a lot easier than with other approaches, because every module definition tells you what its dependencies are and the path to each of them.
  • You can use r.js, an optimization tool for RequireJS that minifies all the JavaScript and CSS files, to create the production-ready build.

Setting up an application

For a Backbone app, there must be a centralized object that holds together all the components of the application. In a simple application, most people just make the main router work as the central object. But that will surely not work for a large application; you need an Application object that works as the parent component.
This object should have a method (usually init()) that works as the entry point to your application and initializes the main router along with the Backbone history. In addition, either your Application class should extend Backbone.Events or it should include a property that points to an object extended with Backbone.Events. The benefit of doing this is that the app or that object can act as a central event aggregator, and you can trigger application-level events on it. A very basic Application class will look like the following code snippet:

    // File: application.js
    define([
      'underscore',
      'backbone',
      'router'
    ], function (_, Backbone, Router) {
      var Application = function () {
        // Do useful stuff here
      };

      _.extend(Application.prototype, {
        // the event aggregator: a plain object mixed with Backbone.Events
        // (Backbone.Events is a mixin, not a constructor, so it cannot be
        // instantiated with "new")
        pubsub: _.extend({}, Backbone.Events),

        init: function () {
          // assumes the 'router' module exports a Backbone.Router subclass;
          // the router must exist before history starts
          this.router = new Router();
          Backbone.history.start();
        }
      });

      return Application;
    });

Application is a simple class with an init() method and a pubsub event aggregator. The init() method acts as the starting point of the application and pubsub works as the application-level event manager. You can add more functionality to the Application class, such as starting and stopping modules and adding a region manager for view layout management. It is advisable to keep this class as short as you can.

Using the module pattern

We often see that intermediate-level developers find it a bit confusing at first to use a module-based architecture. It can be a little difficult for them to make the transition from a simple MVC architecture to a modular MVC architecture. While the points we are discussing in this article are valid for both of these architectures, we should always prefer a modular approach in nontrivial applications for better maintainability and organization. In the directory structure section, we saw how a module consists of a main.js file together with its views, models, and collections.
The main.js file will define the module and have different methods to manage the other components of that module. It works as the starting point of the module. A simple main.js file will look like the following code:

    // File: main.js
    define([
      'app/modules/user/views/userlist',
      'app/modules/user/views/userdetails'
    ], function (UserList, UserDetails) {
      // private to the module; not visible to other modules
      var myVar;

      return {
        initialize: function () {
          this.showUserList();
        },

        showUserList: function () {
          var userList = new UserList();
          userList.show();
        },

        showUserDetails: function (userModel) {
          var userDetails = new UserDetails({
            model: userModel
          });
          userDetails.show();
        }
      };
    });

As you can see, the responsibility of this file is to initiate the module and manage the components of that module. We have to make sure that it handles only parent-level tasks; it shouldn't contain a method that one of its views should ideally have.

The concept is not very complex, but you need to set it up properly in order to use it for a large application. You can even take an existing app and module setup and integrate it with your Backbone app. For instance, Marionette provides an application infrastructure for Backbone apps. You can use its built-in Application and Module classes to structure your application. It also provides a general-purpose Controller class, something that doesn't come with the Backbone library but can be used as a mediator to provide generic methods and work as a common medium among the modules.

You can also use AuraJS (https://github.com/aurajs/aura), a framework-agnostic event-driven architecture developed by Addy Osmani (http://addyosmani.com) and many others; it works quite well with Backbone.js. A thorough discussion of AuraJS is beyond the scope of this book, but you can find a lot of useful information about it in its documentation and examples (https://github.com/aurajs/todomvc).
It is an excellent boilerplate tool that gives your app a kick-start and we highly recommend it, especially if you are not using the Marionette application infrastructure. The following are a few benefits of using AuraJS; they may help you choose this framework for your application:

- AuraJS is framework-agnostic. Though it works great with Backbone.js, you can use it for your JavaScript module architecture even if you aren't using Backbone.js.
- It uses the module pattern, with application-level and module-level communication handled through the facade (sandbox) and mediator patterns.
- It abstracts away the utility libraries that you use (such as templating and DOM manipulation) so you can swap in alternatives anytime you want.

Managing objects and module communication

One of the most important ways to keep the application code maintainable is to reduce tight coupling between modules and objects. If you are following the module pattern, you should never let one module communicate with another directly. Loose coupling limits how far a change can spread: a change in one module should not force a change in the rest of the application. Moreover, it lets you reuse the same modules elsewhere. But how can modules communicate if there is no direct relationship? The two important patterns we use in this case are the observer and mediator patterns.

Using the observer/PubSub pattern

The PubSub pattern is essentially an event dispatcher. It works as a messaging channel between the object (publisher) that fires the event and another object (subscriber) that receives the notification. We mentioned earlier that we can have an application-level event aggregator as a property of the Application object. This event aggregator can work as the common channel through which the other modules communicate without interacting with each other directly.
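The following is a minimal sketch of that idea, not the book's own code. To keep it runnable without any library, it assumes a small stand-in, createEventAggregator(), that mimics only the on/off/trigger subset of Backbone.Events. The module names, the 'user:selected' event name, and the lastSelected property are hypothetical and exist only for illustration.

```javascript
// Stand-in for the on/off/trigger subset of Backbone.Events (an assumption
// made so that the sketch runs without loading Backbone or Underscore).
function createEventAggregator() {
  var handlers = {};
  return {
    // Register a callback for an event name, with an optional "this" context.
    on: function (name, callback, context) {
      (handlers[name] = handlers[name] || []).push({
        callback: callback,
        context: context
      });
      return this;
    },
    // Remove one callback, or every callback for the event if none is given.
    off: function (name, callback) {
      handlers[name] = (handlers[name] || []).filter(function (h) {
        return callback && h.callback !== callback;
      });
      return this;
    },
    // Call every registered callback with the arguments after the event name.
    trigger: function (name) {
      var args = Array.prototype.slice.call(arguments, 1);
      (handlers[name] || []).slice().forEach(function (h) {
        h.callback.apply(h.context, args);
      });
      return this;
    }
  };
}

// Application-level aggregator, shared by all modules.
var pubsub = createEventAggregator();

// Hypothetical publisher module: it fires an event and knows nothing
// about who is listening.
var userModule = {
  selectUser: function (user) {
    pubsub.trigger('user:selected', user);
  }
};

// Hypothetical subscriber module: it never references userModule.
var dashboardModule = {
  lastSelected: null,
  initialize: function () {
    pubsub.on('user:selected', this.onUserSelected, this);
  },
  onUserSelected: function (user) {
    this.lastSelected = user.name;
  }
};

dashboardModule.initialize();
userModule.selectUser({ name: 'Ada' });
console.log(dashboardModule.lastSelected); // logs "Ada"
```

In a real Backbone app, the stand-in would typically be replaced by _.extend({}, Backbone.Events), and the two modules would receive the aggregator from the Application object rather than from a shared variable.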
Even at the module level, you may need a common event dispatcher only for that module; the views, models, and collections of that module can use it to communicate with each other. However, publishing too many events via a dispatcher can make them difficult to manage, so you must think carefully about which events you should publish via a generic dispatcher and which ones you should fire on a specific component only. Even so, this pattern is one of the best tools for designing a decoupled system, and you should always have one ready for use in your module-based application.

Summary

This article dealt with one of the most important topics of Backbone.js-based application development. At the framework level, learning Backbone is quite easy, and developers can get a good grasp of it in a very short period of time.

Resources for Article:

Further resources on this subject:

- Building an app using Backbone.js [article]
- Testing Backbone.js Application [article]
- Understanding Backbone [article]