
Tech Guides - Databases

12 Articles

PostgreSQL committer Stephen Frost shares his vision for PostgreSQL version 12 and beyond

Sugandha Lahoti
04 Dec 2019
8 min read
PostgreSQL version 12 was released in October this year and has earned a strong reputation for being reliable, feature-robust, and performant. During the PostgreSQL Conference for the Asia Pacific, PostgreSQL major contributor and committer Stephen Frost talked about a number of new features available in PostgreSQL 12, including pluggable storage, partitioning and performance improvements, as well as SQL features. This post is a short synopsis of his keynote; the full talk is available on YouTube.

Want to learn how to build your own PostgreSQL applications? PostgreSQL 12 has an array of interesting features such as advanced indexing, high availability, database configuration, and database monitoring to efficiently manage and maintain your database. If you are a database developer and want to leverage PostgreSQL 12, we recommend our latest book, Mastering PostgreSQL 12 - Third Edition, written by Hans-Jürgen Schönig. This book examines the newly released features in PostgreSQL 12 in detail to help you build efficient and fault-tolerant PostgreSQL applications.

Stephen Frost is the CTO of Crunchy Data. As a PostgreSQL major contributor he implemented roles support in version 8.1 to replace the existing user/group system, SQL column-level privileges in version 8.4, and Row Level Security in PostgreSQL 9.5. He has also spoken at numerous conferences, including pgConf.EU, pgConf.US, PostgresOpen, SCALE, and others.

Stephen Frost on PostgreSQL 12 features

Pluggable storage

This release introduces the pluggable table storage interface, which allows developers to create their own methods for storing data. Before version 12, Postgres had a single storage engine - one primary heap. All indexes were secondary indexes, which means they referred directly to pointers on disk. This heap structure was also row-based: every row carries a large header, which can cause issues when storing very narrow tables (tables with two or fewer columns). PostgreSQL 12 now has the ability to support multiple storage formats underneath - pluggable storage. This new feature is going to be the basis for columnar storage, probably coming in v13 or v14. It is also going to be the basis for zheap, an alternative heap that allows in-place updates and uses an undo log instead of the redo log that PostgreSQL has today. Version 12 builds the infrastructure for pluggable storage; there is nothing user-facing yet. It will not be until v13 and later that new storage mechanisms built on top of the pluggable storage feature actually appear.

Partitioning improvements

Postgres 12 brings major improvements to the declarative partitioning capability, making partitions more effective and easier to work with. Partition selection has improved dramatically, especially when selecting from a few partitions out of a large set. Postgres 12 also adds the ability to attach and detach partitions concurrently - that is, to attach or detach a partition on the fly without having to take very heavy locks, so you can add new partitions to your partitioning scheme without any outage, downtime, or impact on the ongoing operations of the system. This release also increases the number of partitions that can be handled: the initial declarative partitioning patch made planning slow once you got over a few hundred partitions, and this is now fixed with a much faster planning methodology. Postgres 12 also allows multi-insert during COPY statements into partitioned tables - COPY is the way you bulk-load data into Postgres, and this change makes copying into partitioned tables much faster. There is also a new function, pg_partition_tree, to display partition information.
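As a rough illustration of the partitioning workflow described above, the following sketch uses a hypothetical measurements table (the table and column names are ours, not from the talk); the syntax is standard PostgreSQL 12 DDL.

```sql
-- A range-partitioned parent table (illustrative names).
CREATE TABLE measurements (
    city_id  int  NOT NULL,
    logdate  date NOT NULL,
    peaktemp int
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2019_q4
    PARTITION OF measurements
    FOR VALUES FROM ('2019-10-01') TO ('2020-01-01');

-- Attach a pre-built table as a new partition; in version 12 this takes
-- a much less restrictive lock, so it can happen while the table is in use.
CREATE TABLE measurements_2020_q1 (LIKE measurements INCLUDING DEFAULTS);
ALTER TABLE measurements
    ATTACH PARTITION measurements_2020_q1
    FOR VALUES FROM ('2020-01-01') TO ('2020-04-01');

-- Inspect the partition hierarchy with the new pg_partition_tree() function.
SELECT * FROM pg_partition_tree('measurements');
```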
Performance improvements and SQL features

Parallel query with SERIALIZABLE

Parallel query has been in Postgres since version 9.6, but it did not work with the serializable isolation level. Serializable is the highest level of isolation you can have inside Postgres. With Postgres 12, you can run a parallel query under serializable and keep that highest level of isolation, which increases the number of places where parallel query can be used. It also lets application authors worry less about concurrency, because serializable in Postgres provides true serializability, something that exists in very few databases.

Faster float handling

Postgres 12 has a new library for converting floating-point values into text. This provides a significant speedup for many workloads that do text-based transfer of data, although it may produce slightly different (possibly more correct) output.

Partial de-TOAST/decompress a value

Historically, to access any compressed TOAST value you had to decompress the whole thing into memory, which was far from ideal when you only wanted the front of it. Partial de-TOAST allows decompressing just a section of the TOAST value. This gives a great performance improvement for cases such as:
- PostGIS geometry/geography, where data at the front can be used for filtering
- Pulling just the start of a text string

COPY FROM with WHERE

Postgres 12 now supports a WHERE clause on the COPY FROM statement, which lets you filter data/records while importing. Previously this was done using file_fdw, which was tedious because it required creating a foreign table.

Create or Replace Aggregate

This feature allows an aggregate either to be created if it does not exist, or replaced if it does. It makes extension upgrade scripts much simpler, and was requested specifically by the Postgres community.

Inline common table expressions

Not having inline CTEs was seen as an optimization barrier. From version 12, Postgres inlines CTEs by default where it can. The old behavior is still supported: if you actually want a CTE to act as an optimization barrier, you can specify WITH ... AS MATERIALIZED when you write your CTE.

SQL/JSON improvements

Progress has also been made towards supporting the SQL/JSON standard:
- A number of jsonpath functions were added: jsonb_path_exists, jsonb_path_match, jsonb_path_query, jsonb_path_query_array, and jsonb_path_query_first
- New operators for working with JSON were added: jsonb @? jsonpath (a wrapper for jsonb_path_exists) and jsonb @@ jsonpath (a wrapper for jsonb_path_match)
- Index support for these operators should also be added soon
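A few of these SQL-level features are easy to show directly. The snippet below is a hedged sketch against the hypothetical measurements table from the previous example, plus an invented events table with a jsonb payload column; only the syntax itself comes from PostgreSQL 12, the names and values are ours.

```sql
-- Parallel query now also works at the serializable isolation level.
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM measurements;   -- eligible for a parallel plan in v12
COMMIT;

-- COPY ... FROM with a WHERE clause: filter rows while bulk loading.
COPY measurements FROM '/tmp/measurements.csv'
    WITH (FORMAT csv)
    WHERE logdate >= DATE '2019-01-01';

-- CTEs are inlined by default in v12; MATERIALIZED restores the old
-- optimization-barrier behaviour when you explicitly want it.
WITH recent AS MATERIALIZED (
    SELECT * FROM measurements WHERE logdate >= DATE '2019-10-01'
)
SELECT city_id, max(peaktemp) FROM recent GROUP BY city_id;

-- SQL/JSON path query using the new jsonb @? operator
-- (events/payload are invented names for illustration).
SELECT count(*) FROM events
WHERE payload @? '$.user.address ? (@.city == "Oslo")';
```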
recovery.conf moved into postgresql.conf

recovery.conf is no longer available in PostgreSQL 12; all of its options have moved into postgresql.conf. This allows changing recovery parameters via ALTER SYSTEM and increases flexibility, for example by allowing the primary to be changed via ALTER SYSTEM and a reload. However, this is a disruptive change: every high-availability environment will have to change, but it reduces the fragility of high-availability solutions moving forward. A new pg_promote function has also been added to allow promoting a replica from SQL.

Control over SSL protocols

With Postgres 12, you can now control which SSL protocols are allowed. Older SSL protocols need to be disabled for security reasons; previously this was enforced with FIPS mode, and it is now addressed in CIS benchmark/STIG updates.

Covering GiST indexes

GiST indexes can now also use INCLUDE. This is useful for adding columns that enable index-only scans, and it allows including columns that are not part of the search key.

CSV output mode in psql

Previously you could get CSV output, but only by wrapping your query inside a COPY statement. Now you can use the new pset format for CSV output from psql, which returns each row in CSV format instead of the tabular format.

Option to sample queries

The new log_statement_sample_rate parameter allows you to set log_min_duration_statement very low, or even to zero. Logging all statements is very expensive: it slows down the whole system, and you end up with a backlog of processes trying to write into the logging system. The new log_statement_sample_rate parameter logs only a sample of those queries rather than every one, while log_min_duration_statement still excludes very fast queries. This helps with analysis in environments with lots of fast queries.

New HBA option: clientcert=verify-full

This new HBA option enables two-factor authentication where one of the factors is a certificate and the other might be a password or something else (PAM, LDAP, etc.). It gives you the ability to require that every user has a client-side certificate, that the certificate is validated by the server on connection, and that a password (or other credential) is still provided. It works with non-certificate authentication methods while requiring client-side certificates to be used.

In his talk, Stephen also answered commonly asked questions about Postgres; watch the full video to know more. You can read about other performance and maintenance enhancements in PostgreSQL 12 on the official blog. To learn advanced PostgreSQL 12 concepts with real-world examples and sample datasets, go through the book Mastering PostgreSQL 12 - Third Edition by Hans-Jürgen Schönig.

Introducing PostgREST, a REST API for any PostgreSQL database written in Haskell
Percona announces Percona Distribution for PostgreSQL to support open source databases
Wasmer's first Postgres extension to run WebAssembly is here!


What does a data science team look like?

Fatema Patrawala
21 Nov 2019
11 min read
Until a couple of years ago, people barely knew the term 'data science', which has now evolved into an extremely popular career field. The Harvard Business Review dubbed the data scientist within the data science team the sexiest job of the 21st century, and expert professionals jumped on the 'data is the new oil' bandwagon. As per the Figure Eight Report 2018, which takes the pulse of the data science community in the US, a lot has changed rapidly in the data science field over the years. For the 2018 report, they surveyed approximately 240 data scientists and found that machine learning projects have multiplied and more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs, and the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, it is evident that data science teams have had to evolve and change across organizations.

The data science team is responsible for delivering complex projects where system analysis, software engineering, data engineering, and data science are used to deliver the final solution. To achieve all of this, the team includes not only a data scientist or a data analyst but also other roles like business analyst, data engineer or architect, and chief data officer. In this post, we will differentiate and discuss the various job roles within a data science team, the skill sets required, and the compensation for each one of them. For an in-depth understanding of data science teams, read the book Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps.

Now let's get into understanding individual data science roles and functions, but before that, let's take a look at the structure of the team. There are three basic team structures to match different stages of AI/ML adoption:

IT-centric team structure

At times, hiring a data science team is not an option for companies, and they have to leverage in-house talent. In such situations, they take advantage of the fully functional in-house IT department. The IT team manages functions like data preparation, training models, creating user interfaces, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy them. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold, which helps in fast training and early deployment of models even without an expert data scientist on board.

Integrated team structure

Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resources is the most viable option for constant and scalable machine learning operations. Unlike the IT-centric approach, the integrated method requires having an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques.
Additionally, the team leverages a deeper understanding of machine learning tools and libraries, like TensorFlow or Theano, which are aimed specifically at researchers and data science experts.

Specialized data science team

Companies can also have an independent data science department to build all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. That doesn't necessarily mean that all team members should have a data science background, but they should have a technology background with certain service-management skills. A specialized structure helps address complex data science tasks that involve research, the use of multiple ML models tailored to various aspects of decision-making, or multiple ML-backed services. Today's most successful Silicon Valley tech companies operate with specialized data science teams that are custom-built and wired for specific tasks to achieve different business goals. The team structure at Airbnb is one of the most interesting use cases: Martin Daniel, a data scientist at Airbnb, explains in this talk how the team emphasizes an experimentation-centric culture and applies machine learning rigorously to address unique product challenges.

Job roles and responsibilities within a data science team

As discussed earlier, there are many roles within a data science team. As per Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis: individuals in this category are statisticians who make sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, and so on. Type B stands for building: these individuals use data in production. They are good software engineers with strong programming knowledge and a statistics background, and they build recommendation systems, personalization use cases, and the like. It is rare that one expert fits neatly into a single category, but understanding these data science functions can help make sense of the roles described further.

Chief data officer/chief analytics officer

The chief data officer (CDO) role has been taking organizations by storm. The recent NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization had appointed a chief data officer. The role of chief data officer involves overseeing a range of data-related functions that may include data management, ensuring data quality, and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, they are often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and how a company should treat its data, which makes sense, as analytics provide insights and value to the data. Hence, with a CDO+CAO combination, companies can take advantage of a good data strategy and proper data management without losing out on quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually.
Skill sets required: Data science and analytics, programming skills, domain expertise, and leadership and visionary abilities.

Data analyst

The data analyst role implies proper data collection and interpretation activities. The person in this job role ensures that collected data is relevant and exhaustive, and also interprets the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States.

Skill sets required: Programming languages like R, Python, JavaScript, C/C++, and SQL. In addition, critical thinking, data visualization, and presentation skills are good to have.

Data scientist

Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore what problems need to be solved. A data scientist is an individual who develops machine learning models to make predictions and is well versed in algorithm development and computer science; this person also knows the complete lifecycle of model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data and using various types of analytics and reporting tools to detect patterns, trends, and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709.

Skill sets required: A data scientist requires knowledge of big data platforms and tools like Seahorse powered by Apache Spark, JupyterLab, TensorFlow, and MapReduce; programming languages that include SQL, Python, Scala, and Perl; and statistical computing languages such as R. They should also have cloud computing capabilities and knowledge of various cloud platforms like AWS, Microsoft Azure, etc. You can also read this post on how to ace a data science interview to know more.

Machine learning engineer

A data scientist is sometimes confused with a machine learning engineer, but a machine learning engineer is a distinct role with different responsibilities. A machine learning engineer is someone who is responsible for combining software engineering and machine modeling skills. This person determines which model to use and what data should be used for each model; probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer salary is $146,085 in the US, and the role is ranked No. 1 on Indeed's Best Jobs in 2019 list.

Skill sets required: Machine learning engineers are required to have expertise in computer science and programming languages like R, Python, Scala, Java, etc. They should also have a command of probability, data modelling, and evaluation techniques.

Data architects and data engineers

The data architect and the data engineer work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer then uses to build the digital framework. The data engineering role has recently evolved from the traditional software engineering field.
Recent enterprise data management experiments indicate that data-focused software engineers are needed to work along with data architects to build a strong data architecture. The average salary for a data architect in the US ranges from $122,000 to $129,000 annually, as per a recent LinkedIn survey.

Skill sets required: A data architect or engineer should have a keen interest in and experience with programming languages and frameworks like HTML5, RESTful services, Spark, Python, Hive, Kafka, and CSS. They should have the knowledge and experience to handle database technologies such as PostgreSQL, MapReduce, and MongoDB, and visualization platforms such as Tableau and Spotfire.

Business analyst

A business analyst (BA) basically handles the chief analytics officer's role, but at the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap. They are responsible for using data analytics to assess processes, determine requirements, and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes to processes, products, services, software, and hardware will be implemented. They further articulate these ideas and balance them against what is technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed.

Skill sets required: Excellent domain and industry expertise. In addition, good communication and data visualization skills and knowledge of business intelligence tools are good to have.

Data visualization engineer

This specific role is not present in every data science team, as some of its responsibilities are covered by either a data analyst or a data architect; hence, this role is usually only necessary in a specialized data science model. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for your stakeholders. Regardless of the technology, successful data visualization engineers have to understand the principles of design, both graphical and, more generally, user-centered design. As per PayScale, the average salary for a data visualization engineer is $98,264.

Skill sets required: A data visualization engineer needs to have rigorous knowledge of data visualization methods and be able to produce various charts and graphs to represent data. Additionally, they must understand the fundamentals of design principles and the visual display of information.

To sum it up, the data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building up the team from scratch and find it hard to figure out where to start. If you are facing a similar dilemma, check out the book Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing, and sustaining a successful data science team.

How to learn data science: from data mining to machine learning
How to ace a data science interview
Data science vs. machine learning: understanding the difference and what it means today
30 common data science terms explained
9 Data Science Myths Debunked


Building a scalable PostgreSQL solution 

Natasha Mathur
14 Apr 2019
12 min read
The term scalability means the ability of a software system to grow as the business using it grows. PostgreSQL provides some features that help you to build a scalable solution but, strictly speaking, PostgreSQL itself is not scalable. It can effectively utilize the following resources of a single machine:
- It uses multiple CPU cores to execute a single query faster with the parallel query feature
- When configured properly, it can use all available memory for caching
- The size of the database is not limited; PostgreSQL can utilize multiple hard disks when multiple tablespaces are created, and with partitioning the hard disks can be accessed simultaneously, which makes data processing faster

However, when it comes to spreading a database solution across multiple machines, it can be quite problematic, because a standard PostgreSQL server can only run on a single machine. In this article, we will look at different scaling scenarios and their implementation in PostgreSQL. The requirement for a system to be scalable means that a system that supports a business now should also be able to support the same business with the same quality of service as it grows.

This article is an excerpt taken from the book Learning PostgreSQL 11 - Third Edition, written by Andrey Volkov and Salahadin Juba. The book explores the concepts of relational databases and their core principles. You'll get to grips with using data warehousing in analytical solutions and reports, and with scaling the database for high availability and performance.

Let's say a database can store 1 GB of data and effectively process 100 queries per second. What if, with the development of the business, the amount of data being processed grows 100 times? Will it be able to support 10,000 queries per second and process 100 GB of data? Maybe not now, and not in the same installation. However, a scalable solution should be ready to be expanded to handle the load as soon as it is needed. In scenarios where better performance is required, it is quite common to set up more servers that handle additional load and to copy the same data to them from a master server. In scenarios where high availability is required, this is also a typical solution: continuously copy the data to a standby server so that it can take over in case the master server crashes.

Scalable PostgreSQL solution

Replication can be used in many scaling scenarios. Its primary purpose is to create and maintain a backup database in case of system failure; this is especially true for physical replication. However, replication can also be used to improve the performance of a solution based on PostgreSQL. Sometimes, third-party tools can be used to implement complex scaling scenarios.
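As a quick, hedged illustration of how replication shows up at the SQL level (this snippet is ours, not from the book), a primary and its hot standbys can be inspected with standard system views and functions:

```sql
-- On the primary: list connected standbys and their replication state.
SELECT client_addr, state, sync_state, replay_lsn
FROM pg_stat_replication;

-- On a standby: confirm the server is running in recovery,
-- i.e. acting as a read-only hot standby.
SELECT pg_is_in_recovery();
```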
Scaling for heavy querying

Imagine there's a system that's supposed to handle a lot of read requests. For example, there could be an application that implements an HTTP API endpoint to support auto-completion on a website. Each time a user enters a character in a web form, the system searches the database for objects whose names start with the string the user has entered. The number of queries can be very big because of the large number of users, and also because several requests are processed for every user session. To handle large numbers of requests, the database should be able to utilize multiple CPU cores. If the number of simultaneous requests is really large, the number of cores required to process them can be greater than a single machine could have.

The same applies to a system that is supposed to handle multiple heavy queries at the same time. You don't need a lot of queries, but when the queries themselves are big, using as many CPUs as possible offers a performance benefit, especially when parallel query execution is used. In such scenarios, where one database cannot handle the load, it's possible to set up multiple databases, set up replication from one master database to all of them so that each works as a hot standby, and then let the application query different databases for different requests. The application itself can be smart and query a different database each time, but that requires a special implementation of the data-access component of the application. Another option is to use a tool called Pgpool-II, which can work as a load balancer in front of several PostgreSQL databases. The tool exposes a SQL interface, and applications can connect to it as if it were a real PostgreSQL server; Pgpool-II then redirects each query to the database that is executing the fewest queries at that moment - in other words, it performs load balancing. Yet another option is to scale the application together with the databases, so that one instance of the application connects to one instance of the database. In that case, the users of the application should connect to one of the many application instances, which can be achieved with HTTP load balancing.

Data sharding

When the problem is not the number of concurrent queries but the size of the database and the speed of a single query, a different approach can be implemented. The data can be separated onto several servers, which are queried in parallel, and the results of the queries are then consolidated outside of those databases. This is called data sharding. PostgreSQL provides a way to implement sharding based on table partitioning, where partitions are located on different servers and another one, the master server, uses them as foreign tables. When performing a query on a parent table defined on the master server, depending on the WHERE clause and the definitions of the partitions, PostgreSQL can recognize which partitions contain the data that is requested and query only those partitions. Depending on the query, sometimes joins, grouping, and aggregation can be performed on the remote servers. PostgreSQL can query different partitions in parallel, which effectively utilizes the resources of several machines. With all of this, it's possible to build a solution in which applications connect to a single database that physically executes their queries on different database servers depending on the data being queried; a minimal sketch of this setup follows below. It's also possible to build sharding algorithms into the applications that use PostgreSQL: in short, applications would be expected to know which data is located in which database, write it only there, and read it only from there. This adds a lot of complexity to the applications.
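The FDW-based sharding approach described above can be sketched roughly as follows. This is our own minimal example, not the book's (the server names, table, and credentials are invented); it combines the standard postgres_fdw extension with declarative partitioning, so that one partition is a foreign table stored on a remote shard.

```sql
-- On the master/coordinator server: make the remote shard reachable.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'sales');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- Parent table, partitioned by region.
CREATE TABLE orders (
    order_id bigint NOT NULL,
    region   text   NOT NULL,
    amount   numeric
) PARTITION BY LIST (region);

-- A local partition for one region...
CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('us');

-- ...and a remote partition: a foreign table living on shard1
-- (the matching table must already exist on the remote server).
CREATE FOREIGN TABLE orders_eu
    PARTITION OF orders FOR VALUES IN ('eu')
    SERVER shard1
    OPTIONS (table_name 'orders_eu');

-- Queries against the parent are routed to the relevant partitions,
-- local or remote, based on the WHERE clause.
SELECT region, sum(amount) FROM orders WHERE region = 'eu' GROUP BY region;
```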
Another option is to use one of the PostgreSQL-based sharding solutions available on the market, or one of the open source solutions. They have their own pros and cons, but the common problem is that they are based on previous releases of PostgreSQL and don't use the most recent features (sometimes providing their own features instead).

One of the most popular sharding solutions is Postgres-XL, which implements a shared-nothing architecture using multiple servers running PostgreSQL. The system has several components:
- Multiple data nodes: store the data
- A single global transaction monitor (GTM): manages the cluster and provides global transaction consistency
- Multiple coordinator nodes: support user connections, build query-execution plans, and interact with the GTM and the data nodes

Postgres-XL implements the same API as PostgreSQL, therefore applications don't need to treat the server in any special way. It is ACID-compliant, meaning it supports transactions and integrity constraints, and the COPY command is also supported. The main benefits of using Postgres-XL are as follows:
- It can scale to support more reading operations by adding more data nodes
- It can scale to support more writing operations by adding more coordinator nodes
- The current release of Postgres-XL (at the time of writing) is based on PostgreSQL 10, which is relatively new

The main downside of Postgres-XL is that it does not provide any high-availability features out of the box. When more servers are added to a cluster, the probability of the failure of any one of them increases, which is why you should take care of backups or implement replication of the data nodes themselves. Postgres-XL is open source, but commercial support is available.

Another solution worth mentioning is Greenplum. It's positioned as an implementation of a massively parallel-processing database, specifically designed for data warehouses. It has the following components:
- Master node: manages user connections, builds query-execution plans, and manages transactions
- Data nodes: store the data and perform queries

Greenplum also implements the PostgreSQL API, and applications can connect to a Greenplum database without any changes. It supports transactions, but support for integrity constraints is limited. The COPY command is supported. The main benefits of Greenplum are as follows:
- It can scale to support more reading operations by adding more data nodes
- It supports column-oriented table organization, which can be useful for data-warehousing solutions
- Data compression is supported
- High-availability features are supported out of the box: it's possible (and recommended) to add a secondary master that takes over if the primary master crashes, and it's also possible to add mirrors to the data nodes to prevent data loss

The drawbacks are as follows:
- It doesn't scale to support more writing operations. Everything goes through the single master node, and adding more data nodes does not make writing faster. However, it's possible to import data from files directly on the data nodes
- It uses PostgreSQL 8.4 at its core. Greenplum has a lot of improvements and new features added to the base PostgreSQL code, but it's still based on a very old release; however, the system is being actively developed
- Greenplum doesn't support foreign keys, and support for unique constraints is limited

There are commercial and open source editions of Greenplum.

Scaling for a large number of connections

Yet another use case related to scalability is when the number of database connections is great. When a single database is used in an environment with a lot of microservices, each with its own connection pool, even if they don't perform too many queries, it's possible that hundreds or even thousands of connections are opened against the database. Each connection consumes server resources, and just the requirement to handle a great number of connections can already be a problem, without even performing any queries.
If applications don't use connection pooling, and open connections only when they need to query the database and close them afterwards, another problem can occur. Establishing a database connection takes time - not too much, but when the number of operations is great, the total overhead becomes significant.

There is a tool named PgBouncer that implements connection-pooling functionality. It can accept connections from many applications as if it were a PostgreSQL server and then open a limited number of connections towards the database, reusing the same database connections for multiple application connections. The process of establishing a connection from an application to PgBouncer is much faster than connecting to a real database, because PgBouncer doesn't need to initialize a database backend process for the session. PgBouncer can create multiple connection pools that work in one of three modes:
- Session mode: A connection to a PostgreSQL server is used for the lifetime of a client connection to PgBouncer. Such a setup can be used to speed up the connection process on the application side. This is the default mode.
- Transaction mode: A connection to PostgreSQL is used for a single transaction that a client performs. This can be used to reduce the number of connections on the PostgreSQL side when only a few transactions are performed simultaneously.
- Statement mode: A database connection is used for a single statement; it is then returned to the pool and a different connection is used for the next statement. This mode is similar to transaction mode, though more aggressive. Note that multi-statement transactions are not possible when statement mode is used.

Different pools can be set up to work in different modes. It's also possible to let PgBouncer connect to multiple PostgreSQL servers, thus working as a reverse proxy. PgBouncer establishes several connections to the database. When an application connects to PgBouncer and starts a transaction, PgBouncer assigns an existing database connection to that application, forwards all SQL commands to the database, and delivers the results back. When the transaction is finished, PgBouncer dissociates the connection, but does not close it; if another application starts a transaction, the same database connection can be used. Such a setup requires configuring PgBouncer to work in transaction mode.

PostgreSQL provides several ways to implement replication that maintain a copy of the data from a database on another server or servers. This can be used as a backup or standby solution that takes over in case the main server crashes. Replication can also be used to improve the performance of a software system by making it possible to distribute the load over several database servers.

In this article, we discussed the problem of building scalable solutions based on PostgreSQL, utilizing the resources of several servers. We looked at scaling for querying and data sharding, as well as scaling for a large number of connections. If you enjoyed reading this article and want to explore other topics, be sure to check out the book Learning PostgreSQL 11 - Third Edition.

Handling backup and recovery in PostgreSQL 10 [Tutorial]
Understanding SQL Server recovery models to effectively backup and restore your database
Saving backups on cloud services with ElasticSearch plugins

Why Neo4j is the most popular graph database

Amey Varangaonkar
02 Aug 2018
7 min read
Neo4j is an open source, distributed data store used to model graph problems. It departs from the traditional nomenclature of database technologies: entities are stored in schema-less, entity-like structures called nodes, which are connected to other nodes via relationships or edges. In this article, we are going to discuss the different features and use cases of Neo4j. This article is an excerpt taken from the book Seven NoSQL Databases in a Week, written by Aaron Ploetz et al.

Neo4j's best features

Aside from its support of the property graph model, Neo4j has several other features that make it a desirable data store. Here, we will examine some of those features and discuss how they can be utilized in a successful Neo4j cluster.

Clustering

Enterprise Neo4j offers horizontal scaling through two types of clustering. The first is the typical high-availability clustering, in which several slave servers process data overseen by an elected master; in the event that one of the instances should fail, a new master is chosen. The second type of clustering is known as causal clustering. This option provides additional features, such as disposable read replicas and built-in load balancing, that help abstract the distributed nature of the clustered database from the developer. It also supports causal consistency, which aims to provide Atomicity, Consistency, Isolation, and Durability (ACID)-compliant consistency in use cases where eventual consistency becomes problematic. Essentially, causal consistency is delivered with a distributed transaction algorithm that ensures a user will be able to immediately read their own writes, regardless of which instance handles the request.

Neo4j Browser

Neo4j ships with Neo4j Browser, a web-based application that can be used for database management, operations, and the execution of Cypher queries. In addition to monitoring the instance on which it runs, Neo4j Browser also comes with a few built-in learning tools designed to help new users acclimate themselves to Neo4j and graph databases. Neo4j Browser is a huge step up from the command-line tools that dominate the NoSQL landscape.

Cache sharding

In most clustered Neo4j configurations, a single instance contains a complete copy of the data. At the moment, true sharding is not available, but Neo4j does have a feature known as cache sharding. This feature involves directing queries to instances that only have certain parts of the cache preloaded, so that read requests for extremely large data sets can be adequately served.

Help for beginners

One of the things that Neo4j does better than most NoSQL data stores is the amount of documentation and tutorials it has made available for new users. The Neo4j website provides links to get started with in-person or online training, as well as meetups and conferences for becoming acclimated to the community. The Neo4j documentation is very well done and kept up to date, complete with well-written manuals on development, operations, and data modeling. The blogs and videos by the Neo4j, Inc. engineers are also quite helpful in getting beginners started on the right path. Additionally, when first connecting to your instance/cluster with Neo4j Browser, the first thing shown is a list of links aimed at beginners. These links direct the user to information about the Neo4j product, graph modeling and use cases, and interactive examples. In fact, executing the play movies command brings up a tutorial that loads a database of movies.
This database consists of various nodes and edges designed to illustrate the relationships between actors and their roles in various films.

Neo4j's versatility demonstrated in its wide use cases

Because of Neo4j's focus on node/edge traversal, it is a good fit for use cases requiring the analysis and examination of relationships. The property graph model helps to define those relationships in meaningful ways, enabling the user to make informed decisions. Bearing that in mind, there are several use cases for Neo4j (and other graph databases) that seem to fit naturally.

Social networks

Social networks seem to be a natural fit for graph databases. Individuals have friends, attend events, check in to geographical locations, create posts, and send messages. All of these different aspects can be tracked and managed with a graph database such as Neo4j. Who can see a certain person's posts? Friends? Friends of friends? Who will be attending a certain event? How is a person connected to others attending the same event? In small numbers, these problems could be solved with a number of data stores. But what about an event with several thousand people attending, where each person has a network of 500 friends? Neo4j can help solve a multitude of problems in this domain and appropriately scale to meet increasing levels of operational complexity.

Matchmaking

Like social networks, Neo4j is also a good fit for solving the problems presented by matchmaking or dating sites. Here, a person's interests, goals, and other properties can be traversed and matched to profiles that share certain levels of equality. Additionally, the underlying model can also be applied to prevent certain matches or block specific contacts, which can be useful for this type of application.

Network management

Working with an enterprise-grade network can be quite complicated. Devices are typically broken up into different domains, sometimes have physical and logical layers, and tend to share a delicate relationship of dependencies with each other. In addition, networks might be very dynamic because of hardware failure/replacement and organization and personnel changes. The property graph model can be applied to adequately work with the complexity of such networks. In a use case study with Enterprise Management Associates (EMA), this type of problem was reported as an excellent format for capturing and modeling the interdependencies that can help to diagnose failures. For instance, if a particular device needs to be shut down for maintenance, you would need to be aware of the other devices and domains that are dependent on it, in a multitude of directions. Neo4j allows you to capture that easily and naturally without having to define a whole mess of linear relationships between each device. The path of relationships can then be easily traversed at query time to provide the necessary results.

Analytics

Many scalable data store technologies are not particularly suitable for business analysis or online analytical processing (OLAP) uses. When working with large amounts of data, coalescing the desired data can be tricky with relational database management systems (RDBMS). Some enterprises will even duplicate their RDBMS into a separate system for OLAP so as not to interfere with their online transaction processing (OLTP) workloads. Neo4j can scale to present meaningful data about relationships between different enterprise-marketing entities, which is crucial for businesses.
Recommendation engines

Many brick-and-mortar and online retailers collect data about their customers' shopping habits. However, many of them fail to properly utilize this data to their advantage. Graph databases such as Neo4j can help assemble the bigger picture of customer habits for searching and purchasing, and even take trends in geographic areas into consideration. For example, purchasing data may contain patterns indicating that certain customers tend to buy certain beverages on Friday evenings. Based on the relationships of other customers to products in that area, the engine could also suggest things such as cups, mugs, or glassware. Is the customer also a male in his thirties from a sports-obsessed area? Perhaps suggesting a mug supporting the local football team may spark an additional sale. An engine backed by Neo4j may be able to help a retailer uncover these small troves of insight.

To summarize, we saw that Neo4j is widely used across enterprises and businesses, primarily due to its speed, efficiency, and accuracy. Check out the book Seven NoSQL Databases in a Week to learn more about Neo4j and other popularly used NoSQL databases such as Redis, HBase, MongoDB, and more.

Read more
Top 5 programming languages for crunching Big Data effectively
Top 5 NoSQL Databases
Is Apache Spark today's Hadoop?


Polyglot persistence: what is it and why does it matter?

Richard Gall
21 Jul 2018
3 min read
Polyglot persistence is a way of storing data. It's an approach that acknowledges that there is often no one-size-fits-all solution to data storage. From the types of data you're trying to store to your application architecture, polyglot persistence is a hybrid approach to data management.

Think of polyglot programming: if polyglot programming is about using a variety of languages according to the context in which you're working, polyglot persistence is about applying that principle to database architecture. For example, storing transactional data in Hadoop files is possible, but makes little sense. On the other hand, processing petabytes of internet logs using a relational database management system (RDBMS) would also be ill-advised. These tools were designed to tackle specific types of tasks; even though they can be co-opted to solve other problems, the cost of adapting the tools to do so would be enormous. It is the virtual equivalent of trying to fit a square peg into a round hole.

Polyglot persistence: an example

Consider a company that sells musical instruments and accessories online (and in a network of shops). At a high level, there are a number of problems that the company needs to solve to be successful:
- Attract customers to its stores (both virtual and physical)
- Present them with relevant products (you would not try to sell a drum kit to a pianist, would you?!)
- Once they decide to buy, process the payment and organize shipping

To solve these problems, the company might choose from a number of available technologies that were designed to address them:
- Store all the products in a document-based database such as MongoDB, Cassandra, DynamoDB, or DocumentDB. There are multiple advantages of document databases: flexible schema, sharding (breaking bigger databases into a set of smaller, more manageable ones), high availability, and replication, among others.
- Model the recommendations using a graph-based database (such as Neo4j, Tinkerpop/Gremlin, or GraphFrames for Spark): such databases reflect the factual and abstract relationships between customers and their preferences. Mining such a graph is invaluable and can produce a more tailored offering for a customer.
- For searching, the company might use a search-tailored solution such as Apache Solr or ElasticSearch. Such a solution provides fast, indexed text-searching capabilities.
- Once a product is sold, the transaction normally has a well-structured schema (such as product name, price, and so on). To store such data (and later process and report on it), relational databases are best suited.

With polyglot persistence, a company always chooses the right tool for the right job instead of trying to coerce a single technology into solving all of its problems.

Read next:
How to optimize Hbase for the Cloud [Tutorial]
The trouble with Smart Contracts
Indexing, Replicating, and Sharding in MongoDB [Tutorial]


The trouble with Smart Contracts

Guest Contributor
03 Jul 2018
6 min read
The government of Tennessee now officially recognizes smart contracts. That's great news if we speak in terms of the publicity blockchain will receive. By virtue of such events, blockchain technology and all that's related to it is drawing closer to becoming a standard way of how things work. However, practice shows that the deeper you delve into the nuances of blockchain, the more you understand that we are at the very beginning of a quite long and, so far, uncertain path. Before we investigate smart contracts on the back of the Tennessee law, let's look at the concept in lay terms.

Traditional contract vs smart contract

A traditional contract is simply a notarized piece of paper that details actions that are to be performed under certain conditions. It doesn't control the fulfillment of those actions; it only attests to them. A smart contract, just like a paper contract, specifies the conditions. Along with that, since a smart contract is basically program code, it can also carry out actions (which is impossible with the paper one). Most typically, smart contracts are executed in a decentralized environment, where:
- Anyone can become a validator and verify the authenticity of correct smart contract execution and the state of the database.
- Distributed and independent validators supremely minimize third-party reliance and give confidence concerning the unchangeability of what is to be done. That's why, before putting a smart contract into action, you should carefully check it for bugs, because you won't be able to make changes once it's launched.
- All assets should be digitized, and all the data that may serve as a trigger for smart contract execution must be located within one database (system).

What are oracles?

There's a popular myth that smart contracts in Ethereum can take external data from the web and use it in their environment (for example, a smart contract that transfers money to someone who won a bet on a football match's results). You cannot do that, because a smart contract only relies on the data that's on the Ethereum blockchain. Still, there is a workaround: the database (Ethereum's, in our case) can contain so-called oracles - 'trusted' parties that collect data from the 'exterior world' and deliver it to smart contracts. For more precision, it is necessary to choose a wide range of independent oracles that provide the smart contract with information; this way, you minimize the risk of their collusion.

A smart contract itself is only a piece of code

For a better understanding, take a look at what Pavel Kravchenko, founder of Distributed Lab, has written about smart contracts in his Medium post: "A smart contract itself is a piece of code. The result of this code should be the agreement of all participants of the system regarding account balances (mutual settlements). From here indirectly it follows that a smart contract cannot manage money that hasn't been digitized. Without a payment system that provides such opportunity (for example, Bitcoin, Ethereum or central bank currency), smart contracts are absolutely helpless!"

Smart contracts under the Tennessee law

Storing data on the blockchain is now a legit thing to do in Tennessee. Here are some of the primary conditions stipulated by the law:
- Records or contracts secured through the blockchain are acknowledged as electronic records.
- Ownership rights to certain information stored on the blockchain must be protected.
- A smart contract is considered an event-driven computer program that is executed on an electronic, distributed, decentralized, shared, and replicated ledger and is used to automate transactions.
- Electronic signatures and contracts secured through blockchain technologies now have equal legal standing with traditional types of contracts and signatures.

It is worth noting that the definition of a smart contract here is pretty clear and comprehensive. But, unfortunately, it doesn't put the matter to rest, and there are some questions that were not covered:
- How can smart contracts and traditional ones have equal legal standing if the functionality of a smart contract is much broader? Namely, it performs actions, while a traditional contract only attests to them.
- How will asset digitization be carried out?
- Are there any requirements for the smart contract source code, or some normative audit to be performed, in order to minimize the risk of bugs?

The problem is not with smart contracts, but with creating the ecosystem around them. Unfortunately, it is impossible to build uniform smart-contract-based relationships in our society simply because the regulator has officially recognized the technology. For example, you won't be able to sell your apartment via smart contract functionality until there is a regulatory base that considers:
- The specific blockchain platform on which the smart contract functionality is good enough to sustain broad use.
- The way assets are digitized. And it's not only digital money transactions that you will be using smart contracts for; you can use smart contracts to store any valuable information, for example, the proprietary rights to your apartment.
- Who can be the authorized party/oracle that collects the exterior data and delivers it to the smart contract. (Speaking of apartments, this is basically the notary, who should verify such parameters as ownership of the apartment, its state, even your existence, and so on.)

So, it's true: a smart contract itself is a piece of code and, objectively, is not a problem at all. What is a problem, however, is preparing a sound basis for the successful implementation of smart contracts in our everyday life - creating and launching a mechanism that would allow the connection of two entirely different gear wheels: smart contracts in their digital, decentralized, and trustless environment, and the real world, where we mostly deal with the top-down approach and have regulators, lawyers, courts, and so on.

FAE (Fast Adaptation Engine): iOlite's tool to write Smart Contracts using machine translation
Blockchain can solve tech's trust issues – Imran Bashir
A brief history of Blockchain

About the expert, Dr. Pavel Kravchenko

Dr. Pavel Kravchenko is the founder of Distributed Lab, a blogger, cryptographer, and Ph.D. in Information Security. Pavel has been working in the blockchain industry since early 2014 (Stellar). His expertise is mostly focused on cryptography, security and technological risks, and tokenization.

About Distributed Lab

Distributed Lab is a blockchain expertise center with a core mission to develop cutting-edge enterprise tokenization solutions, laying the groundwork for the coming "Financial Internet". Distributed Lab organizes dozens of events every year for the crypto community, ranging from intensive small-format meetups and hackathons to large-scale international conferences that draw 1000+ attendees.
article-image-2018-year-of-graph-databases
Amey Varangaonkar
04 May 2018
5 min read

2018 is the year of graph databases. Here's why.

With the explosion of data, businesses are looking to innovate as they connect their operations to a whole host of different technologies. The need for consistency across all data elements is now stronger than ever. That's where graph databases come in handy: because they allow for a high level of flexibility in representing your data and handle complex interactions between different elements well, graph databases are considered by many to be the next big trend in databases. In this article, we dive deep into the current graph database scene and list the 3 top reasons why graph databases will continue to soar in popularity in 2018.

What are graph databases, anyway?

Simply put, graph databases are databases that follow the graph model. What is a graph model, then? In mathematical terms, a graph is simply a collection of nodes, with different nodes connected by edges. Each node holds information about an entity, while edges denote the connections between the nodes. How are graph databases different from relational databases, you might ask? The key difference between the two is that graph data models allow for more flexible and fine-grained relationships between data objects than relational models do. There are some more differences between the graph data model and the relational data model, which you should read through for more information.

Often, you will see that graph databases are without a schema. This allows for a very flexible data model, much like the document or key/value store database models. A unique feature of graph databases, however, is that they also support relationships between data objects, like a relational database. This is useful because it allows for a more flexible and faster database, which can be invaluable to a project that demands quick response times.

Image courtesy DB-Engines

The rise in popularity of graph database models over the last 5 years has been stunning, but not exactly surprising. If we were to drill down into the 3 key factors that have propelled the popularity of graph databases to a whole new level, what would they be? Let's find out.

Major players entering the graph database market

About a decade ago, the graph database family included just Neo4j and a couple of other less popular graph databases. More recently, however, all the major players in the industry, such as Oracle (Oracle Spatial and Graph), Microsoft (Graph Engine), SAP (SAP Hana as a graph store) and IBM (Compose for JanusGraph), have come up with graph offerings of their own. The most recent entrant to the graph database market is Amazon, with Amazon Neptune announced just last year. According to Andy Jassy, CEO of Amazon Web Services, graph databases are becoming part of the growing trend of multi-model databases. Per Jassy, these databases are finding increased adoption on the cloud as they support a myriad of useful data processing methods, and the traditional over-reliance on relational databases is slowly breaking down.

Rise of the Cypher Query Language

With graph databases slowly getting mainstream recognition and adoption, the major companies have identified the need for a standard query language for all graph databases. Similar to SQL, Cypher has emerged as a standard and a widely adopted way to write efficient, easy-to-understand graph queries. As of today, the Cypher Query Language is used in popular graph databases such as Neo4j, SAP Hana, Redis graph and so on. The openCypher project, which develops and maintains Cypher, has also released Cypher for popular big data frameworks like Apache Spark. Cypher's popularity has risen tremendously over the last few years, primarily because, like SQL, it is declarative: users state what they want from their graph data without spelling out how the database should compute it.
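To make the declarative point concrete, here is a small, hedged sketch of issuing a Cypher query from Python with the official neo4j driver. The connection URI, the credentials, and the Person/FRIEND data model are assumptions made purely for illustration, not anything from the article.

# Minimal sketch of running Cypher from Python using the official
# neo4j driver (pip install neo4j). URI, credentials and the
# Person/FRIEND schema are hypothetical.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # assumed local Neo4j instance
AUTH = ("neo4j", "password")    # assumed credentials

FRIENDS_QUERY = """
MATCH (p:Person {name: $name})-[:FRIEND]->(friend:Person)
RETURN friend.name AS friend_name
"""

driver = GraphDatabase.driver(URI, auth=AUTH)
with driver.session() as session:
    # The Cypher above is declarative: we describe the pattern we want,
    # not how the database should traverse the graph to find it.
    result = session.run(FRIENDS_QUERY, name="Alice")
    print([record["friend_name"] for record in result])
driver.close()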
Finding critical real-world applications

Graph databases were in the news as early as 2016, when the Panama Papers leaks were revealed with the help of Neo4j and Linkurious, a data visualization tool. In more recent times, graph databases have also found increased use in online recommendation engines, as well as in tasks such as fraud detection and managing social media. Facebook's search app uses graph technology to map social relationships. Graph databases are also finding applications in virtual assistants that drive conversations; eBay's virtual shopping assistant is one example. Even NASA uses a knowledge graph architecture to find critical data.

What next for graph databases?

With the growing adoption of graph databases, we expect graph-based platforms to soon become foundational elements of many corporate tech stacks. The next focus area for these databases will be practical implementations such as graph analytics and building graph-based applications. A rising number of graph databases also means more competition, and that is a good thing: competition brings more innovation and enables the incorporation of more cutting-edge features. With a healthy and steadily growing community of developers, data scientists and even business analysts, this evolution may be on the cards sooner than we might expect.

Amazon Neptune: A graph database service for your applications
When, why and how to use Graph analytics for your big data

article-image-why-oracle-losing-database-race
Aaron Lazar
06 Apr 2018
3 min read

Why Oracle is losing the Database Race

When you think of databases, the first names that come to mind are Oracle or IBM. Oracle has ruled the database world for decades and has amassed a huge number of applications that run on its databases. However, that's changing now, and if you didn't know already, you might be surprised to learn that Oracle is losing the database race.

Oracle = Goliath

Oracle was, and still is, ranked number one among databases, owing to its legacy in the database ballpark.

Source - DB Engines

The main reason Oracle has managed to hold its position is lock-in, a CIO's worst nightmare. Migrating data that has accumulated over the years is not a walk in the park and usually has top management flinching every time it's mentioned. Another reason is that Oracle is known to be aggressive when it comes to maintaining and enforcing licensing terms. You won't be surprised to find Oracle 'agents' at the doorstep of your organisation, slapping you with a big fine for non-compliance!

Oracle != Goliath for everyone

You might wonder whether even the biggies are in the same position, locked in with Oracle. Well, the Amazons and Salesforces of the world have quietly moved away from lock-in hell and now have their applications running on open source projects. In fact, Salesforce plans to be completely free of Oracle databases by 2023 and has even codenamed this project "Sayonara". I wonder what inspired the name!

Enter the "Davids" of Databases

While Oracle's databases have been declining, alternatives like SQL Server and PostgreSQL have been growing steadily. SQL Server has been doing it in leaps and bounds, with a growth rate of over 30%. Amazon's and Microsoft's cloud-based databases have seen close to 10x growth. While one might think that all cloud solutions would have dominated the database world, databases like Google Cloud SQL and IBM Cognos have seen very slow to no growth, as the question of lock-in arises again, only this time with a cloud vendor. MongoDB has been another shining star in the database race. Several large organisations like HSBC, Adobe, eBay, Forbes and MTV have adopted MongoDB as their database solution. Newer organisations have been adopting these databases instead of looking to Oracle. However, this isn't really eating into Oracle's existing market, at least not yet.

Is 18c Oracle's silver bullet?

Oracle bragged a lot about 18c last year, positioning it as a database that needs little to no human intervention thanks to its ground-breaking machine learning, operates with less than 30 minutes of downtime a year, and offers many more features. Does this make Microsoft and Amazon break into a sweat? Hell no! Although Oracle has strategically positioned 18c as a database that lowers operational cost by cutting down on the human element, it is still quite expensive compared to its competitors - they haven't dropped their price one bit. Moreover, it can't really automate "everything", and there's always a need for a human administrator - not really convincing enough. Quite naturally, customers will be drawn towards the competition.

In the end, the way I look at it, Oracle had a head start and is now inches from the elusive finish line, probably sniggering away at all the customers it has on a leash, all while cloud databases slowly catch up and get ready to leave Oracle in a heap of dirt. Reminds me of that fable mum used to read to me... what's it called... The Hare and the Tortoise.

article-image-beyond-the-bitcoin
Packt
09 Oct 2017
2 min read

Beyond the Bitcoin: How cryptocurrency can make a difference in hurricane disaster relief

More than $350 worth of cryptocurrency guides offered in support of GlobalGiving.org.

During Cybersecurity Month, Packt is partnering with Humble Bundle and three other technology publishers – Apress, John Wiley & Sons, and No Starch Press – for the Humble Book Bundle: Bitcoin & Cryptocurrency, a starter eBook library of blockchain programming guides offered for as little as $1, with each purchase supporting hurricane disaster relief efforts through the nonprofit GlobalGiving.org.

Packed with over $350 worth of valuable developer information, the bundle offers coding instruction and business insights at every level, from beginner to advanced. Readers can learn how to code with Ethereum while keeping up with the latest developments in cryptocurrency and emerging business uses of blockchain programming. As with all Humble Bundles, customers can choose how their purchase dollars are allocated between the publishers and charity, and can even "gift" a bundle purchase to others as their donation. Donations of as little as $1 USD can support hurricane relief. The online magazine retailer Zinio will also be offering a limited-time promotion of some of its best tech magazines. You can find the special cryptocurrency package here.

"It's very unusual for tech publishers who normally would compete to come together to do good work for a good cause," said Kelley Allen, Director of Books at Humble Bundle. "Humble Books is really pleased to be able to support their efforts by offering this collection of eBooks about such a timely and cutting-edge subject of Cryptocurrency".

The package of 15 eBooks includes recent titles Bitcoin for Dummies, The Bitcoin Big Bang, Blockchain Basics, Bitcoin for the Befuddled, Mastering Blockchain, and the eBook bestseller, Introducing Ethereum and Solidity. The promotional bundles are being released globally in English, and are available in PDF, .ePub and .Mobi formats. The offer runs October 9 through October 23, 2017.
article-image-mongodb-issues-you-should-pay-attention
Tess Hsu
21 Oct 2016
4 min read

MongoDB: Issues You Should Pay Attention To

MongoDB, founded in 2007 and with more than 15 million downloads, excels at supporting real-time analytics for big data applications. Rather than storing data in tables made out of individual rows, MongoDB stores it in collections made out of JSON documents. But why use MongoDB? How does it work? What issues should you pay attention to? Let's answer these questions in this post.

MongoDB, a NoSQL database

MongoDB is a NoSQL database, and NoSQL == Not Only SQL. The data structure is key-value based, like JSON. The data types are very flexible, but flexibility can be a problem if the schema isn't defined properly. Here are some good reasons to use MongoDB:

- If you are a front-end developer, MongoDB is much easier to learn than MySQL, because MongoDB's base language is JavaScript and JSON.
- MongoDB works well for big data; for instance, you can de-normalize and flatten 6 tables into just 2 tables.
- MongoDB is document-based, so it is good to use if you have a lot of documents of a single type.

So, now let's examine how MongoDB works, starting with installing MongoDB:

1. Download MongoDB from https://www.mongodb.com/download-center#community.
2. Unzip the MongoDB archive.
3. Create a folder for the database, for example, Data/mydb.
4. Open a command prompt in the MongoDB path and start the server: mongod --dbpath ../data/mydb
5. Run mongo to make sure that it works.
6. Run show dbs, and you should see two databases: admin and local.
7. If you need to shut down the server, use db.shutdownServer() from the mongo shell.

MongoDB basic usage

Now that you have MongoDB on your system, let's examine some basic usage of MongoDB, covering insertion of a document, removal of a document, and how to drop a collection from MongoDB.

To insert a document, use the insert command in the mongo shell. Here we use an employee collection as an example and insert a name, an account, and a country; the shell echoes the stored data back as JSON.

To remove a document, use db.collection.remove({ condition }, justOne), where justOne is true | false and, when true, removes only the first matching document; if you want to remove all documents in a collection, use db.employee.remove({}).

To drop a collection (containing multiple documents) from the database, use: db.collection.drop()

For more commands, please look at the MongoDB documentation (a short Python equivalent of these operations appears at the end of this article).

What to avoid

Let's examine some points that you should note when using MongoDB:

- Not easy to change to another database: MongoDB isn't like other RDBMSes, and it can be difficult to migrate, for example, from MongoDB to Couchbase.
- No support for ACID: ACID (Atomicity, Consistency, Isolation, Durability) is the basis of transactions, but most NoSQL databases don't guarantee ACID, so you need more technical effort at the application level to get those guarantees.
- No support for JOIN: since a NoSQL database is non-relational, it does not support JOIN.
- Document size limit: MongoDB stores data in JSON documents, so there is a limit on data size; the latest version supports up to 16 MB per document.
- Filter searches must get lowercase/uppercase right: for example, db.people.find({name: 'Russell'}) and db.people.find({name: 'russell'}) are different. You can filter with a regex, such as db.people.find({name: /Russell/i}), but this will affect performance.
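To mirror the shell commands above from application code, here is a minimal, hedged sketch using the pymongo driver. It assumes a MongoDB server running locally on the default port; the employee documents are made up for illustration, and note that modern pymongo spells remove() as delete_one()/delete_many().

# Minimal pymongo sketch of the shell commands discussed above
# (pip install pymongo). Assumes mongod is running on localhost:27017.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

# Insert a document (shell: db.employee.insert({...}))
db.employee.insert_one({"name": "Russell", "account": "r01", "country": "US"})

# Case matters for exact matches, but a case-insensitive regex works too,
# at some cost in performance (shell: db.people.find({name: /Russell/i})).
print(db.employee.find_one({"name": {"$regex": "^russell$", "$options": "i"}}))

# Remove one matching document (shell: db.collection.remove({...}, true))
db.employee.delete_one({"name": "Russell"})

# Remove everything, or drop the collection outright
db.employee.delete_many({})
db.employee.drop()

client.close()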
I hope this post has provided you with some important points about MongoDB which will help you decide whether your big data solution is a good fit for this NoSQL database.

About the author

Tess Hsu is a UI designer and front-end programmer. He can be found on GitHub.

article-image-8-nosql-databases-compared
Janu Verma
17 Jun 2015
5 min read

8 NoSQL Databases Compared

NoSQL, or non-relational, databases are increasingly used in big data and real-time web applications. They provide a mechanism for storing and retrieving information that is not tabular. There are many advantages to using a NoSQL database:

- Horizontal scalability
- Automatic replication (using multiple nodes)
- Loosely defined or no schema (a huge advantage, if you ask me!)
- Sharding and distribution

Recently we were discussing the possibility of changing our data storage from HDF5 files to some NoSQL system. HDF5 files are great for storage and retrieval purposes, but now, with huge data coming in, we need to scale up, and the hierarchical schema of HDF5 files is not well suited for all sorts of data we are using. I am a bioinformatician working on data science applications to genomic data. We have genomic annotation files (GFF format), genotype sequences (FASTA format), phenotype data (tables), and a lot of other data formats. We want to be able to store data in a space- and memory-efficient way, and the framework should also facilitate fast retrieval. I did some research on the NoSQL options and prepared this cheat sheet, which will be very useful for anyone thinking about moving their storage to non-relational databases. Data scientists also need to be comfortable with the basic ideas of NoSQL databases. In the course Introduction to Data Science by Prof. Bill Howe (University of Washington) on Coursera, NoSQL databases formed a significant part of the lectures. I highly recommend the lectures on these topics and this course in general. This cheat sheet should also assist aspiring data scientists in their interviews.

Some options for NoSQL databases:

Membase: This is a key-value database. It is very efficient if you only need to quickly retrieve a value according to a key. It has all of the advantages of memcached when it comes to the low cost of implementation. There is not much emphasis on scalability, but lookups are very fast. It has a JSON format with no predefined schema. The weakness of using it for important data is that it's a pure key-value store, and thus is not queryable on properties.

MongoDB: If you need to associate a more complex structure, such as a document, with a key, then MongoDB is a good option. With a single query you retrieve the whole document, which can be a huge win. However, using these documents like simple key-value stores would not be as fast or as space-efficient as Membase. Documents are the basic unit. Documents are in JSON format with no predefined schema, which makes the integration of data easier and faster.

Berkeley DB: It stores records in key-value pairs. Both key and value can be arbitrary byte strings, and can be of variable lengths. You can put native programming language data structures into the database without converting to a foreign record format first. Storage and retrieval are very simple, but the application needs to know in advance what the structure of a key and a value is; it can't ask the DB. Data access services are simple, there is no limit to the data types that can be stored, and there is no special support for binary large objects (unlike some other databases).

Berkeley DB vs MongoDB:

- Berkeley DB has no partitioning, while MongoDB supports sharding.
- MongoDB has some predefined data types like float, string, integer, double, boolean, date, and so on. Berkeley DB is a key-value store while MongoDB stores documents. Both are schema free.
- Berkeley DB has no native support for Python, for example, although there are many third-party libraries.

Redis: If you need richer structures like lists, sets, ordered sets, and hashes, then Redis is the best bet. It's very fast and provides useful data structures. It just works, but don't expect it to handle every use case. Nevertheless, it is certainly possible to use Redis as your primary data store. It is less about distributed scalability and more about optimizing high-performance lookups, at the cost of no longer supporting relational queries (see the short sketch below).
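To make the data-structure point about Redis concrete, here is a minimal sketch using the redis-py client. It assumes a Redis server on localhost:6379; the key names and values are made up purely for illustration.

# Minimal redis-py sketch of the structures mentioned above
# (pip install redis). Assumes a Redis server on localhost:6379.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# List: a simple processing queue
r.rpush("jobs", "align_genome", "call_variants")
print(r.lrange("jobs", 0, -1))

# Set: unique tags, duplicates are ignored
r.sadd("tags", "GFF", "FASTA", "GFF")
print(r.smembers("tags"))

# Sorted set: members ranked by a score
r.zadd("scores", {"sample_a": 0.91, "sample_b": 0.87})
print(r.zrange("scores", 0, -1, withscores=True))

# Hash: a small record stored under one key (mapping= needs redis-py 3.5+)
r.hset("sample:1", mapping={"organism": "maize", "coverage": "30x"})
print(r.hgetall("sample:1"))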
Cassandra: Each key has values as columns, and columns are grouped together into sets called column families. Thus each key identifies a row with a variable number of elements. A column family contains rows and columns; each row is uniquely identified by a key, and each row has multiple columns. Think of a column family as a table, with each key-value pair being a row. Unlike an RDBMS, different rows in a column family don't have to share the same set of columns, and a column may be added to one or multiple rows at any time. Cassandra is a hybrid between a key-value and a column-oriented database, has a partially defined schema, can handle large amounts of data across many servers (clusters), and is fault-tolerant and robust. For example, it was originally written by Facebook for Inbox search, where it was later replaced by HBase. A minimal Python sketch of the column-family model follows the HBase comparison below.

HBase: It is modeled after Google's Bigtable. The ideal use for HBase is in situations where you need improved flexibility, great performance, and scaling, and you have big data. The data structure is similar to Cassandra's, with column families. HBase is built on Hadoop (HDFS) and can run MapReduce without any external support. It is very efficient for storing sparse data, and big data (2 billion rows) is easy to deal with. An example use case is a scalable email/messaging system with search.

HBase vs Cassandra:

- HBase is more suitable for data warehousing and large-scale data processing and analysis (for example, indexing the web as in a search engine), while Cassandra is more apt for real-time transaction processing and serving interactive data.
- Cassandra is more write-centric and HBase is more read-centric.
- Cassandra has multi-data-center support, which can be very useful.
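Here is a small, hedged sketch of Cassandra's table/column-family idea using the DataStax cassandra-driver for Python. It assumes a single-node Cassandra instance on localhost; the CQL syntax is standard, but the keyspace, table, and data are illustrative only.

# Minimal sketch of Cassandra's column-family/table model from Python
# (pip install cassandra-driver). Assumes Cassandra on localhost:9042.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# A keyspace and a table keyed by sample_id; the schema is hypothetical.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS genomics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS genomics.samples (
        sample_id text PRIMARY KEY,
        organism  text,
        coverage  float
    )
""")

# Each row lives under its key; columns not written simply stay unset,
# which is the "rows need not share the same columns" point made above.
session.execute(
    "INSERT INTO genomics.samples (sample_id, organism, coverage) VALUES (%s, %s, %s)",
    ("sample_a", "maize", 30.0),
)
session.execute(
    "INSERT INTO genomics.samples (sample_id, organism) VALUES (%s, %s)",
    ("sample_b", "rice"),
)

for row in session.execute("SELECT sample_id, organism, coverage FROM genomics.samples"):
    print(row.sample_id, row.organism, row.coverage)

cluster.shutdown()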
Resources

NoSQL explained
Why NoSQL
Big Table

About the Author

Janu Verma is a Quantitative Researcher at the Buckler Lab, Cornell University, where he works on problems in bioinformatics and genomics. His background is in mathematics and machine learning, and he leverages tools from these areas to answer questions in biology.

article-image-top-5-nosql-databases
Akram Hussain
31 Oct 2014
4 min read

Top 5 NoSQL Databases

NoSQL has seen a sharp rise in both adoption and migration from the tried and tested relational database management systems. The open source world has accepted it with open arms, which wasn't the case with large enterprise organisations that still prefer and require ACID-compliant databases. However, as there are so many NoSQL databases, it's difficult to keep track of them all! Let's explore the most popular, and most distinctive, ones available to us:

1 - Apache Cassandra

Apache Cassandra is an open source NoSQL database. Cassandra is a distributed database management system that is massively scalable. An advantage of using Cassandra is its ability to manage large amounts of structured, semi-structured, and unstructured data. What makes Cassandra even more appealing as a database system is its ability to scale horizontally, and it's one of the few database systems that can process data in real time while delivering high performance and maintaining high availability. The mixture of a column-oriented database with a key-value store means not every row requires every column, but columns are grouped, which is what makes them look like tables. Cassandra is perfect for 'mission critical' big data projects, as it offers no single point of failure if a data node goes down.

2 - MongoDB

MongoDB is an open source, schemaless NoSQL database system; its unique appeal is that it's a 'document database' as opposed to a relational database. This basically means it's a 'data dumpster' that's free for all. The added benefit of using MongoDB is that it provides high performance, high availability, and easy scalability (auto-sharding) for large sets of unstructured data in JSON-like documents. MongoDB is the ultimate opposite of the popular MySQL. MySQL data has to be read in rows and columns, which has its own set of benefits with smaller sets of data.

3 - Neo4j

Neo4j is an open source NoSQL graph database and the frontrunner of the graph-based model. As a graph database, it manages and queries highly connected data reliably and efficiently. It allows developers to store data more naturally from domains such as social networks and recommendation engines. The data collected from sites and applications is initially stored in nodes, which are then represented as graphs.

4 - Hadoop

Hadoop is easy to overlook as a NoSQL database, but its ecosystem of tools for big data earns it a spot on this list. It is a framework for distributed data storage and processing, designed to handle huge amounts of data while limiting financial and processing-time overheads. Hadoop includes a database known as HBase, which runs on top of HDFS and is a distributed, column-oriented data store. HBase is also better known as a distributed storage system for Hadoop nodes, which are then used to run analytics using MapReduce v2, also known as YARN.

5 - OrientDB

OrientDB has been included as a wildcard! It's a very interesting database, one that has everything going for it but has always been in the shadow of Neo4j. OrientDB is an open source NoSQL hybrid graph-document database that was developed to combine the flexibility of a document database with the complexity of a graph database (Mongo and Neo4j all in one!). With the growth of complex and unstructured data (such as social media), relational databases were not able to handle the demands of storing and querying this type of data. Document databases were developed as one solution, and visualizing data through nodes was another.
OrientDB has combined both into one, which sounds awesome in theory but might be very different in practice! Whether the hybrid approach works and is widely adopted remains to be seen.