MongoDB configuration and best practices
In this section, we will present some of the best practices around operations, schema design, durability, replication, sharding, security, and AWS. Further information on when to implement these best practices will be presented in later chapters.
Operational best practices
MongoDB was built with developers in mind and emerged during the web era, so it does not require as much operational overhead as a traditional RDBMS. That said, there are some best practices that need to be followed to stay proactive and achieve high-availability goals.
In order of importance, the best practices are as follows:
- Mind the location of your data files: Data files can be mounted anywhere by using the `--dbpath` command-line option. It is really important to ensure that data files are stored in partitions with sufficient disk space, preferably XFS, or at the very least ext4.
- Keep yourself updated with versions: Before version 5, MongoDB used a different versioning convention, in which even minor version numbers marked the stable releases. So, 3.2 was stable, whereas 3.3 was not; 3.3 was the development series that eventually materialized into the stable 3.4 version. It is good practice to always apply the latest security update (which, at the time of writing this book, is 4.0.2) and to consider upgrading as soon as the next stable version comes out (4.2, in this example).
Version 5 has become cloud-first. The newest versions are automatically applied in MongoDB Atlas, with the ability to opt out of them, whereas all versions are available to download for evaluation and development purposes. Chapter 3, MongoDB CRUD Operations, goes into more detail about the new rapid release schedule and how it affects developers and architects.
- Use MongoDB Cloud monitoring: The free MongoDB, Inc. monitoring service is a great tool to get an overview of a MongoDB cluster, notifications, and alerts and to be proactive about potential issues.
- Scale up if your metrics show heavy use: Do not wait until it is too late. Utilization above 65% of CPU or RAM, or the first signs of disk swapping, should be the threshold to start thinking about scaling, either vertically (by using bigger machines) or horizontally (by sharding).
- Be careful when sharding: Sharding is a strong commitment to your shard key. If you make the wrong decision, it might be really difficult to go back from an operational perspective. When designing for sharding, architects need to carefully consider the current workloads (reads/writes) and the current and expected data access patterns. Live resharding, which was introduced in version 5, mitigates the risk compared to previous versions, but it is still better to spend more time upfront than to reshard after the fact. Always use the shard key in queries, or else MongoDB will have to query all shards in the cluster, negating the major advantage of sharding.
- Use an application driver maintained by the MongoDB team: These drivers are supported and tend to get updated faster than drivers with no official support. If MongoDB does not support the language that you are using yet, please open a ticket in MongoDB’s JIRA tracking system.
- Schedule regular backups: No matter whether you are using standalone servers, replica sets, or sharding, a regular backup policy should also be used as a second-level guard against data loss. XFS is a great choice as a filesystem, as it can perform snapshot backups.
- Manual backups should be avoided: When possible, regular, automated backups should be used. If we need to resort to a manual backup, we can use a hidden member of a replica set to take the backup from. We have to make sure that we run `db.fsyncLock()` (the shell helper for `fsync` with `{lock: true}`) on this member, to get maximum consistency at that node, along with having journaling turned on. If this volume is on AWS, we can then take an EBS snapshot straight away.
- Enable database access control: Never put a database into a production system without access control. Access control should be implemented at the node level, by a proper firewall that only allows specific application servers to access the database, and at the database level, by using the built-in roles or defining custom ones. Authentication has to be enabled at startup time by using the `--auth` command-line parameter and can be configured by using the `admin` database.
- Test your deployment using real data: Since MongoDB is a schema-less, document-oriented database, you might have documents with varying fields. This means that it is even more important than with an RDBMS to test using data that resembles production data as closely as possible. A document with an extra field or an unexpected value can make the difference between an application working smoothly and crashing at runtime. Try to deploy a staging server using production-level data, or at least fake your production data in staging by using an appropriate library, such as Faker for Ruby.
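Several of the operational bullets above map directly to `mongod` startup configuration. The following is a minimal illustrative sketch, not a production recommendation; the paths and port shown are assumptions:

```yaml
# Illustrative mongod.conf sketch (paths and port are assumptions)
storage:
  dbPath: /data/mongodb     # equivalent to --dbpath; place on an XFS partition
  journal:
    enabled: true           # keep journaling on, as recommended for backups
net:
  bindIp: 127.0.0.1         # restrict network exposure; widen deliberately
  port: 27017
```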
Schema design best practices
MongoDB is schema-less, and you need to design your collections and indexes to accommodate this fact:
- Index early and often: Identify common query patterns, using cloud monitoring, the GUI that MongoDB Compass offers, or logs. Analyzing the results, you should create indexes that cover the most common query patterns, using as many indexes as possible at the beginning of a project.
- Eliminate unnecessary indexes: This is a bit counter-intuitive given the preceding suggestion, but monitor your database for changing query patterns, and drop the indexes that are not being used. An index consumes RAM and I/O, as it needs to be stored and updated along with the documents in the database. Using an aggregation pipeline with the `$indexStats` stage, a developer can identify the indexes that are seldom used and eliminate them.
- Use a compound index, rather than index intersection: Most of the time, querying with multiple predicates (A and B, C or D and E, and so on) will work better with a single compound index than with multiple simple indexes. Also, a compound index has its data ordered by field, and we can use this to our advantage when querying. An index on fields A, B, and C will be used in queries for A, (A,B), and (A,B,C), but not in queries for (B,C) or (C).
- Avoid low-selectivity indexes: Indexing a field such as gender will statistically return half of our documents, whereas an index on the last name will only return a handful of documents sharing the same last name.
- Use regular expressions carefully: Again, since indexes are ordered by value, searching with a regular expression that has leading wildcards (that is, `/.*BASE/`) won't be able to use the index. Searching with trailing wildcards (that is, `/DATA.*/`) can be efficient, as long as there are enough case-sensitive characters in the expression.
- Avoid negation in queries: Indexes index values, not the absence of them. Using `NOT` in queries can result in full collection scans instead of using the index.
- Use partial indexes: If we need to index a subset of the documents in a collection, partial indexes can help us minimize the index set and improve performance. A partial index includes a condition on the filter that we use in the desired query.
- Use document validation: Use document validation to monitor for new attributes being inserted into your documents and decide what to do with them. With document validation set to warn, we can keep a log of documents that were inserted with new, never-seen-before attributes that we did not expect during the design phase and decide whether we need to update our index or not.
- Use MongoDB Compass: MongoDB’s free visualization tool is great for getting a quick overview of our data and how it grows over time.
- Respect the maximum document size of 16 MB: The maximum document size for MongoDB is 16 MB. This is a fairly generous limit, but it is one that should not be violated under any circumstances. Allowing documents to grow unbounded should not be an option and, as efficient as embedding documents might be, their growth should always be kept under control. Additionally, we should keep track of the average and maximum document sizes by using monitoring or the `$bsonSize` aggregation operator.
- Use the appropriate storage engine: MongoDB has introduced several new storage engines since version 3.2. The in-memory storage engine should be used for real-time workloads, whereas the encrypted storage engine (only available in MongoDB Enterprise Edition) should be the engine of choice when there are strict requirements around data security. Otherwise, the default WiredTiger storage engine is the best option for general-purpose workloads.
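The compound-index prefix rule described above can be sketched as a small check: a query can use an index on fields (A, B, C) only if the queried fields form a leading prefix of the index. This is a simplified illustrative model, not MongoDB's actual query planner:

```python
def can_use_index(index_fields, query_fields):
    """Return True if query_fields form a leading prefix of index_fields.

    A simplified model of MongoDB's compound-index prefix rule:
    an index on (A, B, C) serves queries on A, (A, B), and (A, B, C),
    but not on (B, C) or (C) alone.
    """
    prefix = index_fields[:len(query_fields)]
    # Predicate order within the query does not matter, so compare as sets.
    return sorted(prefix) == sorted(query_fields) and len(query_fields) > 0

index = ["A", "B", "C"]
print(can_use_index(index, ["A"]))          # True
print(can_use_index(index, ["B", "A"]))     # True  (order-insensitive prefix)
print(can_use_index(index, ["B", "C"]))     # False (not a leading prefix)
```

Running the checks against the examples in the bullet above reproduces the stated behavior.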
Having examined some schema design best practices, we will now move on to the best practices for write durability as of MongoDB version 6.
Best practices for write durability
Write durability can be fine-tuned in MongoDB, and, according to our application design, it should be as strict as possible, without affecting our performance goals.
The data flush-to-disk interval can be fine-tuned in the WiredTiger storage engine; the default is to flush data to disk every 60 seconds after the last checkpoint. This can be changed by using the `--wiredTigerCheckpointDelaySecs` command-line option.
MongoDB version 5 has changed the default settings for read and write concerns.
The default write concern is now `majority`, which means that in a replica set of three nodes (with one primary and two secondaries), the operation returns as soon as two of the nodes have acknowledged it by writing it to disk. Writes always go to the primary and are then propagated asynchronously to the secondaries. In this way, MongoDB eliminates the possibility of data rollback in the event of a node failure.
If we use arbiters in our replica set, then writes will still be acknowledged solely by the primary if the following formula resolves to true:
#arbiters > #nodes*0.5 - 1
For example, in a replica set of three nodes of which one is the arbiter and two are storing data, this formula resolves to the following:
1 > 3*0.5 - 1 ... 1 > 0.5 ... true
Note
MongoDB 6 restricts the number of arbiters to a maximum of one.
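The majority calculation and the arbiter formula above can be checked with a few lines of arithmetic. This is a sketch of the formulas as stated in the text, not of MongoDB internals:

```python
def majority(total_voting_members):
    # Acknowledgements needed for a "majority" write concern:
    # more than half of the voting members.
    return total_voting_members // 2 + 1

def primary_only_acknowledgement(nodes, arbiters):
    # The formula from the text: writes are acknowledged solely by the
    # primary when #arbiters > #nodes * 0.5 - 1.
    return arbiters > nodes * 0.5 - 1

print(majority(3))                          # 2: two of three nodes must acknowledge
print(primary_only_acknowledgement(3, 1))   # True:  1 > 0.5
print(primary_only_acknowledgement(3, 0))   # False: 0 > 0.5 does not hold
```

The three-node, one-arbiter example from the text resolves to `True`, matching the worked calculation above it.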
The default read concern is now `local` instead of `available`, which mitigates the risk of returning orphaned documents for reads on sharded collections. Orphaned documents might be returned during chunk migrations, which can be triggered either by MongoDB or, since version 5, by the user applying live resharding to a sharded collection.
Multi-document ACID transactions and the transactional guarantees that they have provided since MongoDB 4.2, coupled with the introduction of streaming replication and replicate-before-journaling behavior, have improved replication performance. Additionally, they allow for more durable and consistent default write and read concerns without affecting performance as much. The new defaults are promoting durability and consistent reads and should be carefully evaluated before changing them.
Best practices for replication
Under the right conditions, replica sets are MongoDB's mechanism for providing redundancy, high availability, and higher read throughput. In MongoDB, replication is easy to configure, and the following best practices focus on operational concerns:
- Always use replica sets: Even if your dataset is currently small, and you don’t expect it to grow exponentially, you never know when that might happen. Also, having a replica set of at least three servers helps to design for redundancy, separating the workloads between real time and analytics (using the secondary) and having data redundancy built-in from day one. Finally, there are some corner cases that you will identify earlier by using a replica set instead of a single standalone server, even for development purposes.
- Use a replica set to your advantage: A replica set is not just for data replication. We can (and, in most cases, should) use the primary server for writes and direct reads to one of the secondaries to offload the primary server. This can be done by setting read preferences for reads, along with the correct write concern, to ensure that writes propagate as needed.
- Use an odd number of replicas in a MongoDB replica set: If a server goes down or loses connectivity with the rest of them (network partitioning), the rest have to vote on which one will be elected as the primary server. If we have an odd number of replica set members, we can guarantee that each subset of servers knows whether it belongs to the majority or the minority of the replica set members. If we cannot have an odd number of replicas, we need to add one extra host set up as an arbiter, with the sole purpose of voting in the election process. Even a micro instance in EC2 could serve this purpose.
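Why an odd number of members matters follows from the failure arithmetic: a primary can only be elected while a strict majority of voting members remains reachable, so adding a fourth member buys no extra fault tolerance over three. A sketch of that arithmetic:

```python
def tolerated_failures(members):
    # A primary can be elected only if a strict majority of voting
    # members remains reachable: majority = members // 2 + 1.
    return members - (members // 2 + 1)

for n in (3, 4, 5, 6):
    print(n, "members tolerate", tolerated_failures(n), "failure(s)")
# 3 and 4 members both tolerate 1 failure; 5 and 6 both tolerate 2.
```

This is why an even-sized set should add an arbiter (restoring an odd voter count) rather than another data-bearing node purely for voting purposes.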
Best practices for sharding
Sharding is MongoDB’s solution for horizontal scaling. In Chapter 9, Monitoring, Backup, and Security, we will go over its usage in more detail, but the following list offers some best practices, based on the underlying data architecture:
- Think about query routing: Based on different shard keys and techniques, the `mongos` query router might direct a query to some (or all) of the shards in the cluster. It is important to take our queries into account when designing sharding, so that our queries don't end up hitting all of our shards.
- Use tag-aware sharding: Tags can provide a more fine-grained distribution of data across our shards. Using the right set of tags for each shard, we can ensure that subsets of data get stored in a specific set of shards. This can be useful for data proximity between application servers, MongoDB shards, and users.
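The routing concern above can be modeled simply: if a query filters on the shard key, `mongos` can target only the shards owning matching chunk ranges; otherwise it must broadcast to every shard. The following toy model illustrates this; the chunk ranges and field names are illustrative assumptions, not MongoDB internals:

```python
# Toy model of mongos query routing over ranged chunks.
# Each chunk is (low_inclusive, high_exclusive, shard_name).
chunks = [(0, 100, "shard0"), (100, 200, "shard1"), (200, 300, "shard2")]

def route(query, shard_key):
    # Queries with an equality on the shard key can be targeted to the
    # owning shard(s); anything else is broadcast to all shards.
    if shard_key in query:
        value = query[shard_key]
        return sorted({s for lo, hi, s in chunks if lo <= value < hi})
    return sorted({s for _, _, s in chunks})

print(route({"user_id": 150}, "user_id"))   # ['shard1'] - targeted query
print(route({"name": "Alex"}, "user_id"))   # all shards - scatter-gather
```

A query missing the shard key hits every shard, which is exactly the scatter-gather penalty the bullet above warns against.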
Best practices for security
Security is always a multi-layered approach, and the following recommendations do not form an exhaustive list; they are just the bare basics that need to be done in any MongoDB database:
- Always turn authentication on. Over the years, there have been multiple incidents where open MongoDB servers were hacked for fun or profit, for example, by being copied and deleted to extort admins into paying a ransom. It is good practice to set up authentication even in non-production environments, to decrease the possibility of human error.
- The HTTP status interface should be disabled.
- The RESTful API should be disabled.
- The JSON API should be disabled.
- Connect to MongoDB using SSL.
- Audit the system activity.
- Use a dedicated system user to access MongoDB with appropriate system-level access.
- Disable server-side scripting if it is not needed. This will affect MapReduce, the built-in `db.group()` command, and `$where` operations. If they are not used in your code base, it is better to disable server-side scripting at startup by using the `--noscripting` parameter or by setting `security.javascriptEnabled` to `false` in the configuration file.
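Several of these recommendations map to startup configuration. The following is a hedged illustrative sketch; option names follow the points above, while the file paths are assumptions, and auditing requires MongoDB Enterprise:

```yaml
# Illustrative security-related mongod.conf fragment (paths are assumptions)
security:
  authorization: enabled      # always turn authentication on
  javascriptEnabled: false    # disable server-side scripting ($where, MapReduce)
net:
  tls:
    mode: requireTLS          # connect to MongoDB using SSL/TLS
    certificateKeyFile: /etc/ssl/mongodb.pem   # illustrative path
auditLog:
  destination: file           # audit the system activity (Enterprise feature)
  path: /var/log/mongodb/audit.json
  format: JSON
```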
After examining the best practices for security in general, we will dive into the best practices for AWS deployments.
Best practices for AWS
When using MongoDB, we can run our own servers in a data center, use a MongoDB-hosted solution such as MongoDB Atlas, or rent instances from Amazon by using EC2. EC2 instances are virtualized and transparently share resources with co-located VMs on the same physical host. So, there are some more considerations to take into account if you wish to go down that route, as follows:
- Use EBS-optimized EC2 instances.
- Get EBS volumes with provisioned I/O operations per second (IOPS) for consistent performance.
- Use EBS snapshotting for backup and restore.
- Use different availability zones for high availability and different regions for disaster recovery. Using different availability zones within each region that Amazon provides guarantees that our data will be highly available. Different regions should mostly be used for disaster recovery in case a catastrophic event ever takes out an entire region. A region might be EU-West-2 (for London), whereas an availability zone is a subdivision within a region; currently, three availability zones are available in the London region.
- Deploy globally, access locally: For truly global applications with users in different time zones, we should have application servers in different regions access the data that is closest to them, using the right read preference configuration in each server.