Without diving too deep into why, in this section we present some best practices around operations, schema design, durability, replication, sharding, and security. More information as to why and how to implement these best practices will be presented in the respective chapters and as always with best practices, these have to be taken with a pinch of salt.
MongoDB configuration and best practices
Operational best practices
MongoDB as a database is built with developers in mind and developed during the web era so does not require as much operational overhead as traditional RDBMSs. That being said, there are some best practices that need to be followed to be proactive and achieve high availability goals.
In order of importance (somewhat), here they are:
- Turn journaling on by default: Journaling uses a write ahead log to be able to recover in case a mongo server gets shut down abruptly. With MMAPv1 storage engine, journaling should be always on. With WiredTiger storage engine, journaling and checkpointing are used together to ensure data durability. In any case, it's a good practice to use journaling and fine tune the size of journals and frequency of checkpoints to avoid risk of data loss. In MMAPv1, the journal is flushed to disk every 100 ms by default. If MongoDB is waiting for the journal before acknowledging the write operation, the journal is flushed to disk every 30 ms.
- Your working set should fit in memory: Again, especially when using MMAPv1 the working set is best being less than the RAM of the underlying machine or VM. MMAPv1 uses memory mapped files from the underlying operating system which can benefit greatly if there isn't much swap happening between RAM and disk. WiredTiger on the other hand is much more efficient at using memory but still benefits greatly from the same principles. The working set is at maximum the datasize plus index size as reported by db.stats().
- Mind the location of your data files: Data files can be mounted anywhere using the --dbpath command line option. It is really important to make sure data files are stored in partitions with sufficient disk space, preferably XFS or at least Ext4.
- Keep yourself updated with versions: Odd major numbered versions are the stable ones. So, 3.2 is stable whereas 3.3 is not. In this example 3.3 is the development version that will eventually materialize into stable version 3.4 . It's a good practice to always update to the latest security updated version (3.4.3 at the time of writing) and consider updating as soon as the next stable version comes out (3.6 at this example).
- Use Mongo MMS to graphically monitor your service: MongoDB Inc's free monitoring service is a great tool to get an overview of a MongoDB cluster, notifications, and alerts and be proactive about potential issues.
- Scale up if your metrics show heavy use: Actually not really heavy usage. Key metrics of >65% in CPU, RAM, or if you are starting to notice disk swapping should be an alert to start thinking about scaling, either vertically by using bigger machines or horizontally by sharding.
- Be careful when sharding: Sharding is like a strong commitment to your shard key. If you make the wrong decision it may be really difficult operationally to go back. When designing for sharding, architects need to take a long and deep consideration of current workloads both in reads and also writes plus what the expected data access patterns are.
- Use an application driver maintained by the MongoDB team: These drivers are supported and in general get updated faster than their equivalents. If MongoDB does not support the language you are using yet, please open a ticket in MongoDB's JIRA tracking system.
- Schedule regular backups: No matter if you are using standalone servers, replica sets, or sharding, a regular backup policy should also be used as a second level guard against data loss. XFS is a great choice as a filesystem as it can perform snapshot backups.
- Manual backups should be avoided: Regular automated backups should be used when possible. If we need to resort to a manual backup then we can use a hidden member in a replica set to take the backup from. We have to make sure that we are using db.fsyncwithlock at this member to get the maximum consistency at this node, along with journaling turned on. If this volume is on AWS, we can get away with taking an EBS snapshot straight away.
- Enable database access control: Never, ever put a database in a production system without access control. Access control should both be implemented at a node level by a proper firewall that only allows access to specific application servers to the database and also in DB level by using the built-in roles, or defining custom defined ones. This has to be initialized at startup time by using the --auth command-line parameter and configured using the admin collection.
- Test your deployment using real data: MongoDB being a schema-less document oriented database means that you may have documents with varying fields. This means that it's even more important than with an RDBMS to test using data that resembles production data as closely as possible. A document with an extra field of an unexpected value can make the difference between an application working smoothly or crashing at runtime. Try to deploy a staging server using production level data or at least fake your production data in staging using an appropriate library like Faker for Ruby.
Schema design best practices
MongoDB is schema-less and you have to design your collections and indexes to accommodate for this fact:
- Index early and often: Identify common query patterns using MMS, Compass GUI, or logs and index for these early and using as many indexes as possible at the beginning of a project.
- Eliminate unnecessary indexes: A bit counter-intuitive to the preceding suggestion, monitor your database for changing query patterns and drop the indexes that aren't being used. An index will consume RAM and I/O as it needs to be stored and updated alongside with documents in the database. Using an aggregation pipeline and $indexStats a developer can identify indexes that are seldom being used and eliminate them.
- Use a compound index rather than index intersection: Querying with multiple predicates (A and B , C or D and E and so on) will most of the time work better with a single compound index than with multiple simple indexes. Also, a compound index will have its data ordered by field and we can use this to our advantage when querying. An index on fields A,B,C will be used in queries for A, (A,B), (A,B,C) but not in querying for (B,C) or (C) .
- Low selectivity indexes: Indexing a field on gender for example will statistically still return half of our documents back, whereas an index on last name will only return a handful of documents with the same last name.
- Use of regular expressions: Again, since indexes are ordered by value, searching using a regular expression with leading wildcards (that is, /.*BASE/) won't be able to use the index. Searching with trailing wildcards (that is, /DATA.*/) can be efficient as long as there are enough case sensitive characters in the expression.
- Avoid negation in queries: Indexes are indexing values, not the absence of them. Using NOT in queries can result in full table scans instead of using the index.
- Use partial indexes: If we need to index a subset of the documents in a collection, partial indexes can help us minimize the index set and improve performance. A partial index will include a condition on the filter that we use in the desired query.
- Use document validation: Use document validation to monitor for new attributes being inserted to your documents and decide what to do with them. With document validation set to warn, we can keep a log of documents that were inserted with arbitrary attributes that we didn't expect during the design phase and decide if this is a bug or a feature of our design.
- Use MongoDB Compass: MongoDB's free visualization tool is great to get a quick overview of our data and how it grows across time.
- Respect the maximum document size of 16 MB: The maximum document size for MongoDB is 16 MB. This is a fairly generous limit but it is one that should not be violated under any circumstance. Allowing documents to grow unbounded should not be an option and as efficient as it may be to embed documents, we should always keep in mind that this should be under control.
- Use the appropriate storage engine: MongoDB has introduced several new storage engines since version 3.2. The in-memory storage engine should be used for real-time workloads, whereas the encrypted storage engine should be the engine of choice when there are strict requirements around data security.
Best practices for write durability
Writing durability can be fine tuned in MongoDB and according to our application design it should be as strict as possible without affecting our performance goals.
Fine tune data flush to disk interval: In the WiredTiger storage engine, the default is to flush data to disk every 60 seconds after the last checkpoint, or after 2 GB of data has been written. This can be changed using the --wiredTigerCheckpointDelaySecs command-line option.
In MMAPv1, data files are flushed to disk every 60 seconds. This can be changed using the --syncDelay command-line option:
- With WiredTiger, use the XFS filesystem for multi-disk consistent snapshots
- Turn off atime and diratime in data volumes
- Make sure you have enough swap space, usually double your memory size
- Use a NOOP scheduler if running in virtualized environments
- Raise file descriptor limits to the tens of thousands
- Disable transparent huge pages, enable standard 4K VM pages instead
- Write safety should be at least journaled
- SSD read ahead default should be set to 16 blocks, HDD should be 32 blocks
- Turn NUMA off in BIOS
- Use RAID 10
- Synchronize time between hosts using NTP especially in sharded environments
- Only use 64-bit builds for production; 32-bit builds are outdated and can only support up to 2 GB of memory
Best practices for replication
Replica sets are MongoDB's mechanism to provide redundancy, high availability, and higher read throughput under the right conditions. Replication in MongoDB is easy to configure and light in operational terms:
- Always use replica sets: Even if your dataset is at the moment small and you don't expect it to grow exponentially, you never know when that might happen. Also, having a replica set of at least three servers helps design for redundancy, separating work loads between real time and analytics (using the secondaries) and having data redundancy built from day one.
- Use a replica set to your advantage: A replica set is not just for data replication. We can and should in most cases use the primary server for writes and preference reads from one of the secondaries to offload the primary server. This can be done by setting read preference for reads, together with the correct write concern to ensure writes propagate as needed.
- Use an odd number of replicas in a MongoDB replica set: If a server does down or loses connectivity with the rest of them (network partitioning), the rest have to vote as to which one will be elected as the primary server. If we have an odd number of replica set members, we can guarantee that each subset of servers knows if they belong to the majority or the minority of the replica set members. If we can't have an odd number of replicas, we need to have one extra host set as an arbiter with the sole purpose of voting in the election process. Even a micro instance in EC2 could serve this purpose.
Best practices for sharding
Sharding is MongoDB's solution for horizontal scaling. In Chapter 8, Storage Engines, we will cover how to use it in more detail, here are some best practices based on the underlying data architecture:
- Think about query routing: Based on different shard keys and techniques, the mongos query router may direct the query to some or all of the members of a shard. It's important to take our queries into account when designing sharding so that we don't end up with our queries hitting all of our shards.
- Use tag aware sharding: Tags can provide more fine-grained distribution of data across our shards. Using the right set of tags for each shard, we can ensure that subsets of data get stored in a specific set of shards. This can be useful for data proximity between application servers, MongoDB shards, and the users.
Best practices for security
Security is always a multi-layered approach and these few recommendations do not form an exhaustive list, rather just the bare basics that need to be done in any MongoDB database:
- HTTP status interface should be disabled.
- REST API should be disabled.
- JSON API should be disabled.
- Connect to MongoDB using SSL.
- Audit system activity.
- Use a dedicated system user to access MongoDB with appropriate system level access
- Disable server-side scripting if not needed. This will affect MapReduce, built-in db.group() commands, and $where operations. If these are not used in your codebase, it is better to disable server-side scripting at startup using the --noscripting parameter.
Best practices for AWS
When using MongoDB, we can use our own servers in a datacenter, a MongoDB hosted solution like MongoDB Atlas, or get instances from Amazon using EC2. EC2 instances are virtualized and share resources in a transparent way with collocated VMs in the same physical host. So there are some more considerations to take into account if going down that route:
- Use EBS optimized EC2 instances.
- Get EBS volumes with provisioned IOPS (I/O operations per second) for consistent performance.
- Use EBS snapshotting for backup and restore.
- Use different Availability Zones for High Availability and different regions for Disaster Recovery.
Different availability zones within each region that Amazon provides guarantee that our data will be highly available. Different regions should only be used for Disaster Recovery in case a catastrophic event ever takes out an entire region. A region can be EU-West-2 for London, whereas an availability zone is a subdivision within a region; currently two availability zones are available for London. - Deploy global, access local.
- For truly global applications with users from different time zones, we should have application servers in different regions access data that is closest to them using the right read preference configuration in each server.