One of the most common mistakes made when setting up storage for a big data environment is using a single solution, frequently an RDBMS, to handle all of your data storage requirements. Many tools are available, but no one tool is optimized for every task it might be asked to perform. A single solution is not necessarily the best for all of your needs; the best solution for your environment might be a combination of storage solutions that carefully balances latency against cost. An ideal storage solution uses the right tool for the right job. The choice of a data store depends on several factors:
- How structured is your data? Does it adhere to a specific, well-formed schema, as is the case with standardized data protocols and contractual interfaces? Apache web logs follow a defined line format, but logs in general are not well structured and so are not well suited to relational databases. Is it completely arbitrary binary data, as in the case of images, audio, video, and PDF documents? Or, is it semi-structured...