Dataconomy
  • News
  • AI
  • Big Data
  • Machine Learning
  • Trends
    • Blockchain
    • Cybersecurity
    • FinTech
    • Gaming
    • Internet of Things
    • Startups
    • Whitepapers
  • Industry
    • Energy & Environment
    • Finance
    • Healthcare
    • Industrial Goods & Services
    • Marketing & Sales
    • Retail & Consumer
    • Technology & IT
    • Transportation & Logistics
  • Events
  • About
    • About Us
    • Contact
    • Imprint
    • Legal & Privacy
    • Newsletter
    • Partner With Us
    • Writers wanted
Subscribe
No Result
View All Result
Dataconomy
  • News
  • AI
  • Big Data
  • Machine Learning
  • Trends
    • Blockchain
    • Cybersecurity
    • FinTech
    • Gaming
    • Internet of Things
    • Startups
    • Whitepapers
  • Industry
    • Energy & Environment
    • Finance
    • Healthcare
    • Industrial Goods & Services
    • Marketing & Sales
    • Retail & Consumer
    • Technology & IT
    • Transportation & Logistics
  • Events
  • About
    • About Us
    • Contact
    • Imprint
    • Legal & Privacy
    • Newsletter
    • Partner With Us
    • Writers wanted
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

Distributed NoSQL: MongoDB

by Haifeng Li
October 15, 2014
in Uncategorized
Home Uncategorized
Share on FacebookShare on TwitterShare on LinkedInShare on WhatsAppShare on e-mail

We have explored several interesting distributed key-value databases including HBase and Accumulo, Riak, and Cassandra. Although key-value pairs are very flexible, it is tedious to map them to objects in applications. In this post we will learn about a popular document-oriented database MongoDB. MongoDB uses JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster. Beyond key search, MongoDB supports search by value, range queries, and text searches. Any field in a document can be indexed (by B-Trees, similar to those in RDBMS).

Table of Contents

  • Data Model
  • Storage
  • Cluster Architecture
  • Summary
    • Choosing a NoSQL

Data Model

Documents are addressed in the database via a unique key. The documents contain one or more fields, and each field contains a value of a specific data type (maybe object or array). Documents that tend to share a similar structure are organized as collections. Compared to relational databases, collections could be considered analogous to tables and documents analogous to records. However, every record in a table has the same sequence of fields, while documents in a collection may have fields that are completely different. MongoDB stores documents in a binary representation called BSON (Binary JSON). The BSON encoding extends the JSON to include additional types such as date, int, long, double, byte array, etc.

Storage

In MongoDB, each database has a namespace file holding entries which each points to (the first extent of) a collection or index in data files. Data files are broken into contiguous disk space called extents (grow exponentially up to 2GB) to hold documents or B-Tree index nodes. Data files (and journals) are aggressively pre-allocated. MongoDB stores each document (plus some extra padding bytes as growth buffer) in an extent as a contiguous block. The documents are updated in-place and the whole document will be moved to a bigger space if the update increases the size of document beyond its current allocated space. All data files are memory mapped into the virtual memory of mongod, the primary daemon process for the MongoDB system. MongoDB reads/writes to RAM directly and OS takes care of the rest. This greatly simplifies the cache and file access logic in MongoDB and also leverages OS’s LRU (least recently used) cache behavior. On the other hand, it incurs the fragmentation management and read-ahead overhead. Clearly, this also limits the data size MongoDB can handle on 32-bit systems due to inherent memory limitations.

Distributed NoSQL: MongoDB


Join the Partisia Blockchain Hackathon, design the future, gain new skills, and win!


MongoDB Data Structure

As shown in the above diagram, data and space are mainly organized through doubly-linked-list. Each collection of data is organized in a linked list of extents. Each extent points to a head/tail of another linked list of documents (up to 16MB). Data files may get fragmented over time because of updates and deletes. It especially gets worse if documents have varized sizes. Fragmentation wastes memory and disk space and also make writes scattered and slower. MongoDB provides compact command for defragmentation.

Changes in memory mapped files are flushed to disk every 60 seconds. MongoDB uses write ahead logging to an on-disk journal to guarantee write operation durability and to provide crash resiliency. The write-ahead log is committed every 100 milliseconds by default (configurable with -journalCommitInterval). So the maximum data loss is 100 ms on a hard crash. To achieve durability (i.e. data written to the on-disk journal when acked), you can use the “j” option in write concern (discussed in details later).

MongoDB uses a readers-writer lock that allows concurrent read access to a database but exclusive write access to a single write operation. Before version 2.2, this lock was implemented on a per-mongod basis. Since version 2.2, the lock has been implemented at the database level. In a properly designed schema a write will hold the lock for approximately 10 microseconds. If a slow-running operation is predicted (i.e., a document or an index entry will need to be paged in from disk), then that operation will yield the write lock.

Cluster Architecture

Although MongoDB can run as a single instance, it is often configured in a cluster environment to provide scalability and high availability. High availability is achieved in MongoDb via Replica Set, which provides data redundancy across multiple physical servers, including a single primary as well as multiple secondaries that replicate the primary’s oplog and apply the operations to their data sets. The primary accepts all write operations from clients and therefore provides strict consistency. By default, clients also read from the primary. To improve read throughput, clients can specify a read preference to send read operations to secondaries. With “nearest” read preference, the client driver periodically pings the members and will favor issuing queries to the one with lowest latency. Notice that read request is issued to only one node, there is no quorum read or read from multiple nodes. With read preferences, however, the read results may not reflect latest writes because replications are done asynchronously. MongoDB allows users to specify write availability in the system, which is called the write concern. Write concern can include the w option to specify the required number of acknowledgments from replica set before returning, the j option to require writes to the journal before returning, and wtimeout option to specify a time limit to prevent write operations from blocking indefinitely. Prior to November 2012, MongoDB’s client drivers return when the writes had only entered the client’s outgoing queue. Now the default write concern acknowledges writes received by the primary (but before writing to journal), allowing the client to catch network exceptions and duplicate key errors. 

Distributed NoSQL: MongoDB

Replic Set

Within the replica set, members are interconnected with each other to exchange heartbeat message. When a primary does not communicate with the other members of the set for more than 10 seconds, the replica set will attempt to select another member to become the new primary. The first secondary that receives a majority of the votes becomes primary. Because ofasynchronous replication, the newly elected primary doesn’t necessary having all the latest updates.

Note that a new primary may be elected even if the old one didn’t crash because of network partition or simply over-loaded primary. In these situations primaries will have accepted write operations that have not replicated to the secondaries after a failover occurs. When the former primary rejoins the replica set and attempts to continue replication as a secondary, the former primary must revert these operations to maintain database consistency across the replica set.

Although replica set provides data redundancy and potentially load balance of reads with eventually consistency, it doesn’t provide linear scalability for writes since all updates still has to go to the single primary. To load balance writes, MongoDB provides auto-sharding including range-based, hash-based, and tag-award sharding. With sharding, a collection is partitioned into chunks and have chunks distributed across multiple shards.

Distributed NoSQL: MongoDB

Sharding

A sharded MongoDB cluster consists of the shards, config servers, and routing instances. A shard is a MongoDB instance that holds a subset of a collection’s data. Each shard is usually a replica set in production although it can be a single mongod instance. Each config server is a mongod instance that holds metadata about the cluster. The metadata maps chunks to shards. Each router is a mongos instance that routes the reads and writes from applications to the shards. Applications do not access the shards directly.

For writes, the route server will forward the request to the corresponding primary server hosting the chunk whose key range covers the partition key of the document. In case of reads, the routing server will examine whether the partition key is part of the selection criteria and if so will only route the request to the corresponding shard. However, if the partition key is not part of the selection criteria, then the routing server will forward the request to every shard which will perform its local search, and the results will be gathered at the routing server and return to the client.

As chunks grow beyond the specified chunk size a mongos instance will attempt to split the chunk in half. Splits may lead to an uneven distribution of the chunks for a collection across the shards. In such cases, the mongos instances will initiate a round of migrations to redistribute chunks evenly across shards.

Summary

MongoDB is an agile database that allows schemas to change quickly as applications evolve. With careful setup, MongoDB clusters can also provide scalability and high availability.

ALSO IN THIS SERIES:


4524698671_d27be90ea5_zChoosing a NoSQL

In the first installment of his Understanding NoSQL series, guest contributor Haifeng Li gives us a crucial advice for choosing a NoSQL database. Tips include: ignore benchmarking, and look into which problem the developers were originally trying to solve.


0b16661Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.


Follow @DataconomyMedia

(Image Credit: 10gen)

Tags: Choosing a NoSQLWeekly Newsletter

Related Posts

Data sourcing is still a major stumbling block for AI

Data sourcing is still a major stumbling block for AI

August 18, 2022
AI and data analytics COVID-19

How AI and Data Analytics Will Impact The Era of COVID-19

February 17, 2022
Medical field changing thanks to AI

The Medical Field is Changing Because of Artificial Intelligence

August 19, 2021
Zeni series B funding

AI-Powered Fintech Startup Zeni Raises $34m in Series B Round

August 6, 2021
Coming up LIVE: Can we have both Privacy and Security?

Coming up LIVE: Can we have both Privacy and Security?

June 4, 2020
How GDPR is Affecting Marketing Data

How GDPR is Affecting Marketing Data

July 5, 2018

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

LATEST ARTICLES

Microsoft 365 Copilot is more than just a chatbot

The silent spreaders: How computer worms can sneak into your system undetected?

Mastering the art of storage automation for your enterprise

Can Komo AI be the alternative to Bing?

LinkedIn AI won’t take your job but will help you find one

Where does your data go: Inside the world of blockchain storage

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy
  • Partnership
  • Writers wanted

Follow Us

  • News
  • AI
  • Big Data
  • Machine Learning
  • Trends
    • Blockchain
    • Cybersecurity
    • FinTech
    • Gaming
    • Internet of Things
    • Startups
    • Whitepapers
  • Industry
    • Energy & Environment
    • Finance
    • Healthcare
    • Industrial Goods & Services
    • Marketing & Sales
    • Retail & Consumer
    • Technology & IT
    • Transportation & Logistics
  • Events
  • About
    • About Us
    • Contact
    • Imprint
    • Legal & Privacy
    • Newsletter
    • Partner With Us
    • Writers wanted
No Result
View All Result
Subscribe

This website uses cookies. By continuing to use this website you are giving consent to cookies being used. Visit our Privacy Policy.