We already discussed one to one relations in MongoDB, and the main conclusion was that you should design your collections according to the most frequent access pattern. With one to many relations, this is still valid, but other factors may come into play.
Let’s look at a simple problem: we are a shop and we want to store customers’ information as well as their orders. Each customer can place several orders, so this is a one to many relation. With MySQL or any relational database system, we would create 2 tables:
CREATE TABLE customer (
  customer_id int(11) NOT NULL AUTO_INCREMENT,
  name varchar(50) NOT NULL DEFAULT '',
  zipcode varchar(10) DEFAULT NULL,
  PRIMARY KEY (customer_id)
) ENGINE=InnoDB;

CREATE TABLE orders (
  order_id int(11) NOT NULL AUTO_INCREMENT,
  customer_id int(11) NOT NULL DEFAULT '0',
  price decimal(10,2) NOT NULL DEFAULT '0.00',
  status tinyint(4) NOT NULL,
  PRIMARY KEY (order_id)
) ENGINE=InnoDB;
(Like in the previous post, I’m omitting foreign keys for clarity)
In MongoDB, we can use the same design, but since we cannot do joins, it will not always work well. For instance, if we want to know the name of the customer who placed the order with _id = 100, we need 2 queries:
> db.orders.find({_id:100},{customer_id:1,_id:0})
// Would return { "customer_id" : 123 }
and then
> db.customer.find({_id:123},{name:1,_id:0})
// Would return { "name" : "Stephane" }
While with MySQL, this is easily done in a single query:
mysql> SELECT name FROM customer INNER JOIN orders USING(customer_id) WHERE order_id = 100;
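To make the cost of the two-query approach concrete, here is a minimal sketch in plain JavaScript (Node.js), using in-memory arrays as stand-ins for the two collections. The helper name customerNameForOrder is hypothetical and no MongoDB driver is involved; the point is only that the application has to do the join itself, one lookup after the other.

```javascript
// In-memory stand-ins for the normalized collections (illustrative data only).
const orders = [
  { _id: 100, customer_id: 123, price: 100, status: 2 },
  { _id: 234, customer_id: 123, price: 55, status: 1 },
];
const customer = [
  { _id: 123, name: "Stephane", zipcode: "75000" },
];

// Hypothetical helper: an application-side join in two steps,
// mirroring the two find() calls shown above.
function customerNameForOrder(orderId) {
  // Step 1: fetch the order to learn which customer it belongs to.
  const order = orders.find((o) => o._id === orderId);
  if (!order) return null;
  // Step 2: fetch that customer to read the name.
  const owner = customer.find((c) => c._id === order.customer_id);
  return owner ? owner.name : null;
}

console.log(customerNameForOrder(100)); // prints: Stephane
```

Against a real server, each step would be a separate round trip, which is exactly what the single SQL JOIN avoids.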
A good way to solve this problem with MongoDB would be to embed orders into customers, such as:
> db.customers.findOne()
{
  "_id" : 123,
  "name" : "Stephane",
  "zipcode" : "75000",
  "orders" : [
    { "_id" : 100, "price" : 100, "status" : 2 },
    { "_id" : 234, "price" : 55, "status" : 1 },
    { "_id" : 499, "price" : 899, "status" : 1 }
  ]
}
And the query giving the name of the customer who bought the order with _id = 100 would be:
> db.customers.find({"orders._id":100},{name:1,_id:0})
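The dot notation "orders._id" matches a customer document as soon as any element of its orders array has the requested _id. Here is a rough in-memory sketch of that matching rule in plain JavaScript (Node.js). The helper findNameByOrderId and the second customer are made up for illustration; this is not how the server implements the query.

```javascript
// In-memory stand-in for the embedded design (second customer is hypothetical).
const customers = [
  {
    _id: 123, name: "Stephane", zipcode: "75000",
    orders: [
      { _id: 100, price: 100, status: 2 },
      { _id: 234, price: 55, status: 1 },
      { _id: 499, price: 899, status: 1 },
    ],
  },
  {
    _id: 124, name: "Alice", zipcode: "69000",
    orders: [{ _id: 500, price: 20, status: 2 }],
  },
];

// Hypothetical helper: returns { name } for the first customer whose
// orders array contains at least one order with the given _id.
function findNameByOrderId(orderId) {
  const doc = customers.find((c) =>
    (c.orders || []).some((o) => o._id === orderId)
  );
  // Projection {name:1,_id:0}: keep only the name field.
  return doc ? { name: doc.name } : null;
}

console.log(findNameByOrderId(100)); // prints: { name: 'Stephane' }
```

Everything needed to answer the question lives in one document, so a single lookup is enough.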
So far, so good. But here are a few questions about this design.
1. Would it still work if we needed to run queries on orders, for instance if we wanted to know the number of orders with status = 2?
Yes, this can be done with the aggregation framework with a query such as:
> db.customers.aggregate([
    {$project:{"orders.status":1}},
    {$unwind:"$orders"},
    {$match:{"orders.status":2}},
    {$group:{_id:null,total:{$sum:1}}}
  ])
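If the pipeline looks opaque, it may help to see what each stage does to the documents. The following is a sketch in plain JavaScript (Node.js) that imitates the four stages on in-memory data. The function countOrdersWithStatus and the second customer are hypothetical; the real stages run inside the server and this only approximates their effect.

```javascript
// In-memory stand-in for the customers collection (second customer is hypothetical).
const customers = [
  {
    _id: 123, name: "Stephane", zipcode: "75000",
    orders: [
      { _id: 100, price: 100, status: 2 },
      { _id: 234, price: 55, status: 1 },
      { _id: 499, price: 899, status: 1 },
    ],
  },
  {
    _id: 124, name: "Alice", zipcode: "69000",
    orders: [{ _id: 500, price: 20, status: 2 }],
  },
];

// Hypothetical helper imitating the four pipeline stages.
function countOrdersWithStatus(docs, status) {
  // $project: keep only _id and the status of each embedded order.
  const projected = docs.map((c) => ({
    _id: c._id,
    orders: c.orders.map((o) => ({ status: o.status })),
  }));
  // $unwind: emit one document per element of the orders array.
  const unwound = projected.flatMap((c) =>
    c.orders.map((o) => ({ _id: c._id, orders: o }))
  );
  // $match: keep only the orders with the requested status.
  const matched = unwound.filter((d) => d.orders.status === status);
  // $group with _id:null: a single bucket counting the matches
  // (no bucket at all when nothing matched).
  return matched.length > 0 ? [{ _id: null, total: matched.length }] : [];
}

console.log(countOrdersWithStatus(customers, 2)); // prints: [ { _id: null, total: 2 } ]
```

Note how $unwind has to expand every customer into one document per order before anything can be counted, which is why this access pattern is more expensive with the embedded design.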
Of course the query would have been much easier to write and would be more efficient if we had embedded customers into orders (in an order2 collection for instance):
> db.order2.find({status:2}).count()
So, as always, you will have to make decisions to find the design that best fits your most frequent access pattern, and you will have to accept that the other access patterns may be slow. This is very different from a normalized schema, which will be equally good for nearly every access pattern.
Also note that embedding orders into customers does not duplicate data, because each order is stored exactly once. Embedding customers into orders, however, creates a lot of duplication: if a customer has 100 orders, the customer’s details are repeated 100 times. Any change to those details must then be applied to every copy, otherwise the data becomes inconsistent, and it is up to the application code to handle this correctly.
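Here is a small sketch of what that duplication means for the application, in plain JavaScript (Node.js) with in-memory data. The shape of the order2 documents and the helpers updateZipcode and distinctZipcodes are assumptions made for illustration, since the post does not show the order2 layout.

```javascript
// Assumed layout for order2: each order carries a full copy of its customer.
const order2 = [
  { _id: 100, price: 100, status: 2, customer: { _id: 123, name: "Stephane", zipcode: "75000" } },
  { _id: 234, price: 55, status: 1, customer: { _id: 123, name: "Stephane", zipcode: "75000" } },
  { _id: 499, price: 899, status: 1, customer: { _id: 123, name: "Stephane", zipcode: "75000" } },
];

// Hypothetical helper: a zipcode change has to be written to every copy.
// Returns the number of order documents that were modified.
function updateZipcode(customerId, zipcode) {
  let touched = 0;
  for (const order of order2) {
    if (order.customer._id === customerId) {
      order.customer.zipcode = zipcode;
      touched += 1;
    }
  }
  return touched;
}

// Hypothetical consistency check: more than one value means the copies diverged.
function distinctZipcodes(customerId) {
  return new Set(
    order2.filter((o) => o.customer._id === customerId).map((o) => o.customer.zipcode)
  );
}

console.log(updateZipcode(123, "69000")); // prints: 3
console.log(distinctZipcodes(123).size);  // prints: 1
```

One logical change became three writes here. If the application missed one of the copies, distinctZipcodes would report two values for the same customer.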
2. Does embedding scale? By that I mean what happens if a customer has hundreds of thousands of orders?
This is, in my opinion, the main limitation of this design. First, a document in MongoDB is limited to 16MB, so embedding a very large number of objects into a single document may not even be possible. With customers and orders you are unlikely to hit this limit, but if you want to build a directory of people per city, creating one document per city and embedding all the people’s information would be a bad design.
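For a feeling of where the 16MB limit sits, here is a back-of-the-envelope estimate in plain JavaScript (Node.js). It uses the length of the JSON text as a crude proxy for the stored size, which is an assumption: the real BSON encoding has different per-field overhead, so treat the result as an order of magnitude only. The names approxBytes and maxOrders are mine.

```javascript
// The 16MB per-document limit mentioned above, in bytes.
const MAX_DOC_BYTES = 16 * 1024 * 1024;

// Crude size proxy: byte length of the JSON text (real BSON sizes differ).
function approxBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// A tiny order like the ones in the example, and the customer around it.
const sampleOrder = { _id: 100, price: 100, status: 2 };
const customerWithoutOrders = { _id: 123, name: "Stephane", zipcode: "75000", orders: [] };

// +1 for the comma separating array elements in the JSON text.
const perOrder = approxBytes(sampleOrder) + 1;
const overhead = approxBytes(customerWithoutOrders);

// How many such orders fit before the estimate reaches the limit.
const maxOrders = Math.floor((MAX_DOC_BYTES - overhead) / perOrder);

console.log(`about ${perOrder} bytes per order, about ${maxOrders} orders per document`);
```

With orders this small the estimate lands in the hundreds of thousands, but real orders usually carry many more fields (line items, addresses, timestamps), so the ceiling would be reached much sooner.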
Second, even if you stay below this hard limit, very large documents are bad for performance: operations on them take a long time. Your only choice is then to normalize your data, which makes your queries harder to write and less efficient.
Conclusion
In this article, we have seen several topics that you will have to keep in mind when designing one to many relations in MongoDB:
- Denormalizing by embedding objects (like embedding orders into a customer) is a common design pattern to deal with the lack of JOINs in MongoDB, and it applies well to this kind of relation.
- Depending on the way you use embedding, it may create data duplication. It is of course better if you can avoid it.
- Embedding works well when the one to many relation is actually a one to few relation. If the “many” side is large, you may have to use a normalized schema, whose main drawback is that some queries will be difficult to write and/or very slow.
Therefore do not believe that because MongoDB is schemaless, you will not have to take care of your schema design!
Do you want to learn more on MongoDB? Come to my tutorial at PLUK in November!
The post Designing one to many relations – MongoDB vs MySQL appeared first on MySQL Performance Blog.