Google AppEngine, BigTable and why RDBMS mentality is harmful
“or How do I write the correct GQL to join these tables together to give me the correct output”
Yet again people are asking on the AppEngine mailing list how to write the correct sorts of joins to join some tables together.
Looking at the design, it was clearly a great database design, a product table, a customer table, and a many to many join table with an attribute to qualify the join. Exactly the sort of tables that you were taught to write in your CompSci degree or courses, or learnt from experience over years of working with databases.
The problem is that BigTable is not a relational database. The Google team has tried as hard as possible to emphasise this, but they didn’t really provide enough examples that people really got it clear.
BigTable is not a relational database, now go write that out a hundred times, and when your done come back and finish reading.
When it comes to BigTable you want to design your database not around “What bits of data should be grouped together”, but around the principle, “What bits of data will I need to read concurrently”.
In this case, what you want to get out is
Purchase information (Order, Date), Customer information (Name, Country, …), Product information (Name, Code, …).
Therefore your table should be.
Purchase:
Customer Name
Customer Country
Product Code
Product Name
Purchase Order Number
Date Of Order
Now your query becomes
Purchase.all().filter("customer_name =",customer).filter("product_name =", product).fetch(100)
Easy huh? It’s all one big table.
Say you want to show a customer what items they ordered for a given purchase order…
Purchase.all().filter("purchase_order =",order_num).fetch(100)
Updating becomes a little harder, but is still pretty easy.
The question then comes, how do I update a product name, if I change my widgets from Acme Widget to “Acme Superwidget”?
Well firstly, really, how often do you do that? I bet it’s a number of orders of magnitude less often than you execute the queries above.
If you must, it is something like:
items = Purchase.all().filter("product_name =",product).fetch(1000)
for item in items:
item.product_name = new_name
item.put()
Yes that last part is a bit nasty, if GQL supported updates it would be nice, but writing the updates in python is hardly taxing.
I hope that helps someone
April 15th, 2008 at 1:09 pm
Best write up of the mindset I’ve seen so far. Danke. And sub’d. =)
April 16th, 2008 at 7:47 pm
As the Zen Master said, ‘When you have forgotten all you know, then you will achieve Enlightenment…’
OK, I’m doing my best to ‘empty my mind’, and here are my realizations:
1. So, there are really no ‘foreign keys’, in the sense of Customer = Customer_id(5)?
2. Then, my customer_name will be written all over the place in different tables - that’s denormalization?
3. If I want to track projects-persons, I have to make a list property with the projects ['house', 'office', 'street'], instead of a join table.
4. And I really have to rethink my database design with the business logic of queries governing all.
It’s really a return to a much more naive state of how databases were done ‘back in the day’.
Thanks!
April 16th, 2008 at 9:22 pm
Whoops, I made the assumption that everyone knows what normalisation is, or what denormalisation is, and of course not everybody does. I was going to write an overview of what normalisation is in a database, but thats actually quite complicated to do.
A simple explanation is that normalisation attempts to remove all duplication between data in tables.
Thats the thing that makes you want to put customer information in a customer table, order information in an order table and create relationships between them.
This is what we’ve done for nearly 30 years now, so it’s become second nature to us.
However, sometimes we realise that it will cost more to do the join on the two tables when we want to query it than we save by seperating the data.
We’ve recently had an example of that at my work, where we discovered that in order to select about 10 items from our database of nearly a million items, it had ot join all those million items into about 5 tables because the where clause relied on properties on multiple tables.
The solution then is to go the opposite way to normalisation, to denormalise. This means that we copy data from our end tables, often into the join table, which means that the database can be more efficient, limiting the join table first, and then performing the join.
Finally, Google AppEngine does have foreign keys, at least in the django model anyway, and they’re called ReferenceProperties.
Hope that helps
April 17th, 2008 at 12:10 am
You’re still missing a big part of the picture.
First, they need a way to do mass updates stil, since you can only fetch 1000 at a time, and doing so many updates in that manner is *really slow*. You will run out of time and Google will kill your request. In the meantime, the best you can do is some sort of AJAX page where it then triggers all the other hits that try to load subsets of products and update their names. Either way, its still really messy right now.
Second, with regards to storing a customers data. You’re missing the important bit about Entity Groups and ancestor keys. If you have a batch of customer data that you need to work with at one time, you should ensure that the entities are keyed to the same ancestor.
This allows you to:
a) ensure that all the data you want to get to for a specific customer is stored in the same portion of the distributed database.
b) do transactions that wrap the whole group of entities
c) update multiple entities in a single put() statement
There’s definitely a lot to consider with BigTable.