Amazon Web Services offers two forms of data storage. First is S3, a
key-value store allowing very large files if needed, but with a
pricing model that will cause problems for small files. Second is
SimpleDB, a schema-less or column-oriented database, which allows
storing many small pieces of data, but with limitations that prevent
it scaling to the point that S3 becomes economical. In this post I
describe a system I built to leverage SimpleDB to reduce the costs of
storing small files in S3.
Context: Biz360 Community Insights
Biz360 Community Insights is a social media
monitoring and measurement system. We consume various types of social
media -- blogs, forums, microblogs -- perform analysis on each item,
index it, store it, and present it to the user. We process tens of
millions of items every day, ranging from under 1k for a tweet with
metadata and analysis results, up to blog posts that can go above 100k. These items are indexed in Solr, but we don't store
the full text in the index for size reasons.
The Problem
The problem is where to store the complete article with metadata when
we are done with it. These average around a few kb. The initial
solution was to store them in S3, but our first month's bill was thousands of dollars. Not for the storage: it was a small amount of data. Not for the
data transfer: everything from EC2 is free. The cost was the $0.01 per
1000 PUT requests. We figured there had to be a way to bring this cost
down.
About the same time, we were having trouble with a choke point in our
application when we did a large map-reduce job to reconcile duplicate
articles which could be sped up with a persistent lookup table of some
kind. So we needed to stop storing all our items in S3, and we needed
a secondary index on at least some of our items. The first thing to
come to mind was SimpleDB. We were already in AWS, it didn't have any
operations headaches, and although the storage cost per GB was higher,
we had now seen that it the storage cost wasn't the most significant
factor.
Among the other options we considered was running our own non-relational
database with the top contenders being
CouchDB and the schema-less MySQL used
by
FriendFeed. CouchDB
didn't seem all that mature, and we had concerns about the volume of
data we could manage in MySQL and the number of machines we would
need.
SimpleDB turned out to have significant limitations. First, it limits
the size per attribute to 1024 bytes. If we wanted to actually
store our data in SimpleDB, we would have needed to do some
complicated system splitting the large items into 1kb blocks and store
them in several attributes. Secondly, it has a limit of 10GB per
domain. We expected to store orders of magnitude more than this.
The Solution
The general solution is that each S3 object stores multiple items, and
two sets of SimpleDB domains provide indexes to those items.
All of my objects have a unique item key. In the first set of domains, this
item key will be the "itemName" for SimpleDB. Some of my items have a
secondary key, a which I'll call dupeId for this article. In my secondary
SimpleDB set, this dupeId will be the itemName, and the item key will be
one of the stored attributes.
Items are distributed to domains based on their key. I pick a number
of domains that I will partition across. Here you want to estimate the
total volume of data you will be storing in SimpleDB (in my case about
200 bytes per item). Choose a number of domains so that each partition
will store no more than 1GB when you get to maximum capacity.
SimpleDB is actually limited to 10GB per domain, but I've been told by
Amazon that above 1GB performance starts to degrade. Also, you will
not be able to change the number of partitions after the fact, so this
gives you sufficient headroom when you realize that you can't actually
delete items. In my case, I'm using 30 domains for my main set of
domains, and 10 domains for my secondary key (which I don't need to
retain for as long). You can have up to 100 domains without having to
contact them. They've been happy to increase our instance limit in the
past, so if you need more, just ask.
To store data, I need to store three things:
- The full item in S3, but remember this is what I want to minimize.
- The index from the item key to the S3 location in the primary SimpleDB.
- The secondary index from the blogHash to the item key in the
secondary SimpleDB.
I will mostly ignore the third step, as it is straightforward and
there is nothing interesting about it.
Implementation
I've defined two services, DomainSet and DomainSetS3. My access to
SimpleDB is through via Typica. A
domain set is the SimpleDB portion, a domain set with S3 uses a domain
set, a buffer for each partition, and an S3 service. I use
JetS3t for S3 access.
A DomainSet contains an array of SimpleDB domains, and I index into
that array using hash of the item key modulo number of domains. In my
case, since we will also be using the system from our Rails front end,
I defined my own that I can keep consistent (though I could have just
duplicated java's hashcode in Ruby). The public interface consists
of only two methods:
public interface DomainSet {
public void store(SimpleDbItem item);
public SimpleDbItem find(String itemName);
}
SimpleDBItem is an object that contains the item name and a list of
attributes. Under the covers, this gets translated to and from the
Typica objects, but doesn't expose the complexity of the SimpleDB
API. Since I prefer composition to
inheritance,
DomainSetS3 uses this interface, but does not extend from it. This
does add a few lines of code (I have to store numDomains and
maxFailures and use both for instantiating my DomainSet), but it
avoids having a meaningless single argument
store(SimpleDbItem) method. The interface is:
public interface DomainSetS3<T> {
public void store(SimpleDbItem metadata, T contents);
public void flush();
public SimpleDbItem find(String itemName);
public T loadContents(SimpleDbItem item);
}
In this case, I've let the abstraction leak a bit -- the only way to
load the contents of an item is to find it in SimpleDB, and then do a
separate request to load it. That's what it's doing under the covers,
of course, and if I combined both of these into one method (perhaps
returning a SimpleDbItemWithData<T>), later down the line I'd
have to provide the simple find for performance reasons. When you want
to store an object, you would create a SimpleDbItem with the metadata
(the key and date are the only values we currently use), and store
it. The DomainSetS3 would add these to the buffer for whatever
partition the item key hashes to until it gets to a predetermined max
buffer size. Depending on your data and access patterns, 5-20 items
per S3 object is about right. When the buffer is full, I create a
HashMap<String, T> and serialize the HashMap into JSON. I create
a UUID and store this JSON object in S3 under that id. I then write my
metadata with the UUID as an additional attribute "s3loc" to SimpleDB.
To retrieve the data from S3, I need the metadata from SimpleDB. I
fetch the object in s3loc from S3, then deserialize it back to
Java. This places limitations on what kind of objects you can store,
but it's not too painful with Jackson
-- the only real restriction is that you know ahead of time what types
of objects you will be deserializing and that they not have any fields
which aren't concrete types. What I get back is the HashMap of String
-> T, but my key here is the same as the item key, which of course
I have.
If I need to do updates, I will follow the same procedure as above,
which results in leaving the original item in the S3 object, but
overwriting the s3loc with the location of the updated item. We're
wasting storage space, but we will always see the correct
version. Without some form of transactions or conditional writes it
would be impossible to guarantee that you can do updates with this
scheme. It's important that anyone building a system on top of AWS
understand the consistency guarantees. Again, this problem (and lots
more!) would go away if Amazon exposed the "semantic reconciliation"
step in the Dynamo system underlying S3. Of course, this step still
exists in S3, but for simplicity the semantics are hard coded to "last
write wins".
Your mileage may vary
This system doesn't solve everyone's problem. If you intend to allow
outside access to your data via public S3 buckets or Cloudfront, this
won't work. Additionally, if the data you are storing is too big to
put into SimpleDB, it may be big enough that repeated downloads will
outway the cost savings on the PUT side.
If you are doing a large number of updates, and you are storing data
for a long time, you may find that the storage costs starts to be
significant. I suppose if you were storing too much obsolete data, you
could iterate through old items, repack these into new S3 objects, and
update SimpleDB, but at this point, your SimpleDB costs will outweigh
your S3 savings.
A missed opportunity?
This was an interesting technical problem, finding the best solution
with the pieces at hand, but a voice from my econ classes is
whispering in my head "arbitrage", meaning that Amazon actually has a
perverse pricing system that charges me less money to use more
resources. If I can store items in S3 by laying another system on top
of it, Amazon would certainly be able to do the same, do so more
cheaply, and without the inconsistent interface that you need to use
if you adopt my system. Unless S3 has significantly different
replication and reliability characteristics than SimpleDB, I suspect
that they are overcharging for small PUTs, and this is keeping people
from using the service to the fullest extent.
I've heard from Tim
Robertson
that he ran into the same problem with small files and scaled back
what he was storing in S3 to save on
costs. HotPads
has also done the calculation and this was a factor in their
reluctance to recommend others adopt Cloudfront as they did.
SimpleDB: a work in progress
SimpleDB is a fantastic system, but it's not quite done yet. Amazon
labels it a beta, and it's certainly good enough to use in production,
but there are lots of desireable features that aren't yet there. As I
write this, I have a that has been running for a day just going
through and deleting old items from my SimpleDB domains -- bulk delete
or delete by query aren't supported yet.
As I learn more about other options, I'm having trouble justifying the
hassle of dealing with things like lack of bulk delete, not being able
to store large values in the table, and the size limits. HBase is
becoming more mature and is now being used to directly serve content
to the web. SimpleDB is likely slightly better value if you are
pinching pennies (for us, SimpleDB is less than 10% of our Hadoop
cluster costs), but if I were doing this again, and I were already
using Hadoop, I would prefer HBase. If you have a smaller system, and
don't want to deal with administration headaches, SimpleDB can still
be a good choice.