I realise that I’ve been neglecting my blog for a while now, with my last post published at the end of March. Quite frankly, I’ve not had anything interesting to write about. However, with my career moving on to pastures new, I’m starting to have interesting things to write about again. Over the course of time, I’ll probably post a lot of stuff about Amazon Web Services (AWS) and Zend Framework, amongst other things. Today’s babble, though, is about implementing AWS as your core hosting infrastructure, and the benefits and downsides of doing so. I’ll also post my findings about the best way (in my opinion) to implement certain requirements.
Firstly, let me start off by giving you an overview of the current network. We’re starting with perhaps a dozen or so servers in 4-5 “clusters” in an off-site datacenter, hosting 2000+ websites. In each cluster are 2 web servers, sat behind a load balancer, and a database server (MySQL). After months spent reviewing Amazon’s AWS services along with the running costs for each, it was decided to commence hosting on an EC2 instance, slowly migrate all the sites off the existing server infrastructure onto it, and use Amazon RDS as the MySQL database server. (RDS, I think, is fantastic, in that you can snapshot the entire database without taking it down, and it can auto-scale as traffic increases. You also don’t have to worry about setting up MySQL replication, duplication, master-master replication or load balancing. It’s all done for you.)
The EC2 instance has a number of configuration options already set. For example, Apache is configured as a “back-end” server to process requests for dynamic content, PHP, and the like. Nginx is the “front-end”, receiving requests on port 80. Nginx will satisfy all requests for static content (images, css, etc), and proxy all requests for dynamic content (or anything not matching a static rule) to Apache internally using a localhost URL. I have implemented a configuration such as this a few times with a general degree of success. What you have to be careful about here is writing your Nginx location rules correctly, so that you’re not simply proxying everything to your back-end server. That simply creates a bottleneck, as opposed to helping ease the load.
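To make the static/dynamic split concrete, here is a minimal Python sketch of the decision those Nginx location rules encode. It is not Nginx configuration; the extension list and the handler names `nginx-static` and `apache-backend` are assumptions made up for illustration.

```python
import posixpath

# Hypothetical list of extensions treated as static content.
STATIC_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".ico", ".css", ".js"}


def choose_handler(request_uri: str) -> str:
    """Return which server should answer the given request URI."""
    path = request_uri.split("?", 1)[0]  # ignore any query string
    extension = posixpath.splitext(path)[1].lower()
    if extension in STATIC_EXTENSIONS:
        return "nginx-static"  # served straight from disk by the front-end
    return "apache-backend"  # everything else is proxied to the back-end


print(choose_handler("/images/logo.png"))   # nginx-static
print(choose_handler("/index.php?page=2"))  # apache-backend
```

If the static list is too narrow, everything falls through to the second branch, which is exactly the "proxying everything" bottleneck described above.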
The Apache configuration is pretty straightforward. It’s configured as most Apache servers are, and has nothing special to it. The Nginx configuration is where things start to get a bit more advanced. Here, we’re using symlinks for “user generated content” onto a file share mounted using s3fs. Static content (images, icons, css, etc) is served directly via Nginx. A _files directory is created, and for the most part, its content is symlinked to the s3fs mount point. This gives us practically unlimited storage for client websites, where users can upload content to their hearts’ content without ever fearing a server running out of disk space anywhere. Nginx is then configured to proxy any requests for _files/* to the Amazon S3 bucket URL (http://s3.amazonaws.com/bucket/siteXYZ/_files/*).
There are a couple of problems with this implementation, however. Firstly, Nginx is proxying the request to S3 itself. Simply put, Nginx receives the request from the user for, let’s say, http://site1.com/_files/user_content1.png. It intercepts that request and, because the rules match _files, initiates a proxy call to http://s3.amazonaws.com/mybucket/site1.com/_files/user_content1.png. Once it has received the file from S3, it sends it on to the user. Bottleneck? Yep. Helpful? No. Imagine it this way:
[Request for _files/user_content1.png] -> [Nginx rules proxy to S3] -> [Fetch site1.com/_files/user_content1.png from S3 bucket] -> [Return data to user]
The 2 middle steps are, essentially, redundant. Obviously, in the interests of keeping the domain name for the request the same, this is perhaps the only way to do it. However, if your site suddenly becomes popular (Slashdotted, Dugg, et al.), Nginx has a real problem on its hands. Offloading all your content to S3 will actually make your scenario so much worse. 1000 simultaneous requests for user generated content will create 1000 incoming requests, 1000 outgoing requests to S3 (while the 1000 incoming requests are being “held”, waiting for a response), 1000 subsequent responses with data from S3, then 1000 responses to clients with the data just received from S3. That’s not including the additional requests for resources Apache has to satisfy!
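As a rough worked example of that arithmetic, the following Python sketch tallies the messages Nginx has to handle when it proxies every user generated content request to S3. The function and key names are made up for illustration, and it assumes exactly one S3 fetch per client request (no caching, no retries).

```python
def proxied_message_count(simultaneous_requests: int) -> dict:
    """Count the messages Nginx handles when it proxies every request to S3."""
    n = simultaneous_requests
    return {
        "requests_in_from_clients": n,
        "requests_out_to_s3": n,
        "responses_in_from_s3": n,
        "responses_out_to_clients": n,
        "total_handled_by_nginx": 4 * n,
    }


counts = proxied_message_count(1000)
print(counts["total_handled_by_nginx"])  # 4000
```

Every file also crosses the server twice, once on the way in from S3 and once on the way out to the client.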
So, what’s the solution? Well, you have a variety of options (non-exhaustive!).
- Change Nginx’s user-generated-content-retrieval rule to send a 301 redirect response to the client with the S3 URL. This pushes the onus of fetching the resource back to the client. Not ideal, as the client will end up making 2 requests for every user generated content file that is required on the page.
- Change your website’s code to vend S3 URLs in place of URLs linking to your domain for the user generated content. In this case, only 1 request is made by the client, and it goes straight to S3, as that’s what it’s been told to do. This should speed up page response time, at the very least.
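The second option can be sketched in Python as a small URL rewriting helper. This is a minimal sketch, assuming the bucket layout from the example above (`bucket/site/_files/...`); the function name `vend_s3_url` and the constant names are hypothetical, and query strings are dropped for simplicity.

```python
from urllib.parse import urlsplit

S3_ENDPOINT = "http://s3.amazonaws.com"  # path-style endpoint, as used in this post
USER_CONTENT_PREFIX = "/_files/"


def vend_s3_url(site_url: str, bucket: str) -> str:
    """Rewrite a site URL for user generated content into its S3 URL."""
    parts = urlsplit(site_url)
    if not parts.path.startswith(USER_CONTENT_PREFIX):
        return site_url  # not user generated content, leave it untouched
    # Assumed bucket layout: <bucket>/<site hostname>/_files/<file>
    return f"{S3_ENDPOINT}/{bucket}/{parts.hostname}{parts.path}"


print(vend_s3_url("http://site1.com/_files/user_content1.png", "mybucket"))
# http://s3.amazonaws.com/mybucket/site1.com/_files/user_content1.png
```

The same mapping could also produce the target for the redirect in the first option; the difference is only whether the client learns the S3 URL from the page or from an extra round trip.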
The second problem is the diabolically slow speed of S3FS when it comes to using rsync. Rsync, as the name implies, wants to compare the local file with the remote file. To do this, it must at least stat every file, and in some cases open and read it. However, because your storage is on S3, every one of those operations turns into a request to S3, and reading a file means downloading it first! This adds in the order of hours to an rsync operation that might have only taken 30 minutes under normal circumstances (i.e. with a filesystem that’s local to the rsync target, a hard disk, etc). You could use another tool, such as s3cmd’s sync function. This runs much, much quicker, as it uses the ETag in the response from S3, which is basically an MD5 hash of the file currently stored. If the hash in the response doesn’t match the calculated hash of the file on the local filesystem (or the file doesn’t exist), it will upload it. However, any other file copy tool lacks compatibility with S3FS. If you upload files to S3 with any other tool, running an `ls -al` on the directory will show files with no permissions whatsoever. This is down to the way S3FS stores and keeps track of permissions on files. There is *no* way around this (that I have found, so far). So, how do we get around all of this?
- You could use CloudFront’s “Custom Origin” feature to request content from your main web server, and get Nginx to cleverly tell CloudFront to cache the file for a decent amount of time. This reduces the burden of vending static files from your web server almost completely. Almost, because CloudFront has to fetch the file the first time it’s requested, as it has to know what to respond with and cache! There’s a problem with this though, which I will explain a bit later on.
- Make your own CDN. This is an option I’m currently exploring. It basically involves an EC2 instance running Nginx exclusively, with a very large EBS volume attached. If you rsync to this EC2 instance’s EBS volume as opposed to an S3 mount, rsync will run quickly and as expected, because it’s a real filesystem. You could then configure a CloudFront distribution to have a Custom Origin of your CDN server. Configuration of Nginx would be generally straightforward: respond to requests on port 80 with, say, /ebs/cdn as the document root. You can NFS mount this EBS volume over the EC2 network between instances. Your main website server can manage the file repositories and their existence, and offload static content requests to your EC2 CDN server. This will allow you to resolve the problem in point 1 that I mentioned.
- Something else I haven’t thought of at this point.
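Going back to the ETag comparison that makes s3cmd’s sync so much quicker than rsync over S3FS, the idea can be sketched in Python as follows. This is not s3cmd’s actual code: the function names are made up, the remote ETag is assumed to have been fetched elsewhere and passed in, and the ETag is assumed to be a plain MD5 of the object (which is not the case for multipart uploads).

```python
import hashlib
import os
import tempfile
from typing import Optional


def local_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path: str, remote_etag: Optional[str]) -> bool:
    """Decide whether a local file has to be uploaded.

    remote_etag is the ETag reported by S3, or None if the object is missing.
    """
    if remote_etag is None:
        return True  # nothing stored remotely yet
    # S3 wraps the ETag in double quotes; strip them before comparing.
    return local_md5(path) != remote_etag.strip('"').lower()


# Small demonstration with a temporary file standing in for site content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"user generated content")
matching_etag = '"' + hashlib.md5(b"user generated content").hexdigest() + '"'
print(needs_upload(tmp.name, matching_etag))  # False
print(needs_upload(tmp.name, None))           # True
os.remove(tmp.name)
```

The point is that only the local file is ever read; the remote side is answered from metadata, so nothing has to be downloaded just to compare it.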
The biggest caveat to using CloudFront is the fact that it only (currently) issues HTTP 1.0 requests. The biggest *but* about this is that HTTP 1.0 doesn’t require the “Host:” HTTP header. Every web server doing name-based virtual hosting relies on the presence of this header to know which website you’re requesting. It’s how, as the internet grows larger and the pool of IPs gets smaller, we can host tens, hundreds or even thousands of websites on a single server behind a single IP. Point 1 in the list above mentions a problem with using a CloudFront distribution directly on your main web server. Your web server will rely on that “Host:” header to know which site to look up. CloudFront doesn’t send this (even though it can receive it perfectly well, it doesn’t pass it through). So you have no idea which site to look for on your web server when CloudFront requests a file. You’ll get a similar issue with point 2. With point 2, though, you can lay out your directory structure as, for example, /ebs/cdn/www.site1.com/image.png. Then, you can request http://s123456.cloudfront.net/www.site1.com/image.png. As the site’s hostname is simply a directory name now, it will work just as if the directory were called anything else. This gives you some kind of distinction between which website the user was after. Elegant? No. Would it work? Sure.
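The directory layout trick can be sketched in Python like this. It is only an illustration of the mapping: `/ebs/cdn` and `s123456.cloudfront.net` are the example values from above, and the function names are hypothetical. In practice Nginx’s document root does the path lookup for you.

```python
import posixpath

CDN_ROOT = "/ebs/cdn"                # example document root from this post
CDN_HOST = "s123456.cloudfront.net"  # example CloudFront hostname from this post


def cdn_url(site_host: str, file_path: str) -> str:
    """Build the CloudFront URL, using the site hostname as the first directory."""
    return f"http://{CDN_HOST}/{site_host}/{file_path.lstrip('/')}"


def cdn_disk_path(url_path: str) -> str:
    """Map a CDN request path onto the EBS volume, refusing path traversal."""
    candidate = posixpath.normpath(posixpath.join(CDN_ROOT, url_path.lstrip("/")))
    if not candidate.startswith(CDN_ROOT + "/"):
        raise ValueError(f"path escapes the CDN root: {url_path!r}")
    return candidate


print(cdn_url("www.site1.com", "image.png"))
# http://s123456.cloudfront.net/www.site1.com/image.png
print(cdn_disk_path("/www.site1.com/image.png"))
# /ebs/cdn/www.site1.com/image.png
```

Because the site is identified by the path rather than by the “Host:” header, the origin server never needs to know which hostname the visitor originally asked for.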
There are other issues that I’ll go into in future posts, but for now, this is something to get those brain cells into gear and thinking about the best way to configure your CDN. There’s no right or wrong solution to this. It’s entirely dependent on your environment, what you ultimately want to achieve, and how you want to go about achieving it.