Archive for the ‘datacenter’ Category

HDD IOPS limiting factor – seek or rpm?

Monday, October 8th, 2007

Any storage experts out there? Can you forward this to any you may know?

An interesting thread developed in the comments on my post about Dell’s MD3000 storage array regarding theoretical maximum random IOPS to a single HDD. I’m hoping by bringing it up to the blog level, we can get some smart people who know what they’re talking about (i.e., not me) to weigh in.

I’ve always believed that for a small random write workload, the revolutions per minute (rpm) of the drive was the biggest limiting factor. I think I’ve believed this for a few reasons:

  • It seems logical that the biggest “time waster” in seek time is probably rpm anyway. Even if the drive arm has found the right position on the platter, it likely has to wait some amount of time, up to a full revolution, before it can write.
  • rpm is a “fixed” number, and thus easier to calculate with than seek time, which is more variable. So taking the easy way out, one of my favorite hobbies, seemed appropriate.

Using this theory (one write per revolution, so rpm / 60), a 7200rpm drive can do a theoretical maximum of 120 IOPS, and a 15K drive can do 250. Note that these are fully-flushed non-cached writes to the spinning metal, with no buffering or write combining. Over the years, my own tests seem to have validated this theory, and so I’ve just always believed it to be gospel.

Tao Shen, though, commented that my assumption is wrong, and that seek time is the limiting factor that matters, not rpm, and that faster drives can deliver more IOPS than my rpm math. He posits that a 15K drive with a 2ms seek time can do 500 IOPS. Now, he may have access to better drives than I do, since I think our fastest are 3.5ms (best case scenario), not 2ms. That’s what the latest-and-greatest Seagate Cheetah 15K.6 drives seem to do, too.

So which is it? Am I totally smoking crack? Is he? Or is the truth that seek time and rpm are so intimately tied together that separating them is impossible?

How does one calculate theoretical maximum IOPS?
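
For anyone who wants to plug in their own drive specs, here is a small Python sketch of the two positions from the thread, plus a third model that is not from the thread at all: a common rule of thumb that charges each operation one average seek plus half a revolution of rotational wait. The function names are made up for illustration, and none of these models account for command queuing, caching, or transfer time.

```python
def iops_rpm_only(rpm):
    """The post's model: one write per full revolution, so IOPS = rpm / 60."""
    return rpm / 60.0


def iops_seek_only(seek_ms):
    """Tao Shen's model as described in the post: one operation per seek."""
    return 1000.0 / seek_ms


def iops_seek_plus_rotation(rpm, seek_ms):
    """Rule of thumb (an assumption, not from the thread): each operation
    costs one average seek plus, on average, half a revolution of waiting."""
    half_revolution_ms = 0.5 * 60000.0 / rpm
    return 1000.0 / (seek_ms + half_revolution_ms)


if __name__ == "__main__":
    print(iops_rpm_only(7200))                          # 120.0
    print(iops_rpm_only(15000))                         # 250.0
    print(iops_seek_only(2.0))                          # 500.0
    print(round(iops_seek_plus_rotation(15000, 3.5)))   # 182
```

With the 3.5ms seek time quoted above, the combined model lands below both the rpm-only and the seek-only figures for a 15K drive, which is at least a hint that the two factors can’t be treated separately.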

Dell MD3000 – Great DAS DB Storage

Monday, October 1st, 2007

So I’ve written about storage before, specifically our quest for The Perfect DB Storage Array and how Sun’s storage didn’t stack up with their excellent servers. As you can probably tell, I spend a lot of my time thinking about and investigating storage – both small-and-fast for our DBs and huge-and-slower (like S3) for our photos.

I believe we’ve finally found our best bang-for-the-buck storage arrays: Dell MD3000. Here’s a quick rundown of why we like them so much, how to configure yours to do the same, and where we’re headed next:

  • The price is right. I have no idea why these companies (everyone does it) continue to show expensive prices on their websites and then quote you much much cheaper prices, but Dell is no exception. Get a quote, you’ll be shocked at how affordable they really are.
  • DAS via SAS. If you’re scaling out, rather than up, DAS makes the most sense and SAS is the fastest, cheapest interconnect.
  • 15 spindles at 15K rpm each. Yum. Both fast and odd. Why odd? Because you can make a 14 drive RAID 1+0 and have a nice hot spare standing by.
  • 512MB of mirrored battery-backed write cache. Use write-back mode to have nice fast writes that survive almost all failure scenarios.
  • You can disable read caching. This is a big one. Given we have relatively massive amounts of RAM (32GB on server vs 512MB on controller) *and* that the DB is intelligent at reading and pre-fetching precisely the stuff it wants, read caching is basically useless. Not only that, but it harms performance by getting in the way of writes – we want super-fast non-blocking writes. That’s the whole point.
  • You can disable read-ahead prefetching. Again, our DB does its own pre-fetching already, so why would we want the controller trying to second guess our software? We don’t.
  • The stripe sizes are configurable up to 512KB. This is important because if you’re going to read, say, a 16KB page for a DB, you want to involve only a single disk as often as you can. The bigger the stripes, the better the odds are of only using a single disk for each read.
  • The controller ignores host-based flush commands by default. Thank goodness. The whole point of a battery-backed write-back cache is to get really fast writes, so ignoring those commands from the host is key.
  • They support an ‘Enhanced JBOD’ mode whereby you can get access to the “raw” disks as their own LUNs (in this case, 15), but writes still flow through the write-cache. Why is this cool? Because you can move to 100% server-controlled software storage systems, whether they’re RAID or LVM or whatever. More on this below…
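
To put rough numbers on the stripe size point, here is a Python sketch. It assumes something the list above does not state: that each read starts at a uniformly random 512-byte sector within a segment. Under that assumption it computes the chance that a read spills onto a second disk. If your DB pages are aligned to segment boundaries, a read never splits and this model overstates the problem. The name split_probability is made up for illustration.

```python
from fractions import Fraction

SECTOR_BYTES = 512  # assumed addressing granularity


def split_probability(read_bytes, segment_bytes, sector=SECTOR_BYTES):
    """Chance that a read starting at a uniformly random sector inside a
    segment crosses into the next segment (and so touches a second disk)."""
    read_sectors = read_bytes // sector
    segment_sectors = segment_bytes // sector
    if read_sectors > segment_sectors:
        return Fraction(1)  # the read is longer than a segment; it always splits
    # Of the segment_sectors possible start positions, the last
    # (read_sectors - 1) of them run past the end of the segment.
    return Fraction(read_sectors - 1, segment_sectors)


if __name__ == "__main__":
    for segment_kb in (16, 64, 128, 512):
        p = split_probability(16 * 1024, segment_kb * 1024)
        print("%4d KB segment: %.1f%% of 16 KB reads split" % (segment_kb, float(p) * 100))
```

Under these assumptions a 16KB read splits about 97% of the time with 16KB segments and about 3% of the time with 512KB segments.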

Ok, sounds good, you’re thinking, but how do I get at all these goodies? Unfortunately, you have to use a lame command-line client to handle most of this stuff and it’s a PITA. However, you asked, so here you go (commands can be combined):

  • disable read cache: set virtualDisk["1"] readCacheEnabled=FALSE
  • disable read pre-fetching: set virtualDisk["1"] cacheReadPrefetch=FALSE
  • change stripe size: read the docs for how to do this on new virtualDisks, but to do online changing of existing ones – set virtualDisk["1"] segmentSize=512
  • Enhanced JBOD: Just create 15 RAID 0 virtual disks! :)
  • BONUS! modify write cache flush timings: set virtualDisk["1"] cacheFlushModifier=60 – This is an undocumented command that changes the cache flush timing to 60 seconds from the default of 10 seconds. You can also use words like ‘Infinite’ if you’d like. I haven’t played with this much, but 10 seconds seems awfully short, so we will.
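
If you go the 15-virtual-disk route, typing these per disk gets old fast. Here is a minimal Python sketch that only builds the command strings listed above, one per setting per virtual disk. The helper name and the assumption that your virtual disks are named "1" through "15" are mine; how you feed the commands to the array’s command-line client is deliberately left out, so check the docs for that part.

```python
# Settings taken from the list above; values are passed through as plain text.
SETTINGS = {
    "readCacheEnabled": "FALSE",
    "cacheReadPrefetch": "FALSE",
    "segmentSize": "512",
}


def build_commands(disk_names, settings=SETTINGS):
    """Return one 'set virtualDisk' command per (disk, setting) pair."""
    commands = []
    for name in disk_names:
        for key, value in settings.items():
            commands.append('set virtualDisk["%s"] %s=%s' % (name, key, value))
    return commands


if __name__ == "__main__":
    # Hypothetical naming: virtual disks "1" through "15".
    for command in build_commands([str(n) for n in range(1, 16)]):
        print(command)
```

Review the output before running any of it against a live array.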

Wishlist? Of course I have a wishlist. Don’t I always? :)

  • This stuff should be exposed in the GUI. Especially the stripe size setting should be easily selectable when you’re first setting up your disks. It’s just dumb that it’s not.
  • Better documentation. After a handy-dandy Google search, it appears as if the Dell MD3000 is a rebranded LSI/Engenio array, which lots of other companies also appear to have rebranded, like the IBM DS4000. But the Engenio docs are more thorough, which is how I found the cacheFlushModifier setting. (On a side note, why do these companies hide who’s building their arrays? They don’t hide that Intel makes the CPUs… Personally, I’d rather know)
  • Faster communication. I asked Dell quite a while ago for information on settings like these and I had to wait awhile for a response. I imagine this might be related to the Engenio connection – Dell may have just not known the answers and had to ask.
  • Bigger stripe sizes. I’d love to benchmark 1MB or bigger stripes with our workload.
  • Better command-line interface. Come on, can’t we just SSH into the box and type in our commands already?

Ok, so where are we going next?

  • ZFS. I believe the ‘Enhanced JBOD’ mode (15 x RAID-0) would be perfect for ZFS, in a variety of modes (striped + mirrored, RAID-Z, etc). So we’re gonna get with Sun and do an apples-to-apples comparison and see what shakes out. Our plan is to take two Sun X2200 M2 servers, hook them up to a Dell MD3000 apiece, run LVM/software RAID on one and ZFS on the other, then put them under a live workload and see which is faster. My hope is that ZFS will win or be close enough that it doesn’t matter. Why? Because I love ZFS’s data integrity and I believe COW will let us more easily add spindles and see a near-linear speed increase.
  • Flash. We’ve been playing around with the idea of flash storage (SSD) on our own for a while, and have been talking to a number of vendors about their approaches. It’s looking like the best bet may be to move from a two-tier storage system (system RAM + RAID disks) to a three-tier system (system RAM + flash storage + RAID disks) to dramatically improve I/O. If we come across anything that works in practice, rather than theory, I’ll definitely let you know.
  • MySQL. We’ve now got boxes which appear to not be CPU-bound *or* I/O bound but are instead bounded by something in software on the boxes, either in MySQL or Linux. Tracking this down is going to be a pain, especially since it’s out of my depth, but we’ve gotta get there. If anyone has any insight or ideas on where to start looking, I’m all ears. We have MySQL Enterprise Platinum licenses so I can probably get MySQL involved fairly easily – I just haven’t had time to start investigating yet.

You might also want to check out this review of the MD3000; he’s gone more in-depth on some of the details than I have.

Finally, I’m hoping other storage vendors perk up and pay attention to the real message here: Let us configure our storage. Provide lots of options, because ‘one size fits all’ is the exception, not the rule.

Sun’s announcement today that they’re unifying Storage and Servers under Systems is a good move, I think, but they’ve still got work to do. I believe (and everyone at Sun has heard this from me before) that their storage has been failing because it’s not very good. I hope this change does make a difference – because Jonathan’s right that storage is getting to be more important, not less.

UPDATE: One of the Dell guys who works with us (and helped us get all the nitty gritty details to configure these low-level settings) just posted his contact info in the comments. Feel free to call or email him if you have any questions.

Datacenter love: Equinix

Thursday, July 26th, 2007

I write a lot about products and companies that have potential, but aren’t quite perfect, like Amazon Unbox on TiVo and lots of Sun stuff. But this week’s outage at 365 Main, a datacenter in San Francisco (which we don’t use), reminded me that there are a few products and companies we love that I don’t say nearly enough about. So I’ll start with our datacenter, Equinix, and try to post about some of the others, too.

SmugMug got its start with 3 old used VA Linux boxes (dual 700MHz Pentium 3s with 2GB of RAM which are still in production today and have been our most reliable boxes) from a dead dotcom, which we threw into a friend’s cheap rack at Hurricane Electric. Once the money started flowing in, and we ran into HE’s power constraints and poor bandwidth, we hunted around for datacenter space. Equinix had the very best reputation among the Operations crowd here in Silicon Valley, so we gave them a shot and pulled out of Hurricane Electric.

I should warn you up front that there’s a little “sticker shock” when you first talk with Equinix (ok, and every time you need to buy more stuff from them, it returns), but in the end, it’s well worth it. It turns out that in life, some things are worth paying for. Datacenter space is certainly one of those things (and we feel like photo sharing is too!).

In the ~4 years we’ve been with Equinix, we’ve had only one major problem: They sold our power out from under us (to Yahoo) which forced us to move from one of their locations to another. Ugh. Datacenter moves, especially with hundreds of terabytes of disks, really suck. Luckily, thanks to decent system architecture and some magic from Amazon S3, we were able to do 99% of our move during normal business hours over the course of a month with no impact on our users.

In all fairness to Equinix (though this is no excuse), they weren’t the only datacenter that had poorly prepared for the ‘Power is King’ change in the datacenter landscape that happened a few years back. Plenty of other companies with other providers tell me the same story, so we’re not alone. Datacenters all over the place used to sell you mostly based on space (square footage) rather than power (watts). They all got burned when CPU and server vendors started getting really fast & dense gear. Nowadays, almost the entire negotiation is regarding power and everyone has empty dead space in their rented cages. Such is life.

On the bright side, everything else about Equinix rocks:

  • Power. I’m surprised to hear all of the horror stories out of 365 Main because I assumed they were as good as Equinix has been for us. We haven’t had a single power-related outage in all of the years we’ve been there. It just works – and it’d better, that’s the biggest reason we use a datacenter.
  • Metro cross-connects. If you’re hosted in multiple Equinix datacenters in a single metro area, like we are, you can get cheap (a few hundred bucks per month) GigE cross-connects wired between your various locations.
  • Support. I’m still surprised every time we need to use Equinix’s support staff and they’re actually super-knowledgeable and helpful. I’m talking about hardcore networking and routing questions. BGP, whatever, you name it – they know it. Better than we do.
  • Equinix Direct. I’m always surprised when I talk to other Equinix customers who don’t know about this gem. It’s a way to provision your IP transit providers on a month-by-month basis with no minimum commits or contracts. You pick your providers and pay-as-you-go. Pretty sweet. We’re already directly multi-homed on GigE with multiple providers, but we mix in Equinix Direct to have access to still more. Best thing? ED doesn’t add an extra BGP hop, so your routes still look fast (as opposed to someone like InterNAP who adds an extra BGP hop to do similar stuff).
  • Security. 5 biometric scanners are between you and your cage when you enter the building, with live security on hand 24/7. Stuff like this is fairly common at high-end datacenters, but it’s important, so I’m mentioning it anyway.
  • Bandwidth providers. Equinix is a carrier-neutral facility, and basically everyone has connectivity there, so you can easily pick whomever you’d like to carry your traffic.

Of course, they do all of the other myriad things a datacenter is supposed to do. One of the reasons I haven’t blogged about them in the past is because they just work – and they work so well, I just don’t spend much time thinking about them.

Which, of course, is the way it’s supposed to be. :)

(Now, of course, I’ve jinxed the whole thing like Red Envelope and our datacenters are going to explode in a Martian Invasion. Sorry about that!)

Silent data corruption on AMD servers

Wednesday, July 25th, 2007

One of my readers, Yusuf Goolamabbas, let me know about a nasty silent data corruption bug on AMD servers with 4GB or more of RAM running Linux. Yikes! This is the sort of thing that keeps me up at night. Yusuf linked me to two bug reports on the subject, one at kernel.org and another at Red Hat.

Lots of servers from a variety of manufacturers seem to be affected. It looks like a combination of some problem with Nvidia’s hardware (I’m not an expert, so maybe it’s AMD’s fault, but it doesn’t sound that way to me) and the Linux kernel not doing stuff properly with GART pages. Other OSes don’t seem to be affected, either because they don’t use the hardware iommu or they do things correctly in the first place.

One sucky thing? Apparently Red Hat’s fix isn’t out yet for RHEL5 or RHEL4. Ugh. You can force the kernel to use software iommu instead, but I’m glad I’m not affected.
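
As far as I know, the kernel boot option for that workaround is iommu=soft, which tells x86-64 Linux to use software bounce buffering instead of the hardware IOMMU. Here is a small Python sketch for checking whether a box was booted with it. The function name is hypothetical, and it only inspects the command line text; it does not confirm what the kernel actually ended up doing.

```python
def uses_software_iommu(cmdline):
    """Return True if a kernel command line contains iommu=soft
    (possibly alongside other comma-separated iommu options)."""
    for token in cmdline.split():
        if token.startswith("iommu="):
            options = token[len("iommu="):].split(",")
            if "soft" in options:
                return True
    return False


if __name__ == "__main__":
    # On Linux, the running kernel's command line is in /proc/cmdline.
    with open("/proc/cmdline") as handle:
        print(uses_software_iommu(handle.read()))
```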

Most of our servers have over 4GB of RAM, and as you no doubt know, we’re pretty in love with our SunFire x2200 servers, most of which have 4GB – 32GB of RAM. So I fired off a frantic email late last night to Sun, asking them if our servers have the problem.

The good news? They don’t! Whew. Maybe I’ll get some sleep tonight… :)

FYI, there are some Sun servers (and plenty from every other vendor, too) that are affected. Here’s a link to Sun Alert 102790 with more info. Sun was also good enough to send along info on a similar-sounding, but different, issue in Sun Alert 102323.

My next question for Sun will be about how ZFS would handle silent data corruption like this, since it’s supposed to be quite resilient to strange hardware behavior. My bet is that this is likely outside of the scope of things ZFS can detect and avoid (I think it’s awesome at read error detection, but I’m not sure how it could tell that a write doesn’t contain the right data. But then, I’m not as smart as they are :) )

Anyway, hope this info helps some of you out. I know I’d want to know about this stuff.

Sun Honeymoon Update: Storage

Wednesday, May 16th, 2007

As I mentioned in my review of the Sun X2200 M2 servers we got recently, which we absolutely loved, Sun’s storage wasn’t impressive at all. In fact, it was downright bad. But before I get into the gory details, I feel compelled to mention that I believe Sun’s future, including storage, is bright. Their servers rock, they’re innovating all over the place, and for the most part, the people at Sun have been fantastic to work with – even when they’re being told their storage hardware sucks. That’s impressive. Now, on with the show:

The storage arrays I’m blogging about here are Sun StorEdge 3320 SCSI arrays. For more on why we chose this particular model, you can read about my on-going search for The Perfect DB Storage Array. The bottom line, though, for us is that Speed Matters. Their list price is quite expensive, but Sun was willing to work with us on the price, and we managed to get things into a reasonable ballpark. Reasonable, that is, as long as they performed. :)

First, some details. These boxes were destined to be part of our DB layer, with the first few going in as new storage for replication slaves. The goal was to maintain a high number of small (4K) i/o operations per second (IOPS) with an emphasis on writes, since scaling reads is easier (add slaves) than scaling writes (only so many spindles you can add, etc). In this particular case, the writes were being delivered from a MySQL master using InnoDB running Linux on 3 effective 7200rpm spindles, so the Sun array, on paper, should be able to keep up, no sweat. If your needs differ, our story might not be useful – test for yourself.

Installing and configuring them was an adventure. Craig Meakin, our Server Surgeon, was tasked with installing them and immediately ran into a snag. When configured for DHCP management access (which is how they were set up out of the box, exactly how we like them), they wouldn’t actually DHCP an IP address. It took someone at Sun wading through 4 different manuals to determine that not only did the array have to be toggled to DHCP, but you also had to write “DHCP” in the IP address spot to make it work. Strike one.

(As an aside, one of Sun’s engineers also told us, after we’d bought them and installed them in the rack, that these storage arrays don’t come with battery backed write caches. Given how expensive they are, I was shocked and furious, but quickly got verification that they do, indeed, have BBWC.)

We brought one online and moved a DB slave snapshot over which was a few hours out-of-date and started replication so it could catch up with the master. Obviously, it wasn’t live and in production, so it was mostly spooling and committing writes from the master, only doing reads as needed for updates and whatnot. A very light load, in other words. With interest, we started timing how fast it would catch up, since it should scream. We were betting at least 2X (15K drives, after all) faster than the master, and on par with our other 15K SCSI slaves. Instead, we measured more than 4X *slower* than our slaves. Strike two.
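
The bet above is really simple arithmetic: a slave that applies writes faster than the master generates them closes the gap at the difference between the two rates. Here is a Python sketch, assuming constant rates (which real workloads do not have) and using a made-up function name. The 3-hour lag in the example is a stand-in for “a few hours”, not a measured number.

```python
import math


def catch_up_hours(lag_hours, apply_speed):
    """Hours until a slave that is lag_hours behind has caught up.

    apply_speed is how fast the slave applies writes relative to how fast
    the master generates them (2.0 means twice as fast). Every hour the
    backlog shrinks by (apply_speed - 1) hours of master writes.
    """
    if apply_speed <= 1.0:
        return math.inf  # never catches up; the gap stays flat or grows
    return lag_hours / (apply_speed - 1.0)


if __name__ == "__main__":
    print(catch_up_hours(3, 2.0))   # 3.0 hours at 2X the master's rate
    print(catch_up_hours(3, 0.5))   # inf: slower than the master
```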

Ok, no worries. Obviously this is a new array, and we did something terribly stupid setting it up. Sun support to the rescue, right? So we opened up tickets, dumped our config and all other relevant details to them. Nada. Oh, they came back with lots of suggestions and things to try. But none of them helped. Next step was to grab detailed system i/o statistics on production slaves which worked and Sun SE3320s that didn’t, so Sun could compare. And compare they did – their data showed a 6X performance differential between our production slaves (which had $700 off-the-shelf LSI SCSI MegaRAID cards in them) with 15K disks and Sun’s hardware. Sun was 6X slower. Final verdict? “System is performing as designed.” Strike three – they’re out!

Frantically, since the entire reason we had gone with Sun was because Rackable had shipped us a bunch of broken units and we were now months behind on an expansion project, we called Dell and ordered some PowerVault MD3000 SAS arrays. I always give Dell props for fast, efficient delivery, and knock them for a lack of innovation. In this case, they not only got us the gear fast, but the MD3000 turned out to be a fantastic DAS device and nearly perfect for our needs. Thank goodness!

Normally, that would be the end of our little tale, but as luck would have it, when Sun realized they’d laid an egg with the SE3320, they rushed us an engineering sample of their not-yet-announced (then) new StorEdge 2540 array. The good news? It performed neck-and-neck with the Dell array and uses SAS drives, which we prefer over SCSI. The bad news? They weren’t out yet and we needed storage yesterday. I believe they are out now, and I would buy the 100% SAS version, the StorEdge 2530, rather than 2540, for use in our datacenters if we hadn’t gotten the Dells.

So now we’ve got fantastic Sun servers attached to fantastic Dell storage. And our little franken-servers are as happy as can be. Fast, too.


Speed Matters.

Tuesday, May 15th, 2007


As subscribers to my blog have probably already guessed, we spend an inordinate amount of time at SmugMug trying to optimize for speed. As a media-heavy website, that’s a difficult thing to do and there are a lot of pieces. A typical gallery page at SmugMug contains 16 photos (though it may contain thousands), plus all of the other graphic elements on the page, JavaScript includes (we use lots of JS), CSS includes, and the page HTML itself.

We’ve long tracked our own internal “page render” time, but once it leaves our servers, it gets more difficult to track. There’s a huge, nasty mess of networking equipment and providers between our servers and each customer. There are paid services that will track some of this for you, but that doesn’t tell you what the actual customer experience is like. We have employees in Utah, Idaho, Ohio, Virginia, New York, London, the Netherlands, and Australia so we can get a decent idea, but nothing beats aggregate data.

Enter Alexa with their excellent service and the data it provides. They get a lot of publicity for their Traffic Rank and Reach stats, but those hardly help us at all (we have tens of thousands of customers who use their own custom domains, for example, among other problems). The stat I really love is the Speed rating. Since Alexa aggregates data from millions of people all over the world, across all page views on a site (heavy and light), we can get a really good view of just how fast our site is:

Alexa SmugMug Speed rating

The usual disclaimers about statistics, particularly Alexa’s, apply: We don’t know exactly what they’re measuring, how much or often they’re measuring it, and how many people are actually measured. But we do know that Alexa’s Speed rating has directly correlated to feedback we get from our customers, and most importantly, our customer satisfaction. That’s good enough for me.

Now, like any Alexa statistic, such as Traffic Rank, it’s best viewed in relation to other sites, rather than alone. So here’s a bunch of photo-sharing sites, both ‘larger’ and ‘smaller’ than SmugMug, and their Speed ratings in rough order of ‘size’ according to Alexa’s Traffic Rank (again, Traffic Rank is notoriously flawed, but we have to order by *something*):

Fotolog:
Alexa Fotolog Speed rating

Flickr:
Alexa Flickr Speed rating

Photobucket:
Alexa Photobucket Speed rating

Webshots:
Alexa Webshots Speed rating

PBase:
Alexa PBase Speed rating

Kodak Gallery:
Alexa Kodak Gallery Speed rating

SmugMug:
Alexa SmugMug Speed rating

Shutterfly:
Alexa Shutterfly Speed rating

Snapfish:
Alexa Snapfish Speed rating

Now, we’re not perfect. SmugMug, like every other site on the net, has problems. But we try very very hard to keep the site speedy and responsive – and I think both the stats above and our customer satisfaction speak volumes. And I think it’s only fair to note that some of those sites handle more page and photo requests per day than we do – but we left the “small site” size behind long ago, so I wouldn’t discount our size too much. It’s probably only fair to note that with the possible exception of PBase, they all have massive financial resources in comparison to ours, though, too.

We have a huge laundry list of things we can do to speed the site up even more, so look for us to shave more milliseconds off your page load times as we go forward. And I have to thank our Ops team, Andrew Gibbons – Director of Operations & Craig Meakin – Server Surgeon, and programming team, Jimmy Thompson – Web Superhero & Lee Shepherd – SmugSorcerer. Couldn’t have gotten below 1 second without them!

If there’s enough interest, I can do a follow-up post on lots of the tricks we use to get there. I don’t think we do anything earth-shattering, but lots of small things add up. Let me know.

Amazon S3: New pricing model

Tuesday, May 1st, 2007

I’m getting emails about Amazon’s new S3 pricing model, so I guess the news is out. :)

For us, this is great. First, we’ll save money right off the top (we upload a lot, so $0.10/GB uploaded vs $0.20/GB uploaded is a big deal). Second, they finally have tiered download transfer costs. This is a big one for us, because we buy enough bandwidth that $0.20/GB wasn’t cost-effective enough for us.

I’m going to have to run some numbers (I’m at MIX right now) to see if it’s now good enough for us to start serving more content out of S3 or not, but even if it’s still not perfect for us, it’s a major move in the right direction.

Finally, this illustrates a subtle but important point of using S3. When I buy physical disks at SmugMug, those are sunk costs. They’ll never get cheaper because I’ve already paid for them. At Amazon, though, market forces and changes will cause their pricing model to continue to re-adjust downwards. As disks get cheaper, that $0.15/GB/month fee will drop. And instantly all of your storage magically gets cheaper, no sunk costs to worry about.
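
To make the numbers concrete, here is a Python sketch using only the prices mentioned in this post ($0.15/GB/month storage, $0.20 vs $0.10 per GB uploaded). The volumes in the example are invented for illustration, not our real usage, and the sketch ignores download transfer and any other charges in the new model.

```python
from decimal import Decimal

STORAGE_PER_GB_MONTH = Decimal("0.15")
OLD_UPLOAD_PER_GB = Decimal("0.20")
NEW_UPLOAD_PER_GB = Decimal("0.10")


def monthly_cost(stored_gb, uploaded_gb,
                 upload_price=NEW_UPLOAD_PER_GB,
                 storage_price=STORAGE_PER_GB_MONTH):
    """Storage plus upload cost for one month, in dollars."""
    return stored_gb * storage_price + uploaded_gb * upload_price


if __name__ == "__main__":
    # Hypothetical month: 1,000 GB stored, 500 GB uploaded.
    old = monthly_cost(1000, 500, upload_price=OLD_UPLOAD_PER_GB)
    new = monthly_cost(1000, 500)
    print(old, new, old - new)  # 250.00 200.00 50.00
```

The same function shows the sunk-cost point: lower storage_price and every GB already stored gets cheaper, which never happens with disks you’ve already bought.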

That happened today, and I’m sure it’ll happen over and over again as storage & bandwidth both get cheaper and Amazon is able to leverage their scale to get better deals. The more people use S3, the more Amazon can drive prices down.

Since we were already saving a ton of money using S3, this is music to my ears. :)

The Perfect DB Storage Array

Friday, April 27th, 2007

I’ve long known that YouTube had a secret weapon in their datacenter codenamed ‘Colin’, but yesterday at the MySQL Conference, I met three more secret weapons – codenamed ‘Paul’ and his team (sorry, guys, I’ve forgotten your names!).

Paul and his team are incredible. Paul’s keynote was easily the most interesting thing for me at the entire conference because of how technical and authoritative it was. It certainly helped that he spoke our language – he got down and dirty with his hardware, not just MySQL tuning variables, and discussed real world fixes. Plenty of other MySQL sessions were interesting, but most of them focused at a high level rather than down near the bare metal. We’ve long since left most of the high level stuff behind and are, ourselves, focused on bare metal.

Best of all, the MySQL team at YouTube sees eye-to-eye with us when it comes to DB storage arrays. There are a few differences, I think, but we’re essentially very similar. Hopefully my description of our ideal, perfect, high-performance DB storage array can help out any other startups out there looking for solutions. Certainly having our internal assumptions validated by YouTube helps.

I hate the “queries per second” or “queries per day” metrics, because they tell you absolutely nothing about how complicated or long the queries are, but we do many billions of queries per day, if you’re into those metrics. So we care a great deal about getting good, fast hardware.

The list:

  • We like DAS for our DB boxes, with RAID controllers in the external enclosure, rather than internal disks. This is one area I’m not sure YouTube agrees with us (they might, we just didn’t discuss it). Let me explain:
    • When a server has some fatal hardware problem, we like to just yank it out of the rack, slide another identical server in place, hook it up to storage, and turn it on. No mess, no fuss.
    • Using LVM, we can add more storage and/or more spindles easily.
    • We’ve had problems in the past with RAID controllers failing and new ones not correctly picking up the RAID tags on the drives. External enclosures have two controllers, making single card failures less problematic.
  • We love spindles. The more the merrier. Our typical RAID 1+0 array has 14 of them, making 7 effective spindles. At best, that means we can do 7 concurrent operations.
  • We love fast spindles. Give us 15K drives any day of the week.
  • We love enclosures with odd-numbered drives. 15 drives, 13 drives, something odd. Why? Because we want *1* hot spare, not 2, and want the rest of the spindles for reads & writes.
  • We love big battery-backed write caches. We stick them in write-back mode for super-fast writes (easily the hardest thing in a DB to scale).
  • We hate read caching. We disable it entirely. The cache on RAID controllers is relatively puny (128-512M) compared to the RAM in our DB boxes (32GB), so any reads that aren’t in our DB’s main memory certainly aren’t going to be in the RAID controller cache. We want every byte in the cache for writes. Plus, we don’t want read cache misses to get serialized behind the pending writes.
  • We hate prefetching. We disable it entirely. The DB is smart enough to request data it needs without the RAID controllers trying to be smart and tying up disks and the entire I/O path with extra data we don’t need.
  • We want very configurable stripe/chunk sizes. Some controllers just have presets, like “DB”, which often have tiny (16K) sizes. Ugh. We want 1MB+ stripes.
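
The odd-numbered-enclosure preference is easy to check with a few lines of Python. This sketch just does the bookkeeping for a RAID 1+0 array with hot spares; the function name is made up, and it treats every mirrored pair as one “effective spindle” the way the list above does.

```python
def raid10_layout(bays, hot_spares=1):
    """Split an enclosure into hot spares and mirrored pairs for RAID 1+0.

    Raises ValueError if the drives left after removing the hot spares
    cannot be paired up evenly.
    """
    in_array = bays - hot_spares
    if in_array < 2 or in_array % 2 != 0:
        raise ValueError("%d drives cannot form mirrored pairs" % in_array)
    return {
        "drives_in_array": in_array,
        "effective_spindles": in_array // 2,
        "hot_spares": hot_spares,
    }


if __name__ == "__main__":
    print(raid10_layout(15))  # 14 drives in the array, 7 effective spindles, 1 spare
    print(raid10_layout(13))  # 12 drives in the array, 6 effective spindles, 1 spare
```

Calling raid10_layout(14) raises ValueError: with one hot spare, an even-numbered enclosure leaves 13 drives, so you either waste a bay or burn a second spare.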

Now, unfortunately, finding arrays that do all of this stuff is tough. We end up haggling with vendors, or wrestling with configurations, etc. And usually we have to compromise on one or two of the items. :( I think we’re close to finding an ideal one, but we’re not quite there yet. You’ll hear it here first when we do.

If you’re willing to lose DAS, LSI (and, according to YouTube, Adaptec) lets you get at most of the settings I mention above. I haven’t used 3Ware for a while, but I understand that they do not. If I’m wrong, someone please correct me.

Finally, our typical DB class machine is a Sun X2200 M2 with two dual-core CPUs and 32GB of RAM. The RAM and the disk stuff I talk about above are far far more important for our workload than the # or speed of CPUs, and it sounds like the same holds true at YouTube. We’re popping SAS cards in them and attaching to DAS units.

Anyway, hope that helps any of you out there wondering what to buy. I will still blog about Sun’s storage shortly (and why it didn’t match up to what we needed), I’ve just been busy. This should help add some context, though.

Give me a few more days. :)

UPDATE: Found one that does nearly everything we want – the Dell MD3000.

Sun Honeymoon Update: Servers

Wednesday, April 11th, 2007

It’s been two months since we divorced Rackable and married Sun as our new server & storage vendor and lots of people have been asking how it’s going. So while the ‘marriage’ is still early, the server side of things is going really, really well. We’re still starry-eyed in love. Our experience with Sun’s storage hardware isn’t nearly so rosy (in fact, it’s downright bad), but I’ll cover that in a near-future update.

So, what do we love about our new server partner?

  • We can standardize on a single server platform for 99% (if not 100%) of our future server needs. The SunFire X2200 M2 servers are 1U and scale up to 2 x dual-core Opterons with 32GB of RAM (and, as important, down to 1 Opteron w/2GB of RAM). For us, that’s huge. Imagine, if you will, some catastrophe befalling one of our database boxes that requires hardware replacement. Instead of having lots of expensive, idle, duplicate hardware around, we could literally crack open a web server, add some more RAM and an external HBA card, and boom, we have a new DB box. There are many reasons Southwest is the most profitable US airline and a huge one is standard components.
  • Their lights-out management (LOM) is a dream. I dinged the Sun T1000 last year because its LOM is pretty terrible, but the X2200’s LOM is freaking fantastic. How fantastic? Let me count the ways:
    • It’s ethernet rather than serial. Yay!
    • It can share the same ethernet port the OS does. One wire for both LOM and OS! Less datacenter mess. Double yay!
    • It has a built-in Web UI that lets you see and access all of the features, in addition to telnet and SSH.
    • The Web UI lets you actually view the VGA output on the console. Not just serial console redirection – actual video output.
    • The LOM lets you remotely mount ISOs, floppy images, etc. Got a CD or DVD on your desktop at the office that you wish was in the drive at your datacenter? No problem.
    • Built-in email notification ability for status changes.
    • Lots of SNMP settings. Haven’t played with this much yet, but it looks full-featured.
    • Lots and lots of hardware details, like motherboard and BIOS versions, NIC details, etc., are all right there.
    • All of the statuses (fan speeds, temp readings, voltage indicators, etc.), with tons of detail, are at your fingertips.
  • Well built. First of all, it’s amazing what’s crammed into this 1U footprint. But second, it’s gorgeous inside. It’s clear that someone(s) spent a lot of time and energy working on the layout so that everything fit together just right. Feels like a labor of love. Nothing looks out of place.
  • I gave the T1000 props for the way Sun does illustrations on their lids to show what parts are hot-swappable vs cold-swappable, and the X2200 is no exception. The lid is printed with all kinds of useful diagrams that make servicing the hardware much, much easier. I’m a sucker for attention to detail (one reason I love Apple).
  • Turnaround time was excellent with both orders we’ve placed so far. We don’t have the luxury of planning for projects months and months in advance, so moving quickly when we need new hardware is key.
  • Pricing was great. Thanks to Sun’s AMD (and soon, Intel) server platforms, their pricing is competitive with everyone else. I truly believe that the baseline hardware (CPU, RAM, HDDs) has become commodity and that the differentiating value is in the extra technology (like LOM), service, and support. Sun gets this, I think.
  • Their rails just work. This is more rare than you might imagine – sucky rails really suck. Sun’s rails do what they’re supposed to – make it easy to install and, later, get access to your servers.
  • Their diagnostic CD was extremely useful and easy to use. This is an often overlooked area, but we were unlucky enough to get some bad RAM (see below), and this came in handy.
  • Fast. I thought this went without saying, since the performance bits are commodity components, but as you’ll see from the storage problems we had, speed on paper doesn’t always equal speed in the datacenter. These boxes are as fast as they should be – screaming.

So what’s not to like? Nothing’s bad enough that we’d kick Sun outta bed for eating crackers, but there are some quirks:

  • We bought these direct from Sun, with custom configurations, and I believe Sun is still trying to get their head around direct sales (vs VARs). As a result, it turns out that they arrived without all of the RAM already installed. No biggie, we just installed it ourselves. Only thing is, the RAM also wasn’t tested beforehand. We’re used to our systems being fully tested & burned-in prior to delivery, and sure enough, we got a bad piece of RAM. That sucked. For now, we’re just adding a day of burn-in to our install routine, but we’re hoping Sun standardizes on this in the future. UPDATE 1: Just got word from Sun, there is an option to have custom configs burned-in at no cost, but it adds an extra 2-3 weeks to the lead time. We’ll have to think about how to best use this here, since we usually want our gear fast.
  • As I mentioned in our engagement announcement, the sales and approval process (not the people) sucks. Having to go through the approval process over and over for each order that’s slightly different isn’t pleasant. Dell excels at this, by comparison. They fire off quotes (and hardware!) with lightning speed. Here’s how I wish it would work:
    • Sun goes through the approval process for SmugMug and assigns us a discount.
    • From then on, we can just go login to sun.com and place orders for as much (or as little) hardware as we want that day and it automagically applies our discount.
    • Should we think our sales volume warrants a bigger discount or something, we re-engage to re-evaluate.
    • Our sales team at Sun gets to focus on keeping us up-to-date on new technology, roadmap changes, and everything else without wasting time on the approval process for small orders that are similar to orders we’ve placed in the past.
    • We’re happy, Sun’s happy, everyone’s happy.

If we could change anything about them, would we? Of course!

  • Love to see dual power supplies. Since power supplies are a very common failure point for servers, we like redundancy here. (The moving parts fail far more often than our circuits do, so surprisingly, we don’t want dual power supplies to handle circuit failures).
  • While we’re dreaming, I’d love to see DC power as an option and remove AC from the equation. We could get lower failure rates, better power utilization, and better redundancy in one fell swoop.
  • And if we really want to get pie-in-the-sky, I’d love to see some sort of liquid or gas cooling system so we can get cooling efficiencies too. This is way outside of my field of expertise, so I don’t know how it would work, but Blackbox seems like it has some great stuff along these lines.

Stuff we really haven’t kicked the tires on yet:

  • We typically whip out our amp meter and take power readings as soon as we get new hardware in our datacenter, since power & cooling are huge concerns for us. This time, we were under such a time crunch (and so busy with all of the nasty storage problems I’ll blog about soon) that I haven’t had time. I’m hopeful that all of Sun’s noise about power efficiency is reflected in real-world readings, but I won’t know for sure until I get the hardware out and test it.

And finally, everyone at Sun deserves a shout out. They’ve built a great product, and they’ve certainly shown us a great deal of support and personal attention, which we appreciate. If the people we’ve dealt with are any indicator of upcoming success, Sun’s future looks bright. (No pun intended).

I will post a follow-up shortly detailing the nightmare that our quest for fast DB storage became and what we’ve managed to do about it, but for now, I hope this helps anyone looking for server solutions.

Bottom line: I can’t recommend the X2200 M2 highly enough.