Success with OpenSolaris + ZFS + MySQL in production!

Pimp My Drive by Richard and Barb

There’s remarkably little information online about using MySQL on ZFS, successfully or not, so I did what any enterprising geek would do: Built a box, threw some data on it, and tossed it into production to see if it would sink or swim. :)

I’m a Linux geek, have been since 1993 (Slackware!). All of SmugMug’s datacenters (and our EC2 images) are built on Linux. But the current state of filesystems on Linux is awful, and it’s been awful for at least 8 years. As a result, we’ve put our first OpenSolaris box into production at SmugMug and I’ve been pleasantly surprised with the performance (the userland portions of the OS, though, leave a lot to be desired). Why OpenSolaris?

ZFS.

ZFS is the most amazing filesystem I’ve ever come across. Integrated volume management. Copy-on-write. Transactional. End-to-end data integrity. On-the-fly corruption detection and repair. Robust checksums. No RAID-5 write hole. Snapshots. Clones (writable snapshots). Dynamic striping. Open source software. It’s not available on Linux. Ugh. Ok, that sucks. (GPL is a double-edged sword, and this is a perfect example). Since it’s open-source, it’s available on other OSes, like FreeBSD and Mac OS X, but Linux is a no go. *sigh* I have a feeling Sun is working towards GPL’ing ZFS, but these things take time and I’m sick of waiting.

The OpenSolaris project is working towards making Solaris resemble the Linux (GNU) userland plus the Solaris kernel. They’re not there yet, but the goal is commendable and the package management system has taken a few good steps in the right direction. It’s still frustrating, but massively less so. Despite all the rough edges, though, ZFS is just so compelling I basically have no choice. I need end-to-end data integrity. The rest of the stuff is just icing on an already delicious cake.

The obvious first place to use ZFS was for our database boxes, so that’s what I did. I didn’t have the time, knowledge of OpenSolaris, or inclination to do any synthetic benchmarking or attempt to create an apples-to-apples comparison with our current software setup, so I took the quickest route I could to have a MySQL box up and running. I had two immediate performance metrics I cared about:

  • Can a MySQL slave on OpenSolaris with ZFS keep up with the write load with no readers?
  • If yes, can the slave shoulder its fair share of the reads, too?

Simple and to the point. Here’s the system:

  • SunFire X2200 M2 w/64GB of RAM and 2 x dual-core 2.6GHz Opterons
  • Dell MD3000 w/15 x 15K SCSI disks and mirrored 512MB battery-backed write caches (these are really starting to piss us off, but that’s another post…)

The quickest path to getting the system up and running resulted in lots of variables in the equation changing:

  • Linux -> OpenSolaris (snv_95 currently)
  • MySQL 5.0 -> MySQL 5.1
  • LVM2 + ext3 -> ZFS
  • Hardware RAID -> Software RAID
  • No compression -> gzip-9 volume compression

Whew! Lots of changes. Let me break them down one by one, skipping the obvious first one:

MySQL – MySQL 5.1 is nearing GA, and has a couple of very important bug fixes for us that we’ve been working around for an awfully long time now. When I downloaded the MySQL 5.0 Enterprise Solaris packages and they wouldn’t install properly, that made the decision to dabble with 5.1 even easier – the CoolStack 5.1 binaries from Sun installed just fine. :)

Going to MySQL 5.1 on a ~1TB DB is painful, though, I should warn you up front. It forced ‘REPAIR TABLE’ on lots of my tables, so this step took much longer than I expected. Also, we found that the query optimizer in some cases did a poor job of choosing which indexes to use for queries. A few “simple” SELECTs (no JOINs or anything) that would take a few milliseconds on our 5.0 boxes took seconds on our 5.1 boxes. A little bit of code solved the problem and resulted in better efficiency even for the 5.0 boxes, so it was a net win, but painful for a few hours while I tracked it down.

Finally, after running CoolStack for a few days, we switched (on advice from Sun) to the 5.1.28 Community Edition to fix some scalability issues. This made a huge difference so I highly recommend it. (On a side note, I wish MySQL provided Enterprise binaries for 5.1 for their paying customers to test with). The Google & Percona patches should make a monster difference, too.

Volume management and the filesystem – There’s some debate online as to whether ZFS is a “layering violation” or not. I couldn’t care less – it’s pure heaven to work with. This is how filesystems should have always been. The commands to create, manage, and extend pools are so simple and logical you basically don’t even need man pages (discovering disk names, on the other hand, isn’t easy. I finally used ‘format’ but even typing it gives me the shivers…).

zpool create MYPOOL c0t0d0

You just created a ZFS pool. Want a mirror?

zpool create MYPOOL mirror c0t0d0 c0t0d1

Want a striped mirror (RAID-1+0) w/spare?

zpool create MYPOOL mirror c0t0d0 c0t0d1 mirror c0t0d2 c0t0d3 spare c0t0d4

Want to add another mirror to an already striped mirror (RAID-1+0) pool?

zpool add MYPOOL mirror c0t0d5 c0t0d6

Get the idea? Super-easy. Massively easier than LVM2+ext3 where adding a mirror is at least 4 commands: pvcreate, vgextend, lvextend, resize2fs – usually with an fsck in there too.
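For comparison, here is a rough sketch of the LVM2 + ext3 sequence named above. The device, volume group, and logical volume names are made up for illustration, and the script only prints the commands instead of running them, so treat it as an outline rather than a tested procedure.

```shell
#!/bin/sh
# Dry-run sketch: print the LVM2 + ext3 steps for growing a filesystem onto a
# newly added device. All three names below are hypothetical placeholders.
NEW_DEV=/dev/sdX        # hypothetical: the newly added (hardware-mirrored) device
VG=vg_mysql             # hypothetical volume group
LV=/dev/vg_mysql/data   # hypothetical logical volume holding the ext3 filesystem

lvm_grow_plan() {
    echo "pvcreate $NEW_DEV"           # 1. initialize the device as a physical volume
    echo "vgextend $VG $NEW_DEV"       # 2. add it to the volume group
    echo "lvextend -l +100%FREE $LV"   # 3. grow the logical volume into the new space
    echo "resize2fs $LV"               # 4. grow ext3 to fill the logical volume
}

lvm_grow_plan
```

The ZFS equivalent is the single ‘zpool add’ line shown earlier, with the new space available to every filesystem in the pool immediately.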

Software RAID – This is something we’ve been itching for for quite some time. With modern system architectures and modern CPUs, there’s no real reason “storage” should be separate from “servers”. A storage device should be just a server with some open-source software and lots of disks. (The “open source” part is important. I’m sick of relying on closed-source RAID firmware). The amount of flexibility, performance, reliability and operational cost savings you can achieve with software RAID rather than hardware is enormous. With real datacenter-grade flash storage devices just around the corner, this becomes even more vital. ZFS makes all of this stuff Just Work, including properly adjusting the write caches on the disk, eliminating the RAID-5 write hole, etc. Our first box still has a battery-backed write-cache between the disks and the CPU for write performance, but all the disks are just exposed as JBOD and striped + mirrored using ZFS. It rocks.

Compression – Ok, so this is where the geek in me decided to get a little crazy. ZFS allows you to turn on (and off) a variety of compression mechanisms on-the-fly on your pool. This comes with some unknown (depends on lots of factors, including your workload, CPUs, etc) performance penalty (CPU is required to compress/decompress), but can have performance upsides too (smaller reads and writes = less busy disk).

InnoDB is notoriously bad at disk usage (we see 2X+ space usage using InnoDB) and while it’s not an enormous concern, it’d be something nice to curtail. On most of our DB boxes, we have idle CPU around (we’re not really I/O bound either – MySQL is a strange duck in that you can be concurrency bound without being either CPU or I/O bound fairly easily thanks to poor locking), so I figured I’d go wild and give it a shot.

Lo and behold, it worked! We’re getting a 2.12X compression ratio on our DB, and performance is keeping up just fine. I ran some quick performance tests on large linear reads/writes and we were measuring 45.6MB/s sustained decompression and 39MB/s sustained compression on a single-threaded app on an Opteron CPU. We’ll probably continue to test compression stuff, and of course if we run into performance bottlenecks, we’ll turn it off immediately, but so far the mad science experiment is working.
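To put that ratio in terms of disk space, here is a small sketch that turns a ZFS ‘compressratio’ value into a percentage saved. The function name is my own invention; the only ZFS-specific assumption is that the property is reported with a trailing “x”, as in “2.12x”.

```shell
#!/bin/sh
# Convert a ZFS compressratio value such as "2.12x" into percent of space saved.
space_saved_pct() {
    ratio=${1%x}    # strip the trailing "x" that zfs prints after the number
    awk -v r="$ratio" 'BEGIN { printf "%.1f\n", (1 - 1 / r) * 100 }'
}

space_saved_pct 2.12x   # the ratio quoted above

# On a live system (not run here) the value could come straight from the pool:
#   space_saved_pct "$(zfs get -H -o value compressratio MYPOOL)"
```

For 2.12X that works out to a bit over half the space saved, which roughly cancels the 2X+ overhead InnoDB adds in the first place.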

Configuration

Configuring everything was relatively painless. I bounced a few questions off of Sun (imho, this is where Sun really shines – they listen to their customers and put technical people with real answers within arm’s reach) and read the Evil Tuning Guide to ZFS. In the end I really only ended up tweaking two things (plus setting compression to gzip-9):

  • I set the recordsize to match InnoDB’s – 16KB.
    zfs set recordsize=16K MYPOOL
  • I turned off file-level prefetching. See the Evil Tuning Guide. (I’m testing with this on, now, and so far it seems fine).
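Pulled together, those settings might look like the sketch below. It only prints the commands for a given pool name. The function name is hypothetical, and the /etc/system line is the prefetch tunable as I understand it from the Evil Tuning Guide, so check it against the guide for your build before using it.

```shell
#!/bin/sh
# Print (do not run) the ZFS tuning described above for a given pool.
zfs_tuning_plan() {
    pool=$1
    echo "zfs set recordsize=16K $pool"       # match InnoDB's 16KB page size
    echo "zfs set compression=gzip-9 $pool"   # highest gzip level
    # File-level prefetch is a kernel tunable: set in /etc/system, applied at boot
    echo "echo 'set zfs:zfs_prefetch_disable = 1' >> /etc/system"
}

zfs_tuning_plan MYPOOL
```

Both ‘recordsize’ and ‘compression’ only affect data written after the change, so they are best set before loading the database files onto the pool.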

I believe since ZFS is fully checksummed and transactional (so partial writes never occur) I can disable InnoDB’s doublewrite buffer. I haven’t been brave enough to do this yet, but I plan to. I like performance. :)

Performance

This box has been in production in our most important DB cluster for two weeks now. On the metrics I care about (replication lag, query performance, CPU utilization, etc) it’s pulling its fair share of the read load and keeping completely up on replication. Just eyeballing the stats (we haven’t had time to number crunch comparison stats, though we gave some to Sun that I’m hoping they crunch), I can’t tell a difference between this slave and any of the others in the cluster running Linux. I sure feel a lot better about the data integrity, though.
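If you want to watch the replication lag number yourself, one simple way is to pull ‘Seconds_Behind_Master’ out of ‘SHOW SLAVE STATUS’. This is a minimal sketch, not our actual monitoring; the function name is made up, and it assumes the mysql client can log in without prompting.

```shell
#!/bin/sh
# Extract Seconds_Behind_Master from `SHOW SLAVE STATUS\G` output read on stdin.
slave_lag() {
    awk -F': *' '/Seconds_Behind_Master/ { print $2 }'
}

# Usage on a slave (not run here):
#   mysql -e 'SHOW SLAVE STATUS\G' | slave_lag
```

A value that stays at or near 0 means the slave is keeping up with the write load; NULL means replication isn’t running at all.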

Why not [insert other OS here]?

We could have gone with Nexenta, FreeBSD, Mac OS X, or even *gulp* tried ZFS on FUSE/Linux. To be honest, Nexenta is the most interesting because it actually *is* the Solaris kernel plus Linux userland, exactly what I wanted. I’ve played with it a tiny bit, and plan to play with it more, but this is a mission-critical chunk of data we’re dealing with, so I need a company like Sun in my corner. I find myself wishing Sun had taken the Nexenta route (or offered support for it that I could buy or something). Instead, we’ll be buying software service & support from Sun for this and any other mission-critical OpenSolaris boxes.

FreeBSD also doesn’t have the support I need, Mac OS X wasn’t performant enough the last time I fiddled with it as a server, and most FUSE filesystems are slow so I didn’t even bother.

Gotchas

  • On my 64GB Linux boxes, I give InnoDB 54GB of buffer pool size. With otherwise exactly the same my.cnf settings, MySQL on OpenSolaris crashes with anything more than 40GB. 14GB, or 21.9% of my RAM, that I can’t seem to use effectively. Sun is looking into this, I’ll let you know if I find anything out.
  • For a Linux geek, OpenSolaris userland is still painful. Bear in mind that this is a single-purpose box, so all I really want to do is install and configure MySQL, then monitor the software and hardware. If this were a developer box, I would have already given up. OpenSolaris is still very early, so I’m still hopeful, but be prepared to invest some time. Some of my biggest peeves:
    • Common commands, like ‘ps’, have very different flags.
    • Some GNU bins are provided in /usr/gnu/bin – but a better ‘ps’ is missing, as is ‘top’ (no, ‘prstat’ is *not* the same!), ‘screen’, etc (Can anyone even use remote command-line Unix boxes without ‘screen’? If so, how?)
    • Packages are crazily named, making finding your stuff to install tough. Like instead of Apache being called ‘apache’ or ‘httpd’, it’s called ‘SUNWapch’. What?
    • After finally figuring out how to search for packages to get the names (‘pkg search -r Apache’ – which doesn’t provide pleasant results), I discovered that ‘top’ and ‘screen’ just simply aren’t provided (or they’re named even worse than I thought). Instead, I had to go to a 3rd party repository, BlastWave, to get them. And then, of course, the ‘top’ OpenSolaris package wouldn’t actually install and I had to manually break into the package and extract the binary. Ugh.

Whew! Big post, but there was a lot of ground to cover. I’m sure there are questions, so please post in the comments and I’ll try to do a follow-up. As I fiddle, tweak, and change things I’ll try to post updates, too – but no promises. :)

UPDATE: One other gotcha I forgot to mention. When MySQL (or, presumably, anything else running on the box) gets really busy, user interactivity evaporates on OpenSolaris. Just hitting enter or any other key at a bash prompt over SSH can take many seconds to register. I remember when Linux had these sort of issues in the past, but had blissfully forgotten about them.

UPDATE: I went more in depth on ZFS compression testing and blogged the results. Enjoy!

82 Responses to “Success with OpenSolaris + ZFS + MySQL in production!”

  1. Chris Ryland Says:

    Have you guys ever seriously tried Postgres? Seems like it’s a much more performant “large system” database…

  2. Glynn Foster Says:

    Nice post Don – always keen to hear experiences of using OpenSolaris (especially what things you’re tripping up against) and thanks for your support. Our obvious goal is to avoid making it the frustrating experience it has been in the past. We should have ‘top’ and ‘screen’ both available for 2008.11 (from b100). Search should be drastically improved for 2008.11, but the package refactoring/naming won’t happen until 2009.04. Also interested in any userspace slowness metrics you have too!

  3. andy.edmonds.be › links for 2008-10-10 Says:

    [...] SmugBlog: Don MacAskill » Blog Archive » Success with OpenSolaris + ZFS + MySQL in production! (tags: zfs mysql solaris unix) This was written by andy. Posted on Saturday, October 11, 2008, at 1:30 am. Filed under Delicious. Bookmark the permalink. Follow comments here with the RSS feed. Post a comment or leave a trackback. [...]

  4. Mark Callaghan Says:

    Thanks for all of the details and for the awesome mask pictures — http://cmac.smugmug.com/gallery/2504559#131481399_ZnZmK

    I think that compression in ZFS will work better than compression in the storage engine as InnoDB has in the 5.1 plugin because more of it can be done in the background by the kernel rather than by the thread executing a query.

    15 15k disks sounds nice. I wish I had such gear. Do you ever get write bound? Turning off the double write buffer can make a big deal when there is a lot of writing to be done. With it on you do: large (<= 2MB) sequential IO to the doublewrite buffer, sync, <= 128 random IOs to the database, sync.

    Do you run with buffered IO or O_DIRECT? Make MySQL give you a build with support for more InnoDB background IO threads and configurable background IO rate limiting. With a few changes you might be able to get much more out of the very nice system (ZFS + many fast disks + big RAM) you are running.

  5. Don MacAskill Says:

    @Chris Ryland: First of all, I tend to choose my open-source technologies by the size of their user community. So Postgres fell down there first. Second, the replication was just awful the last time I used it. Unusable, basically, which was a deal-killer. Replication (even with the major limitations MySQL currently has) is one of MySQL’s secret weapons.

  6. Don MacAskill Says:

    @Glynn Foster: Hey, great news! Thanks! I updated the blog post about userspace slowness because that was another gotcha I forgot to mention. It gets unbearably slow sometimes (I assume it’s hoarding all the slices for MySQL or something).

  7. Don MacAskill Says:

    @Mark Callaghan: Thank *you* for all the patches. Before Google stepped up to the plate, I was seriously losing hope in InnoDB.

    I agree, filesystem compression sounded like the saner choice to me as well, hence the experiment. So far it seems to really be paying off. I’d like to find out exactly how much latency it adds, but it doesn’t seem to really be human measurable, at least with this load on this system, which is good enough for me to leave it on and keep watching.

    We’re not really write limited currently, no, but I’ve found that having lots of fast disks can mask the concurrency problems with InnoDB in many cases. I assume that’s just because the writes are returning faster so the locks aren’t being held as long, but I don’t know for sure. You’d probably know better than I would. :) I didn’t mention it in the post, but late last night we added another 15 x 15K disks to each of the members in this cluster, so they’re actually all 30 now. Adding to the pool with ZFS was so insanely simple it boggled my mind. (We added the disks due to storage constraints, not I/O constraints)

    ZFS doesn’t support O_DIRECT, so this slave is using buffered IO. The Linux boxes in the cluster all use O_DIRECT and we’ve seen significant gains with it. A few people at Sun have said they’ve seen workloads on ZFS where using a much smaller InnoDB buffer pool and relying on ZFS’s disk cache resulted in performance increases, but that’s deviating even farther from a direct comparison to our Linux slaves, so I haven’t played with it yet. It’s on the list.

    And yes, I’m dying to get a MySQL build with all the great patches you guys have provided, preferably against 5.1.28. It’s on my todo list – I’ll build it myself if I have to, but keeping track of which trees have which patches is getting interesting. We went from a complete drought of patches to a flood in no time – thank goodness! :) Looks like you’re steadily updating the stuff on Launchpad, so I’ll check there first.

  8. Mark Callaghan Says:

    @Don – I am thrilled that there are many outlets for patches now. I get to lobby Percona and MySQL and Drizzle to use bits of the Google patch and eventually end-users will benefit. It also helps that MySQL users like you document the problems. People at MySQL are much more aware of the problems because of this.

    With buffered IO there is not much benefit from more background IO threads for writes, but they still help for reads. With respect to shrinking the InnoDB buffer cache, up to half of it can be used for the insert buffer. On my workloads I think that I get at least a 5X reduction in writes for secondary index maintenance because of the insert buffer.

  9. Dan Says:

    OpenSolaris is a win. But, even more than ZFS, dtrace is the reason in my book.

    And I agree that if Postgres replication were better, it would clearly be the way to go over MySQL.

  10. Brian Aker Says:

    Hi!

    What was the nature of the “A little bit of code solved the problem and resulted in better efficiency” change that you made?

    It is good to see some real world interest in ZFS.

    Cheers,
    -Brian

  11. D. Price Says:

    Don,

    As a member of the team building the OpenSolaris packaging system (rogue’s gallery here: http://dp.smugmug.com/gallery/4882941_mCL2n#291253680_XLbpL ), and as a loyal and longtime smugmug customer (and advocate) I can say that we are working hard to improve the packaging system as fast as possible. And I think there is alignment between what you want, and what we are trying to deliver.

    As Glynn said, the 2008.11 release, while not yet perfect, is going to be better. Searches are now case-insensitive by default, as an example. The output of ‘pkg list -a’ will be more useful. Performance of the ‘pkg’ command should be better (there’s more to do, though). We’re more robust in the face of network problems. ‘pkg verify’ works a lot better. The depot server web pages at http://pkg.opensolaris.org should be more attractive and more useful pretty soon. And so forth. On the free software side, things are getting better too, although I’m similarly annoyed about the lack of ‘screen.’ I thought it had already been done, but apparently not. Time to go harass some people.

    While the os may provide you a ‘top’, prstat is better and you should use it. :) Besides being a pile more efficient than top, one especially nice feature in prstat is that process RSS calculations when considering aggregations of processes (like with -a, -Z, or -J) are actually accurate, accounting properly for sharing in the VM system. prstat -m is also amazingly useful because of the much finer grained detail about process events.

    Best wishes, and thanks for being a maverick. I just ordered some prints the other day, on lustre paper, and they came out awesome.

    -dp

  12. rb Says:

    you might want to check out Nexenta. its a GNU userland based off Ubuntu packages thrown on top of the OpenSolaris kernel. to me, it feels like kind of a bizarre hybrid, but I feel like Solaris sucks so hard, I’d rather use something a little wonky thats close to Linux than just use something wonky.

    there’s always FBSD too, although it lacks the Zones that the Solaris kernel provides.

  13. William Hathaway Says:

    Since you (currently) want to give a lot of the system’s memory to MySQL, have you considered limiting the size of ZFS’s ARC cache? The Evil Tuning Guide talks about this (including an arcstats script which can be helpful to understand how useful the ARC is being for you).

  14. Kris Warkentin Says:

    Once you start using dtrace with serious intent, you will never wish for the Linux userspace again. Dtrace is the wind beneath my wings.

  15. Silveira Neto Says:

    Excellent and well written post Don.
    I saved it to show for some friends. ;)

  16. patrick giagnocavo Says:

    I would suggest the following:

    1. check what your default scheduler is with “dispadmin -l ” and possibly consider using the Fair Share Scheduler – should let interactivity get back to normal without interfering with MySQL.

    2. Alternatively try using gzip on its regular setting instead of with -9 ; should see almost the same efficiency on disk space, while CPU will be much less. Are the disks busy when you see the issue with interactive response?

    3. For better visibility on your disks, use “iostat -xnc 5” which will give you disk stats every 5 seconds. Look at the average service time and percentage of disk busy.

    4. There may be some system tunables to tweak – possibly the reason you have 14GB unavailable is due to one of the following: max size of shared memory segments, disk cache reserves a set amount of RAM, kernel-related space taken for mapping ZFS and various devices.

  17. Jim Zemlin Says:

    We told you ZFS didn’t matter, it’s just a feature, you have to listen, or we’re going to kick you out of the community. On second thought, if you use ZFS, the Linux Foundation will have to sue you.

  18. Erik Ljungstrom Says:

    Hi Don, good post!
    I’ve had some pleasant experiences with ZFS as well, it’s a very decent fs. I’m however keeping a very close eye on btrfs as well since it’s got most of the aspects I like with ZFS and it’s got some very good ideas behind it. It’ll be mighty interesting to battle that out against other filesystems.

  19. Logan Shaw Says:

    Very interesting article. I do have one piece of advice, though. For the love of god, reconsider that “gzip-9” choice! There’s nothing wrong with using the gzip algorithm, but the “-9” is not even close to the sweet spot in the trade-off between compression ratio and CPU usage. “-9” will only get you something like 0.5% better compression than “-7”, and it will use more than double the CPU time.

    Here’s a quick little test you can do as an illustration of what I mean. It uses the regular gzip command line program, but it’s the same algorithm ZFS will be using. If I’m lucky, the formatting won’t get too mangled:

    #! /bin/sh

    cd /tmp || exit 1

    # create a test file to compress
    ( cd /etc && tar cf - . ) > etc.tar

    # try all 9 levels
    for level in 1 2 3 4 5 6 7 8 9
    do
    echo level $level
    time gzip -v -$level /dev/null
    done

    rm etc.tar

    You should see that after you get past about level 4, the compression ratio only improves a little bit, but the CPU usage really goes through the roof.

    Disk space is just not so expensive that this is worth it, even on modern systems with powerful CPUs. I’d set it to gzip-4 or gzip-3 and try that.

  20. Logan Shaw Says:

    WordPress ate my redirection characters and more on the gzip line. The script should be this (hopefully right this time):

    #! /bin/sh

    cd /tmp || exit 1

    # create a test file to compress
    ( cd /etc && tar cf - . ) > etc.tar

    # try all 9 levels
    for level in 1 2 3 4 5 6 7 8 9
    do
    echo level $level
    time gzip -v -$level < etc.tar > /dev/null
    done

    rm etc.tar

  21. Ben Scherrey Says:

    Great article and nice job leading the way on this. ZFS & Dtrace are definitely where we need to go for our database servers. Plan is Postgres on Nexenta but we may need to go with plain OpenSolaris for the same reasons you suggested. Glad to hear ZFS is giving you the performance you’re needing and hope you’ll continue to post. Hopefully I’ll have some good news in a few weeks.

  22. Gil Megidish Says:

    Solaris is the shizz. Don, thanks for opening up so much information. It’s really helpful. I haven’t used Solaris for so many years now, and it kicks linux so hard. I want dtrace :(

  23. Don MacAskill Says:

    @Brian Aker:

    The “little bit of code” was simply removing an ORDER BY on the affected SELECTs and instead sorting on the client – something I prefer to do whenever possible anyway. 5.1 was choosing the wrong (imho) index to use with the ORDER BY, but was fine without it.

  24. Don MacAskill Says:

    @D. Price:

    Awesome! So glad you guys are working on the problem. :)

    It’s not that I’m not fond of ‘prstat’ – I am. I have a ‘prstat’ running constantly in one of my other screens, too (along with mpstat, iostat, and vmstat) But I really really like having a single screen I can quickly glance at and see CPU stats, RAM stats, etc. On that particular screen, I’m not nearly as interested in process stats other than a really quick overview. I view ‘top’ as a really handy quick overview. Then if something is amiss or I don’t understand something, I use the other more detailed tools. Make sense?

  25. Don MacAskill Says:

    @William Hathaway:

    Yep, playing around with limiting the ARC is on the todo list. Thanks! :)

  26. Don MacAskill Says:

    @Logan Shaw:

    Yep, testing the various gzip levels (and the other possible ZFS compression options) is on the todo list as well. I just wanted to do something quick & dirty for now.

    FYI, even with gzip-9, we usually have tons of free CPU available, so I don’t believe it’s impacting system performance significantly.

  27. neoTactics » More ZFS on EC2 Says:

    [...] like Don MacAskill discovered the joys of MySQL on ZFS running on [...]

  28. Randy Bias Says:

    Don, I’ve been running ZFS in production for almost 2 years now. I recommend LZJB over any of the GZIPs. You’ll find the compression is still roughly 2x, but you’ll get back some CPU, which will be useful while running scrubs, which are extremely CPU intensive. I’ve tried pretty much every setting of compression and there usually isn’t a significant difference between LZJB and GZIP for *most* (but not all) data types. It’s usually within 10%.

  29. Silveira Neto » Blog Archive » SmugMug experience with OpenSolaris + ZFS + MySQL Says:

    [...] a look in this interesting post of Dan MacAskill, CEO of SmugMug, about his experiences on OpenSolaris servers with ZFS and [...]

  30. jd Says:

    Sounds like OpenSolaris hasn’t come very far along then. The goal of supplying gnu binaries and a real linux-alike environment is spot-on, what’s taking so bloody long to get it right? I guess my shop will be stuck doing the tedious “installing 101 sunfreeware packages to make Solaris 10 even remotely usable, then another 101 to have a reasonable dev/build environment” routine after each new rollout for a long time to come. War and Peace is probably shorter than our tasklist for this :(

  31. Joerg M. Says:

    1. Your ssh issue looks strange … I haven’t had this problem so far on my MySQL systems.
    2. Did you try /usr/ucb/bin/ps ?

  32. Jeffrey M. Hunter Says:

    Don. Excellent article. I am a Sr. Database Administrator but almost exclusively Oracle and SQL Server. I have been waiting with bated breath for Oracle to release their latest version Oracle 11gR1 on the Sun Solaris x86_64 (64-bit) Version 10. From their certification site, it shows the status as “pending”. Again, fantastic article!

  33. Don MacAskill Says:

    @Joerg M:

    $ ls /usr/ucb/bin
    /usr/ucb/bin: No such file or directory

    No dice. Any other ideas?

  34. Von Freud » Blog Archive » MySQL 5.1+ ZFS Says:

    [...] all the InnoDB geeks out there, there is a very good article about running OpenSolaris + ZFS + MySQL, in production! Lot’s of technical [...]

  35. Jason Says:

    That should be /usr/ucb/ps (no /bin).

    One big difference between Solaris and Linux is that Solaris supports a lot of different standards for commands. In some cases these might conflict. The solution is that the different versions of these utilities reside in separate directories, and you choose $PATH to give the behavior you desire (GNU utilities in /usr/gnu/bin, various posix revisions in /usr/xpg4 and /usr/xpg5, etc.). In this instance, being more used to the BSD style ps than SysV ps, you want the ps in /usr/ucb (you’ll see other utilities that behave like BSD versions in there as well).

  36. codestr0m Says:

    Great post and very interesting! Small nit about the OpenSolaris userland comments.. I’ve been an avid Linux user/developer/admin for quite a few years and made the switch like you.. While I must say it takes some getting used to the spartan approach to bringing your own shell rc scripts, vimrc and adjusting paths it’s really quite trivial if you know what you’re doing and don’t mind spending an extra minute here and there to take a look at the rosetta stone for these things. Just a thought since OpenSolaris makes a *great* developer box as well, but just wait until IPS breaks something when you reboot ;)

  37. Don MacAskill Says:

    @codestr0m:

    See, you nailed it right on the head. Time is absolutely our very most precious commodity (I feel like our business is years behind where we really should be right now), so a minute here or a minute there (multiplied by the # of developers) is actually a serious problem.

    Our goal with our systems is to automate them so much that they don’t suck up *any* time, and so natural for our developers that they can code without mucking around with system settings.

  38. Kent Says:

    Hi Don, great article. We use OpenSolaris at Joyent as the basis for our Cloud computing IaaS. So, it’s really cool to see this article. We pretty much replaced the packaging system with pkgsrc. The results have been good and now we get comments more like, “OK, so as of right now I have pretty much copied over my environment. The new template is so fully configured that pretty much everything that had been causing me issues is now just installed by default.” To do this we maintain a fairly large pkgsrc repo for our clients. Also we’ve heavily modified the userland. We find that these changes we’ve made along the way have made it a lot easier and usable for people when moving to Solaris. We have a long way to go still but the power of DTrace, SMF, ZFS, stability, and scalability have been great.

  39. SmugBlog: Don MacAskill » Blog Archive » ZFS & MySQL/InnoDB Compression Update Says:

    [...] « Success with OpenSolaris + ZFS + MySQL in production! [...]

  40. Don MacAskill Says:

    There are more compression details in my follow-up: http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/

  41. Ann E. Mouse Says:

    Did I say something wrong? Or was my comment not substantial enough to include?

    Thanks, I will have some more substantive after I have time to go back and review some of my configs. I enjoy reading your blog and share your pain seeing the stupidity of Canon USA – ouch.

  42. Don MacAskill Says:

    @Ann E. Mouse:

    Sorry, but I don’t see a comment from you – either posted or in the moderation queue. Would you mind re-posting it? Thanks!

  43. Ann E. Mouse Says:

    Don, thanks, sorry I thought for sure it went through but got rejected. Can you delete these two and I’ll repost.

  44. Jamie Says:

    So you got Solaris working with a md3000, with or without multipathing? Which HBAs? With AVT or without? I’d love to see some more details on that setup.

  45. Don MacAskill Says:

    @Jamie:

    You know, it didn’t even occur to me that it might not work. Haha. We’re just using a LSI SAS HBA. No multipathing. Dunno what AVT even is (or just don’t recognize the acronym). We just plugged it in and turned it on and it worked. *shrug*

    I’m so used to modern OSes just doing the right thing with device drivers I didn’t even bother to check compatibility first.

  46. UX-admin Says:

    “For a Linux geek, OpenSolaris userland is still painful.”

    Of course it is, since you don’t know / understand System V yet. Learn it. Use it. Love it!

  47. UX-admin Says:

    “$ ls /usr/ucb/bin
    /usr/ucb/bin: No such file or directory

    No dice. Any other ideas?”

    /usr/ucb is SunOS 4.x BSD compatibility. Other than being there for really old BSD scripts from SunOS 4.x days, it had no business being on a modern System V system. That’s why you don’t have it. And most knowledgeable Solaris sysadmins (and ALL system engineers) will purposely leave it out when churning their own Solaris builds.

  48. Don MacAskill Says:

    @UX-admin:

    I don’t have the time to learn SysV. I have a business to run and millions of users to support – and Linux dominates now. Solaris needs to adapt or it won’t survive – not the other way around. It’s sad but true.

  49. Jamie Says:

    OK cause I’ve been trying (and failing) to make Solaris x86 10 update 5 work with a SAS-attached MD3000 including multipathing and failover, and haven’t had any luck without enabling Automatic Volume Transfer, which I’m reluctant to rely on (groundless suspicion on my part really). It appears that the necessary bits (particularly /kernel/misc/scsi_vhci/scsi_vhci_f_asym_lsi) are only part of more recent vintages (OpenSolaris has it from the looks of things).

  50. Links for 12 Oct 2008 - 14 Oct 2008 :: Col’s Tech Stuff Says:

    [...] Success with OpenSolaris + ZFS + MySQL in production! – Woohooo!!!! Another happy OpenSolaris + ZFS customer. Nice one Don. [...]

  51. Alex Says:

    I’ve been interested in ZFS and OpenSolaris for quite a while. Whilst some of the things it offers are great, most of your post seems to confirm one thing – it’s a bit of a hassle.

    Are the benefits of ZFS worth it? Is there a decent performance increase to justify it all? At the moment it all sounds like you’re doing a lot of beta testing with OpenSolaris!

  52. Greg Says:

    As Jason suggests above, I just alias /usr/ucb/ps to ps.
    I.E. in .bashrc
    alias ps='/usr/ucb/ps -auxwww'

  53. marty duffy Says:

    On the problem with getting the package names, I believe there is a bundle, “amp-dev”, that will load Apache 2, MySQL, and PHP on OpenSolaris with the command “pkg install amp-dev”.

  54. Unix Emaxer Says:

    JD,

    It seems you don’t have your Solaris build environment put together very well. We use a lot of
    FOSS tools here at work, 100s if not more and all are installed painlessly and effortlessly
    thanks to a well put together jumpstart environment.

    If you are still building Solaris systems with CDs or a basic Jumpstart server, you’re not much of
    an SA, or if you’re not in charge, the guys building systems for you are doing a piss poor job if you
    have to trudge through 100s of FOSS packages each time you build a Solaris system.

    ’nuff said….

    UE

  55. Log Buffer #119: a Carnival of the Vanities for DBAs Says:

    [...] The Google/Percona/OurDelta patches make another appearance on SmugBlog, where Don MacAskill gives a detailed report of his success with OpenSolaris + ZFS + MySQL in production. [...]

  56. This Week on MA.GNOLIA « /home/kOoLiNuS Says:

    [...] SmugBlog: Don MacAskill » Blog Archive » Success with OpenSolaris + ZFS + MySQL in production! [...]

  57. Evan Says:

    Don

    Thanks for this. Not to channel Hillary Clinton, but ‘it takes a community’! And we’re all going to get ZFS into the hands of millions of users thanks to useful commentary like this that adds to the critical mass of our community.

    ALSO, in one of your responses to comments you say ‘time is our most precious commodity.’ Amen. I’m biased as making ZFS easy to use is our #1 focus (and storage for VMware is next). But I really think tech projects and companies would be much better off if we spent a little more time on ‘it just works’ vs. ‘you can get it to do really cool things if you just spend a day or two on it.’

    In your post you say nice stuff about the NexentaOS and suggest that absent support it isn’t a viable alternative for you. We’re discussing partnering with serious support organizations to deliver support to NexentaOS (nexenta.org) users and hope to be able to announce something by the end of the year. Folks can contact me at evan at nexenta.com if they’d like to discuss.

  58. Marty Says:

    Put /usr/ucb early in your path and all your usual BSD stuff will work.

    export PATH=/usr/ucb:$PATH

  59. Kenneth Lareau Says:

    Just a note, as people keep saying ‘add /usr/ucb to your path’ – the default OpenSolaris install does NOT include most of the files in /usr/ucb (I know, as I did a fresh install from b98 just a short while ago); you need to do ‘pkg install SUNWscp’ and then you should find some of the common BSD-based versions of various programs there (including ‘ps’).

  60. links for 2008-10-28 « Bloggitation Says:

    [...] SmugBlog: Success with OpenSolaris + ZFS + MySQL in production! (tags: zfs solaris mysql database sysadmin tuning) Possibly related posts: (automatically generated)links for 2008-10-17 [...]

  61. Serge Says:

    Very informative.
    I am deeply glad that Sun purchased MySQL. Just wait a bit until we get 5.1; the road will be paved with better, enterprise-ready software. MySQL isn’t yet ready to compete with Oracle.

    I would like to see more VPS offers using ZFS and OpenSolaris.

  62. Blake Says:

    ALCON:

    I have a ZFS question, perhaps someone can help me out.

    I installed Solaris 10 on a sparc based machine onto a zfs root pool (single drive) of 136GB

    I have 3 other drives that total 136GB also, and I’d like to now set up a mirror of the root zfs partition using the 3 drives totaling 136GB to mirror the one 136GB HD. Any ideas?

    Blake

  63. syngshin.com » Blog Archive » SmugBlog: Success with OpenSolaris + ZFS + MySQL in production! Says:

    [...] by ropiku to programming [link] [12 [...]

  64. khash Says:

    Running production with MySQL and ZFS? Of course!

    We are running MySQL 5.1.26-rc (!) successfully for 6 months now on a production (!) machine of type
    SPARC Enterprise T5220. The company is a 5000 people company and is among the top 5 of Germany’s IT Service providers.

    The JSP application (SGMJ, consisting of 2 Glassfish services, two MySQL database services) is used by approximately 100 users every day and is mission critical. All services are integrated with SMF.

    The app is running in a Solaris Zone, so it is totally isolated from currently 20 other heavily used virtual OS environments on this server. The performance of this web app is overwhelming.

    All this has been built at zero cost for licences. And even better: the whole application is using less than 0.1% to 0.2% of the available CPU resources of the machine. So we get all this at nearly no cost. I doubt if any other architecture would compete in these 2 categories (cost-effectiveness and performance).

    In other words we can expect that even 1000 of these applications on the same server would perform considerably well.
    Unfortunately our customers need less than 1000 web applications. Currently only around 300. So using the above architecture we could expect to be using a third of this server for all of these ;-)

    Instead, however, these other 300 applications are currently running on 600 or more Linux machines using JBosses, Weblogics, and another few hundred for Oracle DBs and so on. Costing a fortune of several million dollars a year for energy, cooling and space. All this, due to the lack of a supported virtualisation (like Zones), ZFS on Linux and due to the once modern (now old) idea of running all and everything on Intel servers.

    Cheers
    Karsten

  65. BigRussia Says:

    I like your posts, they make you think :)

  66. Jan Holtzhausen Says:

    FYI
    Unscientific, stickit in excel and lookit graph indicates that -3 and -7 are your sweet spots

  67. jonny rocket Says:

    open sol is nice. real nice. but it uses too much memory.

  68. Eric Says:

    The fact that you require the GNU utilities shows that you are just expecting everything to be nice and like you like it, but in fact, every single part of Solaris/SunOS/OpenSolaris is far, far better than anything GNU has ever put out.

    The GNU people have been trying to make a bad clone of Solaris for decades.

  69. yamil Says:

    Problems with USB modems on OpenSolaris: does anyone know anything about that? Thanks.

  70. kevin Says:

    It's all gabber-wockey to me, yet I've been in IT for twenty years.

  71. SYSTEMHELDEN.COM /* HELDENFunk Says:

    [...] CEO and Chief Geek Don MacAskill recently tried out OpenSolaris with ZFS and MySQL, and loves it. Due to popular demand, he promptly followed up with compression benchmarks [...]

  72. Tim Says:

    As a Linux user, you may not like the Solaris userspace commands, but we Solaris users loathe the GNU ones.

  73. SmugBlog: Don MacAskill » Blog Archive » Great things afoot in the MySQL community Says:

    [...] to these boxes (let alone reads). 10000+ write IOPS to 10TB of mirrored, crazy durable (thanks ZFS!) storage is a dream come true. Once you mix in snapshots, clones, replication, and Analytics – [...]

  74. Andre Says:

    "Dell MD3000 w/15 x 15K SCSI disks and mirrored 512MB battery-backed write caches (these are really starting to piss us off, but that’s another post…)"

    Any brief comments about the issues you were having with those MD3000s please? Thanks in advance!

  75. David Halko Says:

    When running as "root", the /usr/ucb/ps is nice since it will give you VERY LONG command lines, that are (unfortunately) unavailable using the /usr/bin/ps

    $ ps -eo user,pid,stime,args | nawk '!/nawk/ && /ivserver/ && Count<1 { Count+=1 ; print }'
    root 16599 Dec_31 /opt/InfoVista/Essentials/bin/ivserver -m /opt/InfoVista/Essentials/data/manage

    $ /usr/ucb/ps -auxww | nawk '!/nawk/ && /ivserver/ && Count<1 { Count+=1 ; print }'
    root 16599 0.5 1.4462144427136 ? S Dec 31 522:01 /opt/InfoVista/Essentials/bin/ivserver -m /opt/InfoVista/Essentials/data/manage

    # /usr/ucb/ps -auxww | nawk '!/nawk/ && /ivserver/ && Count<1 { Count+=1 ; print }'
    root 16599 0.5 1.3464640417320 ? S Dec 31 522:02 /opt/InfoVista/Essentials/bin/ivserver -m /opt/InfoVista/Essentials/data/manager_iv2.db -c /opt/InfoVista/Essentials/data/collector_iv2.db -print /opt/InfoVista/Essentials/log/collector_iv2.log -ep 42119 -lrip 0.0.0.0 -lrport 0 -rc /opt/InfoVista/Essentials/init/InfoVista_iv2

    I used the ucb version of PS since Solaris 1.x days – I wish Solaris 2.x would fix this so as to allow the super-wide output available under the ucb version!

    Until Solaris supports an equivalent of "ww" on the output line, there will always be a need for /usr/ucb/ps

  76. David Halko Says:

    I sympathize greatly.

    Having made the persona migration from BSD to SVR4 when moving to another vendor's UNIX platform, moving back to Solaris has been VERY nice, since I have the option of enjoying the best of both (BSD & SVR4) worlds now.

  77. David Halko Says:

    No joke!

    I want OpenSolaris (with X) on an appliance with 128 Meg of RAM (no, I can't swap memory chips!)… like the original release of Solaris 10.

  78. Eddy Says:

    I'd also like to hear more about the issues with the MD3000.

  79. Jon Says:

    That is a pretty awesome setup. I thought that ext4 and LVM 2 were flexible, but it sounds like they're nothing compared to ZFS. I use RAID for personal use, mostly just backups and media storage. Just wondering how easy it is to set up a ZFS solution for personal use? If anybody is interested, I have a tutorial showing how to set up RAID 0, 1, 5, 6 or 10 on Linux with a GUI. This is a nice tutorial for novices to the open source world.

  80. ronaldo Says:

    mmm…thanks..

  81. chetanM Says:

    Cool. Is it possible to launch an EC2 AMI of OpenSolaris with MySQL master + slave?

  82. iphone ringtonemaker Says:

    Canon FTL. I'll never buy another one of their cameras. This seals it. Peace.