<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: ZFS &amp; MySQL/InnoDB Compression Update</title>
	<atom:link href="http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/</link>
	<description>Thought stream from SmugMug's CEO &#38; Chief Geek</description>
	<lastBuildDate>Fri, 06 Nov 2009 22:21:22 -0800</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9-rare</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Ethan</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-104394</link>
		<dc:creator>Ethan</dc:creator>
		<pubDate>Mon, 12 Oct 2009 18:57:36 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-104394</guid>
		<description>@Guest: Not a limit in either, simply a requirement by Solaris (unlike Linux) to have virtual memory backing memory allocation. </description>
		<content:encoded><![CDATA[<p>@Guest: Not a limit in either, simply a requirement by Solaris (unlike Linux) to have virtual memory backing memory allocation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Guest</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-104391</link>
		<dc:creator>Guest</dc:creator>
		<pubDate>Mon, 12 Oct 2009 02:10:24 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-104391</guid>
		<description>Ethan, 
Is it an allocation limit in Solaris, or in InnoDB? </description>
		<content:encoded><![CDATA[<p>Ethan,<br />
Is it an allocation limit in Solaris, or in InnoDB?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ethan</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-104232</link>
		<dc:creator>Ethan</dc:creator>
		<pubDate>Mon, 30 Mar 2009 17:14:11 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-104232</guid>
		<description>I know this is an old posting, but did you settle on using lzjb or gzip in production for InnoDB data files?  Also, we ran into memory allocation limits in Solaris as well, and it turned out we just didn&#039;t have enough swap.  We found we needed swap equal to roughly 50% of the memory allocated.  So for a 48G system with a 40G innodb_buffer_pool, we needed about 18-20G of swap. </description>
		<content:encoded><![CDATA[<p>I know this is an old posting, but did you settle on using lzjb or gzip in production for InnoDB data files?  Also, we ran into memory allocation limits in Solaris as well, and it turned out we just didn&#039;t have enough swap.  We found we needed swap equal to roughly 50% of the memory allocated.  So for a 48G system with a 40G innodb_buffer_pool, we needed about 18-20G of swap.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: roland</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-103764</link>
		<dc:creator>roland</dc:creator>
		<pubDate>Sat, 20 Dec 2008 04:29:53 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-103764</guid>
		<description>&gt;Someone really needs to find a (legal) way to use lzo compression with zfs, this is imho the real contender  
 
Yes, indeed.  
 
I don&#039;t know why this has not yet attracted Sun developers.  
Regarding licensing, I think there is a way to proceed, as the author may (perhaps) give explicit permission to use lzo with zfs. There is even a commercial lzo implementation with some optimizations. zfs+lzo would be a &quot;win-win&quot; for Sun and for the lzo author. 
 
There is a patch for zfs-fuse that enables lzo compression (for demonstration purposes only), and the results are really interesting, as lzo can give better performance and better compression than lzjb - see &lt;a href=&quot;http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg08219.html&quot; target=&quot;_blank&quot;&gt;http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg08219.html&lt;/a&gt;
 </description>
		<content:encoded><![CDATA[<p>&gt;Someone really needs to find a (legal) way to use lzo compression with zfs, this is imho the real contender  </p>
<p>Yes, indeed.  </p>
<p>I don&#8217;t know why this has not yet attracted Sun developers.<br />
Regarding licensing, I think there is a way to proceed, as the author may (perhaps) give explicit permission to use lzo with zfs. There is even a commercial lzo implementation with some optimizations. zfs+lzo would be a &quot;win-win&quot; for Sun and for the lzo author. </p>
<p>There is a patch for zfs-fuse that enables lzo compression (for demonstration purposes only), and the results are really interesting, as lzo can give better performance and better compression than lzjb &#8211; see <a href="http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg08219.html" target="_blank">http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg08219.html</a>.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: SYSTEMHELDEN.COM /* HELDENFunk</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-103701</link>
		<dc:creator>SYSTEMHELDEN.COM /* HELDENFunk</dc:creator>
		<pubDate>Mon, 08 Dec 2008 17:53:12 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-103701</guid>
		<description>[...] SmugMug&#8217;s CEO and Chief Geek Don MacAskill recently tried out OpenSolaris with ZFS and MySQL - and loves it. Due to popular demand, he promptly followed up with compression benchmarks. [...]</description>
		<content:encoded><![CDATA[<p>[...] SmugMug&#8217;s CEO and Chief Geek Don MacAskill recently tried out OpenSolaris with ZFS and MySQL &#8211; and loves it. Due to popular demand, he promptly followed up with compression benchmarks. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Scott C.</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-103551</link>
		<dc:creator>Scott C.</dc:creator>
		<pubDate>Tue, 28 Oct 2008 19:04:38 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-103551</guid>
		<description>Since ZFS compresses only one block at a time, the gzip blocks and tree structure discard mentioned by Jacques above do not apply.
For 16K blocks, what happens is a 16K data block (uncompressed size) is compressed with lzjb or gzip into a new block of smaller size, and stored on disk in the smaller size (which has to be a binary size, IIRC -- 8k, 4k, 2k, etc) and like all blocks in ZFS, there is a checksum elsewhere to make sure the data is ok.

LZJB, as mentioned by Jacques, is an LZ77-style algorithm.  It is supposedly faster and a bit better at compression than both the original and the lzo suggested by Marc above.  Either way, it&#039;s basically the same class of small-window streaming compression algorithm.

Gzip is also a streaming compression algorithm, and as mentioned above is basically just LZ77 followed immediately by Huffman encoding.  Most of the time, a default Huffman encoder is used and the dictionary isn&#039;t built up by analyzing the data first.  Gzip has a default dictionary that is used for almost all streams to avoid this step (or more correctly, deflate does).  If a dictionary is built, it is done by analyzing a smaller chunk.  Either way for ZFS, this chunk size is at most the size of the file block.

Deflate and Gzip are RFC 1951 and RFC 1952, respectively.  RFC 1950 is the related zlib wrapper format commonly used in conjunction with deflate. (Just search for RFC 1952 to find more about gzip than you ever want to know.)

For info on LZJB, look at the source code in zfs :D  It is well commented from what I recall.</description>
		<content:encoded><![CDATA[<p>Since ZFS compresses only one block at a time, the gzip blocks and tree structure discard mentioned by Jacques above do not apply.<br />
For 16K blocks, what happens is a 16K data block (uncompressed size) is compressed with lzjb or gzip into a new block of smaller size, and stored on disk in the smaller size (which has to be a binary size, IIRC &#8212; 8k, 4k, 2k, etc) and like all blocks in ZFS, there is a checksum elsewhere to make sure the data is ok.</p>
<p>LZJB, as mentioned by Jacques, is an LZ77-style algorithm.  It is supposedly faster and a bit better at compression than both the original and the lzo suggested by Marc above.  Either way, it&#8217;s basically the same class of small-window streaming compression algorithm.</p>
<p>Gzip is also a streaming compression algorithm, and as mentioned above is basically just LZ77 followed immediately by Huffman encoding.  Most of the time, a default Huffman encoder is used and the dictionary isn&#8217;t built up by analyzing the data first.  Gzip has a default dictionary that is used for almost all streams to avoid this step (or more correctly, deflate does).  If a dictionary is built, it is done by analyzing a smaller chunk.  Either way for ZFS, this chunk size is at most the size of the file block.</p>
<p>Deflate and Gzip are RFC 1951 and RFC 1952, respectively.  RFC 1950 is the related zlib wrapper format commonly used in conjunction with deflate. (Just search for RFC 1952 to find more about gzip than you ever want to know.)</p>
<p>For info on LZJB, look at the source code in zfs <img src='http://blogs.smugmug.com/don/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' />   It is well commented from what I recall.</p>
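<p>To make the per-block behaviour described above concrete, here is a minimal Python sketch, not ZFS code. It assumes a 16K block size and uses Python&#8217;s zlib module (deflate in the RFC 1950 wrapper) as a stand-in for the gzip setting; the function names and the sample data are hypothetical. Each block is compressed on its own, so no compressor state carries over from one block to the next.</p>

```python
import zlib

BLOCK_SIZE = 16 * 1024  # assumed 16K record, matching the InnoDB page size discussed here


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE, level: int = 6) -> list:
    """Compress each fixed-size block independently; nothing is shared between blocks."""
    blocks = []
    for offset in range(0, len(data), block_size):
        # zlib.compress starts a fresh deflate stream for every call.
        blocks.append(zlib.compress(data[offset:offset + block_size], level))
    return blocks


def decompress_blocks(blocks: list) -> bytes:
    """Any block can be decompressed without touching its neighbours."""
    return b"".join(zlib.decompress(block) for block in blocks)


sample = b"InnoDB page payload " * 4096  # hypothetical, highly repetitive data
packed = compress_blocks(sample)
assert decompress_blocks(packed) == sample
print(len(sample), "bytes in,", sum(len(b) for b in packed), "bytes out in", len(packed), "blocks")
```

<p>Because every block is a separate stream, a random read only has to decompress the one block it touches, which is the property that matters for database pages.</p>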
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacques Chester</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-103531</link>
		<dc:creator>Jacques Chester</dc:creator>
		<pubDate>Wed, 22 Oct 2008 06:02:49 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-103531</guid>
		<description>Just to follow up my previous remark.

LZJB seems to follow the basic scheme in Lempel-Ziv 77 (LZRW1 actually) with some modifications. It has a moving window through which data streams. When an item comes in, the algorithm checks to see if it&#039;s been seen. If it has, the element is replaced with a pointer to an earlier instance in the window. Otherwise it&#039;s added to a dictionary of such addresses and the algorithm moves to the next item.

This is ideal for filesystem compression for two reasons:

1. You don&#039;t need to store header data for the decompressor. All the data needed to decompress the data is already in situ in the data.
2. You don&#039;t need to see all the data to begin compressing. As soon as you have data streaming in, you can begin to compress it.

By contrast gzip uses an algorithm called Deflate, which is actually two algorithms run in series. The first step is a Lempel-Ziv 77 pass like the one in LZJB, followed by a second pass of Huffman coding, where frequently occurring symbols are replaced with shorter codes. This involves a tree structure being built and then discarded every 64k or so, on top of the pass already made by the Lempel-Ziv algorithm.

I think that accounts for the observed differences in performance.</description>
		<content:encoded><![CDATA[<p>Just to follow up my previous remark.</p>
<p>LZJB seems to follow the basic scheme in Lempel-Ziv 77 (LZRW1 actually) with some modifications. It has a moving window through which data streams. When an item comes in, the algorithm checks to see if it&#8217;s been seen. If it has, the element is replaced with a pointer to an earlier instance in the window. Otherwise it&#8217;s added to a dictionary of such addresses and the algorithm moves to the next item.</p>
<p>This is ideal for filesystem compression for two reasons:</p>
<p>1. You don&#8217;t need to store header data for the decompressor. All the data needed to decompress the data is already in situ in the data.<br />
2. You don&#8217;t need to see all the data to begin compressing. As soon as you have data streaming in, you can begin to compress it.</p>
<p>By contrast gzip uses an algorithm called Deflate, which is actually two algorithms run in series. The first step is a Lempel-Ziv 77 pass like the one in LZJB, followed by a second pass of Huffman coding, where frequently occurring symbols are replaced with shorter codes. This involves a tree structure being built and then discarded every 64k or so, on top of the pass already made by the Lempel-Ziv algorithm.</p>
<p>I think that accounts for the observed differences in performance.</p>
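<p>The sliding-window idea can be sketched in a few lines of Python. This is a toy LZ77, not LZJB: it searches the window by brute force where a real implementation would use a hash table, and the window size, match limits, and token format are all assumptions chosen for illustration.</p>

```python
def lz77_compress(data: bytes, window: int = 1024, min_match: int = 3, max_match: int = 18) -> list:
    """Toy LZ77: emit literal byte values or (distance, length) back-references."""
    tokens = []
    pos = 0
    while pos != len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the window for the longest earlier match.
        for cand in range(max(0, pos - window), pos):
            length = 0
            while (length != max_match and pos + length != len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= min_match:
            tokens.append((best_dist, best_len))  # pointer back into the window
            pos += best_len
        else:
            tokens.append(data[pos])  # literal byte, nothing useful seen before
            pos += 1
    return tokens


def lz77_decompress(tokens: list) -> bytes:
    """Rebuild the data using only what has already been decoded (no header needed)."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # byte by byte, so overlapping copies work
        else:
            out.append(token)
    return bytes(out)


text = b"abcabcabcabc"
tokens = lz77_compress(text)
assert lz77_decompress(tokens) == text
print(tokens)
```

<p>The decompressor needs nothing but the token stream, and the compressor never looks ahead beyond the current match, which illustrates the two points listed above. Deflate would then run a Huffman pass over tokens like these.</p>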
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-103528</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Mon, 20 Oct 2008 15:02:50 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-103528</guid>
		<description>Did you set the ZFS record size to 16kb before doing this? The MySQL docs say the ZFS record size should match the InnoDB page size of 16kb.

http://dev.mysql.com/tech-resources/articles/mysql-zfs.html#Set_the_ZFS_Recordsize_to_match_the_block_size

If not, I would love to see the results with the record size set to 16kb.</description>
		<content:encoded><![CDATA[<p>Did you set the ZFS record size to 16kb before doing this? The MySQL docs say the ZFS record size should match the InnoDB page size of 16kb.</p>
<p><a href="http://dev.mysql.com/tech-resources/articles/mysql-zfs.html#Set_the_ZFS_Recordsize_to_match_the_block_size" rel="nofollow">http://dev.mysql.com/tech-resources/articles/mysql-zfs.html#Set_the_ZFS_Recordsize_to_match_the_block_size</a></p>
<p>If not, I would love to see the results with the record size set to 16kb.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ricardo Correia</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-103524</link>
		<dc:creator>Ricardo Correia</dc:creator>
		<pubDate>Sat, 18 Oct 2008 12:03:44 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-103524</guid>
		<description>Did you run &quot;sync&quot; at the end of each &quot;cp&quot; command and wait for the sync to complete before collecting the elapsed time?

Note that ZFS only compresses data when it&#039;s actually writing to the disks, which typically only happens every 5 seconds given enough RAM. So if you didn&#039;t sync at the end, the first 2 tables are not really useful for compression comparison purposes and the other ones are very suspect :)

Also, if you used sync at the end, the average of the 10 runs would be more representative of the actual performance than only taking the fastest run.

As for the different checksum options, I see no difference in performance when enabling/disabling the default checksum algorithm, but I noticed that in zfs-fuse the SHA1 checksum is way too heavy on the CPUs (especially considering all the context switching going on), but it&#039;s possible that this isn&#039;t so bad in Solaris/OpenSolaris.</description>
		<content:encoded><![CDATA[<p>Did you run &#8220;sync&#8221; at the end of each &#8220;cp&#8221; command and wait for the sync to complete before collecting the elapsed time?</p>
<p>Note that ZFS only compresses data when it&#8217;s actually writing to the disks, which typically only happens every 5 seconds given enough RAM. So if you didn&#8217;t sync at the end, the first 2 tables are not really useful for compression comparison purposes and the other ones are very suspect <img src='http://blogs.smugmug.com/don/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Also, if you used sync at the end, the average of the 10 runs would be more representative of the actual performance than only taking the fastest run.</p>
<p>As for the different checksum options, I see no difference in performance when enabling/disabling the default checksum algorithm, but I noticed that in zfs-fuse the SHA1 checksum is way too heavy on the CPUs (especially considering all the context switching going on), but it&#8217;s possible that this isn&#8217;t so bad in Solaris/OpenSolaris.</p>
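<p>A rough Python sketch of the measurement suggested above: time the copy together with the flush, and average the runs. The function names are hypothetical, os.sync() is only available on Unix-like systems, and whether it blocks until the data is actually on disk depends on the platform, so treat this as an outline rather than a benchmark harness.</p>

```python
import os
import shutil
import time


def timed_copy(src: str, dst: str) -> float:
    """Copy src to dst and return elapsed seconds, including the flush request."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    os.sync()  # ask the OS to write out buffered data before the clock stops
    return time.perf_counter() - start


def average_of_runs(src: str, dst: str, runs: int = 10) -> float:
    """Average several runs rather than reporting only the fastest one."""
    timings = [timed_copy(src, dst) for _ in range(runs)]
    return sum(timings) / len(timings)
```

<p>Calling average_of_runs("table.ibd", "/pool/fs/table.ibd") with paths of your own would give one number per compression setting; both paths here are placeholders.</p>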
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacques Chester</title>
		<link>http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/comment-page-1/#comment-103512</link>
		<dc:creator>Jacques Chester</dc:creator>
		<pubDate>Wed, 15 Oct 2008 05:14:29 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.smugmug.com/don/?p=425#comment-103512</guid>
		<description>I think the difference is that LZJB is a compression algorithm designed for streaming data -- it doesn&#039;t rely on a full scan of the data to perform its compression. The gzip algorithm builds a dictionary first, so that explains the big slowdowns on larger datasets.

Obviously building a dictionary of items for compression purposes lets you do things you can&#039;t do otherwise, which is why gzip is better for total compression ratio.

A good followup test would be to see performance on a random-access basis. What happens with lots of writes and updates? I&#039;m not sure if gzip requires a from-the-top effort or not. I imagine Jeff Bonwick could tell you.

Of course, it could be that I&#039;m an undergraduate who knows not whereof he speaks. YMMV.</description>
		<content:encoded><![CDATA[<p>I think the difference is that LZJB is a compression algorithm designed for streaming data &#8212; it doesn&#8217;t rely on a full scan of the data to perform its compression. The gzip algorithm builds a dictionary first, so that explains the big slowdowns on larger datasets.</p>
<p>Obviously building a dictionary of items for compression purposes lets you do things you can&#8217;t do otherwise, which is why gzip is better for total compression ratio.</p>
<p>A good followup test would be to see performance on a random-access basis. What happens with lots of writes and updates? I&#8217;m not sure if gzip requires a from-the-top effort or not. I imagine Jeff Bonwick could tell you.</p>
<p>Of course, it could be that I&#8217;m an undergraduate who knows not whereof he speaks. YMMV.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
