<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Optimizing Feature Extraction</title>
	<atom:link href="http://lingpipe-blog.com/2008/10/09/optimizing-feature-extraction/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2008/10/09/optimizing-feature-extraction/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Sat, 04 Feb 2012 20:56:48 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: uClassify blog &#187; More memory or smaller memories?</title>
		<link>http://lingpipe-blog.com/2008/10/09/optimizing-feature-extraction/#comment-16142</link>
		<dc:creator><![CDATA[uClassify blog &#187; More memory or smaller memories?]]></dc:creator>
		<pubDate>Sun, 16 Oct 2011 09:36:31 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=219#comment-16142</guid>
		<description><![CDATA[[...] A similar optimization for Java is described on the LingPipe blog. [...]]]></description>
		<content:encoded><![CDATA[<p>[...] A similar optimization for Java is described on the LingPipe blog. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>http://lingpipe-blog.com/2008/10/09/optimizing-feature-extraction/#comment-2937</link>
		<dc:creator><![CDATA[Jon]]></dc:creator>
		<pubDate>Sat, 18 Oct 2008 23:58:41 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=219#comment-2937</guid>
		<description><![CDATA[Thanks for the good explanation!]]></description>
		<content:encoded><![CDATA[<p>Thanks for the good explanation!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2008/10/09/optimizing-feature-extraction/#comment-2907</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Mon, 13 Oct 2008 18:04:53 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=219#comment-2907</guid>
		<description><![CDATA[In Java, Map is an interface, so we have to talk about specific implementations, like TreeMap and HashMap.

Objects are the real killer, though.  Every object stores a reference (4-byte integer) as well as a pointer to the class itself (also a 4-byte integer, I think), so you&#039;re in for 8 bytes/object minimum.  Strings store an underlying array of characters (itself an object with an associated pointer), as well as integer start/length, so you&#039;re in for at least 24 bytes with an empty string.

Then if you look at HashMap, it&#039;s structured as an array of Map.Entry implementations.  HashMap implements Map.Entry as linked-list nodes, each with a generic key, generic value, next pointer, and integer hash, so that&#039;s 16 bytes plus object overhead, or about 24 bytes per entry.

I think this can actually be worse in that some implementations store a 4-byte synchronization key on every object.

And it&#039;s not just size: all the hash lookups are extremely slow compared to the basic multiply/add operation of a linear classifier.

It&#039;s possible to make smaller map implementations, as I do implicitly with tries in the lm package and with the util.SmallObjectToDouble map class in LingPipe.  (SmallSet is easier to understand and more general if anyone&#039;s looking for examples.)]]></description>
		<content:encoded><![CDATA[<p>In Java, Map is an interface, so we have to talk about specific implementations, like TreeMap and HashMap.</p>
<p>Objects are the real killer, though.  Every object stores a reference (4-byte integer) as well as a pointer to the class itself (also a 4-byte integer, I think), so you&#8217;re in for 8 bytes/object minimum.  Strings store an underlying array of characters (itself an object with an associated pointer), as well as integer start/length, so you&#8217;re in for at least 24 bytes with an empty string.</p>
<p>Then if you look at HashMap, it&#8217;s structured as an array of Map.Entry implementations.  HashMap implements Map.Entry as linked-list nodes, each with a generic key, generic value, next pointer, and integer hash, so that&#8217;s 16 bytes plus object overhead, or about 24 bytes per entry.</p>
<p>I think this can actually be worse in that some implementations store a 4-byte synchronization key on every object.</p>
<p>And it&#8217;s not just size: all the hash lookups are extremely slow compared to the basic multiply/add operation of a linear classifier.</p>
<p>It&#8217;s possible to make smaller map implementations, as I do implicitly with tries in the lm package and with the util.SmallObjectToDouble map class in LingPipe.  (SmallSet is easier to understand and more general if anyone&#8217;s looking for examples.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>http://lingpipe-blog.com/2008/10/09/optimizing-feature-extraction/#comment-2902</link>
		<dc:creator><![CDATA[Jon]]></dc:creator>
		<pubDate>Sun, 12 Oct 2008 15:02:28 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=219#comment-2902</guid>
		<description><![CDATA[We had to do a similar optimization when we built our classification server. In C++, each STL map entry has an overhead of 28 bytes if I recall correctly, and strings cost 4 bytes in the best case and 20 bytes in the worst case (strings longer than 16 characters). Using maps and strings for features was not possible for large frequency distributions. Using an intermediate map for training, which is &quot;compacted&quot; into raw memory (where we can do binary search) every now and then, lowered memory consumption a lot. Do you know how much overhead maps have in Java?

Nice blog, keep up the good work! /Jon]]></description>
		<content:encoded><![CDATA[<p>We had to do a similar optimization when we built our classification server. In C++, each STL map entry has an overhead of 28 bytes if I recall correctly, and strings cost 4 bytes in the best case and 20 bytes in the worst case (strings longer than 16 characters). Using maps and strings for features was not possible for large frequency distributions. Using an intermediate map for training, which is &#8220;compacted&#8221; into raw memory (where we can do binary search) every now and then, lowered memory consumption a lot. Do you know how much overhead maps have in Java?</p>
<p>Nice blog, keep up the good work! /Jon</p>
]]></content:encoded>
	</item>
</channel>
</rss>

