<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Kevin Dempsey Peterson</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/" />
    <link rel="self" type="application/atom+xml" href="http://kdpeterson.net/blog/atom.xml" />
    <id>tag:kdpeterson.net,2009-04-16:/blog/2</id>
    <updated>2010-07-23T05:03:45Z</updated>
    <subtitle>Software engineer at kaChing by day, treasurer San Mateo County Libertarian Party by night, runner on the weekends.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type 4.25</generator>

<entry>
    <title>Murder Your Darlings</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2010/07/murder-your-darlings.html" />
    <id>tag:kdpeterson.net,2010:/blog//2.129</id>

    <published>2010-07-23T05:02:24Z</published>
    <updated>2010-07-23T05:03:45Z</updated>

    <summary>Lately I&apos;ve been working on connectivity with NASDAQ. The protocols involve parsing fixed-offset messages of various types. We&apos;re not doing high frequency trading so we are optimizing for programmer efficiency -- that is, the API I expose to the rest...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="codequality" label="code quality" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="java" label="java" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>Lately I've been working on connectivity with NASDAQ. The protocols involve parsing fixed-offset messages of various types. We're not doing high frequency trading, so we are optimizing for programmer efficiency -- that is, the API I expose to the rest of the system should make sense, so I'm representing the different types of messages, trading conditions, exchange identifiers and so on as enums. I was working on processing incoming messages, in this case implementing a handler for NASDAQ's SoupTCP protocol. The incoming message has a one-character code which I translate into an enum value. I've seen programmers code this kind of thing using a big switch statement, but that leads to maintainability problems -- when you add an enum value, did you remember to add it to the case statement? Did that case statement get copied and pasted elsewhere? The better solution is to embed that logic in the enum itself using a static map and a factory method.</p>

<pre class="brush: java;tab-size: 2; smart-tabs: true; toolbar: false; gutter: false; first-line:1;">import java.util.HashMap;
import java.util.Map;

enum SoupMessageType {
  LOGIN_REQUEST('L'),
  LOGIN_ACCEPT('A'),
  LOGIN_REJECT('J'),
  DATA('S'),
  LOGOUT_REQUEST('O');

  private char code;

  private SoupMessageType(char code) {
    this.code = code;
  }

  private static final Map&lt;Character, SoupMessageType&gt; map;
  static {
    map = new HashMap&lt;Character, SoupMessageType&gt;(values().length);
    for (SoupMessageType v : values()) {
      map.put(v.code(), v);
    }
  }

  public char code() {
    return code;
  }

  public static SoupMessageType from(char code) {
    return map.get(code);
  }
}</pre>
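<p>A quick usage sketch, not the production handler: the <code>SoupDemo</code> class and the sample packets below are made up for illustration, and the enum is repeated in compact form so the snippet compiles on its own. Note that <code>from</code> returns <code>null</code> for a code that isn't in the map, so the caller has to decide what an unknown message type means.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Compact copy of the enum above so this sketch compiles by itself.
enum SoupMessageType {
  LOGIN_REQUEST('L'), LOGIN_ACCEPT('A'), LOGIN_REJECT('J'), DATA('S'), LOGOUT_REQUEST('O');

  private static final Map<Character, SoupMessageType> map =
      new HashMap<Character, SoupMessageType>();
  static {
    for (SoupMessageType v : values()) {
      map.put(v.code, v);
    }
  }

  private final char code;

  SoupMessageType(char code) {
    this.code = code;
  }

  static SoupMessageType from(char code) {
    return map.get(code); // null when the code is unknown
  }
}

// Hypothetical caller; SoupDemo and describe() exist only for illustration.
class SoupDemo {
  static String describe(String packet) {
    SoupMessageType type = SoupMessageType.from(packet.charAt(0));
    return type == null ? "unknown message type" : type.name();
  }

  public static void main(String[] args) {
    System.out.println(describe("Shello")); // DATA
    System.out.println(describe("?"));      // unknown message type
  }
}
```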

<p>This is a simple pattern, and I found myself copying it from another enum. Since copy and paste is bad, I started looking for how to turn this pattern into an abstraction. First, I'd move the static code block into the constructor for a map-like class:</p>

<pre class="brush: java;tab-size: 2; smart-tabs: true; toolbar: false; gutter: false; first-line:1;">import java.util.HashMap;
import java.util.Map;

public class CodedEnumer&lt;K, E extends Enum&lt;E&gt; &amp; CodedEnum&lt;K&gt;&gt; {
  private Map&lt;K, E&gt; map;
  public CodedEnumer(Class&lt;E&gt; klass) {
    E[] enumConstants = klass.getEnumConstants();
    map = new HashMap&lt;K, E&gt;(enumConstants.length);
    for(E v : enumConstants) {
      map.put(v.code(), v);
    }
  }

  public static &lt;K, V extends Enum&lt;V&gt; &amp; CodedEnum&lt;K&gt;&gt;
    CodedEnumer&lt;K, V&gt; create(Class&lt;V&gt; klass) {
    return new CodedEnumer&lt;K, V&gt;(klass);
  }

  public E get(K key) {
    return map.get(key);
  }
}</pre>

<p>The enum needs to implement a CodedEnum interface with one method.</p>

<pre class="brush: java;tab-size: 2; smart-tabs: true; toolbar: false; gutter: false; first-line:1;">public interface CodedEnum&lt;K&gt; {
  public K code();
}</pre>

<p>My first draft of this included another type parameter E for the enum, and a public E from(K key) method. But of course this method should be static, and declaring a static method in an interface would be meaningless (aside from the other detail of being a compiler error).</p>

<p>Now, rather than copy and paste building the mapping from code to value, the enum needs to implement CodedEnum, create an instance of the CodedEnumer, and use that to implement the one-line static method.</p>

<pre class="brush: java;tab-size: 2; smart-tabs: true; toolbar: false; gutter: false; first-line:1;">enum SoupMessageTypeCoded implements CodedEnum&lt;String&gt; {
  LOGIN_REQUEST("L"),
  LOGIN_ACCEPT("A"),
  LOGIN_REJECT("J"),
  DATA("S"),
  LOGOUT_REQUEST("O");

  private String code;

  private SoupMessageTypeCoded(String code) {
    this.code = code;
  }

  private static final CodedEnumer&lt;String, SoupMessageTypeCoded&gt; map =
    CodedEnumer.create(SoupMessageTypeCoded.class);

  @Override
  public String code() {
    return code;
  }

  public static SoupMessageTypeCoded from(String code) {
    return map.get(code);
  }
}</pre>

<p>Pretty slick, huh? I was pretty pleased with myself when I actually found a use for an intersection type in a generic declaration. This is where the old writer's advice of "murder your darlings" comes into play. It's variously attributed to Fitzgerald, Hemingway or others, but the meaning is that whenever you write a particularly clever turn of phrase, whatever makes you smile at how smart you are, get out the red pencil or delete key, and get rid of it.</p>

<p>On sober reflection, this code sucks. I've created two extra types with complicated generics to save two or three lines of code. Anyone who opens up the class in the future will have to open two more files to understand how it works and what it's doing. So I struck it out, and reverted to the version with those three horrible lines wastefully repeated in each and every enum I use this pattern in. I can only console myself that disk space is getting cheaper.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Inside Automated Sentiment Analysis</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2010/03/sentiment-analysis-on-biz360-blog.html" />
    <id>tag:kdpeterson.net,2010:/blog//2.128</id>

    <published>2010-03-25T23:05:17Z</published>
    <updated>2010-07-23T05:09:37Z</updated>

    <summary>This post details Biz360&apos;s automated sentiment analysis system, including our goals, how the system works, how we measure success, and the ways it can be used and misused. Before getting into the how or why, I want to start with...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="sentiment" label="sentiment" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="work" label="work" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>This post details <a href="http://ci.biz360.com" target="_blank">Biz360</a>'s automated sentiment analysis system, including our goals, how the system works, how we measure success, and the ways it can be used and misused. Before getting into the <em>how</em> or <em>why</em>, I want to start with the <em>what</em>. For our purposes, sentiment is the opinion of the author of an article towards the subject of an article. We classify sentiment into four possible categories.</p> 
<h3>Positive</h3> 
<dl> 
<dd>Arguing <em>for</em> something, saying something is a <em>good</em> product, talking about good things a person or company has done, enjoying something, liking something, preferring something. If a mostly positive post has a small portion that is negative, it is still <em>positive</em>.</dd> 
</dl> 
<h3>Negative</h3> 
<dl> 
<dd>Arguing <em>against</em> something, saying something is a bad product or a bad experience, talking about bad things a person or company has done, disliking or having problems with something. If a mostly negative post has a small portion that is positive, it is still <em>negative</em>.</dd>
</dl> 
<h3>Neutral</h3> 
<dl> 
<dd>If a post doesn't express any opinion, doesn't present anyone or anything in a favorable or unfavorable way, and wouldn't lead someone to form an opinion for or against, it is <em>neutral</em>.</dd>
</dl> 
<h3>Mixed</h3> 
<dl> 
<dd>If a post is both positive and negative, such as saying something was good in some ways but bad in others, or if the post talks about different subjects and is positive toward one subject but negative toward another, then rate the post as <em>mixed</em>.</dd>
</dl> 
<p>The first question is why you need automated sentiment at all. The simple answer is that there's just too much content. As conversations that used to take place over coffee and on street corners move to Twitter and forums, they become trackable. If a magazine with 100,000 readers mentions you in an article, you'll read that article and discuss what it's saying about you. If 10,000 people tell ten of their friends what they think of Kevin Smith vs. Southwest Air, you can't hope to read more than a small sampling. It's this latter use case that we cared about, where the questions look like this:</p>
<ol> 
<li>What portion of my coverage is positive, negative, etc?</li> 
<li>I got a spike in coverage on Monday. Was that spike positive or negative?</li> 
<li>What kinds of things are people saying that are positive? Negative?</li>
</ol> 
<p>We knew from the start that accuracy on the individual article level was never going to be that good. That is, if you want to know what the sentiment for some particular article is, the best thing to do is click on it, read it, and form your own opinion. With the help of <a title="Bill MacCartney's page at Stanford" href="http://nlp.stanford.edu/~wcmac/" target="_blank">Bill MacCartney</a>, an NLP researcher from Stanford, we quickly homed in on the following design parameters:</p>
<ol> 
<li>A statistical classification system using two classifiers to detect positive and negative, and another classifier to combine these results. We would start with a simple <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes classifier</a> and a <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">Decision Tree classifier</a> to get everything working, and experiment with <a title="Wikipedia on Statistical Classifiers" href="http://en.wikipedia.org/wiki/Statistical_classification#Algorithms" target="_blank">more advanced classifiers</a> like the Linear MaxEnt classifier once we had a baseline to measure improvements.</li> 
<li>The system would be trained using lots of data from <a title="Mechanical Turk" href="https://www.mturk.com/" target="_blank">Mechanical Turk</a>. Each item would be rated multiple times so we could throw out the results from raters who didn't understand or were not taking enough care.</li> 
<li>Our training data would be real social media content, drawn from all the types of social media we process (blogs, micro-blogs, etc).</li> 
</ol> 
<p>At a very high level, text classification systems get lumped into groups based on whether they are based on statistical learning from data or on hand-coded rules. Our system is solidly in the statistical camp. We were skeptical that a rule-based system could encompass the wide variety of topics and writing styles and the frequency of ungrammatical or misspelled content on the less formal parts of the Internet.</p>
<p>Our sentiment engine turns each post into a set of features, like ("good", "deal") -&gt; 2, meaning the word "good" followed by the word "deal" occurs twice. This gets fed into a two-stage system. First, everything gets flagged for how positive it is (regardless of also being negative) and for how negative it is (regardless of how positive it is). Next, these get combined into the four categories that are displayed. So high positive sentiment and low negative sentiment would be <em>positive</em>, and high positive <em>and</em> high negative would be <em>mixed</em>.</p>
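<p>To make the two stages concrete, here is a toy sketch in Java. It is not Biz360's code: the class and method names, the whitespace tokenizer, and the threshold parameter are all assumptions for illustration. The two scores are simply passed in rather than produced by trained classifiers, and a fixed threshold stands in for the combining classifier.</p>

```java
import java.util.HashMap;
import java.util.Map;

class SentimentSketch {
  enum Sentiment { POSITIVE, NEGATIVE, NEUTRAL, MIXED }

  // Count adjacent word pairs, e.g. "good deal" -> 2.
  // Lowercasing and splitting on whitespace is a simplification.
  static Map<String, Integer> bigramCounts(String text) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    String[] words = text.toLowerCase().trim().split("\\s+");
    for (int i = 0; i + 1 < words.length; i++) {
      String key = words[i] + " " + words[i + 1];
      Integer old = counts.get(key);
      counts.put(key, old == null ? 1 : old + 1);
    }
    return counts;
  }

  // Second stage: combine independent positive and negative scores
  // (assumed to be in [0, 1]) into one of the four categories.
  static Sentiment combine(double positive, double negative, double threshold) {
    boolean pos = positive >= threshold;
    boolean neg = negative >= threshold;
    if (pos && neg) return Sentiment.MIXED;
    if (pos) return Sentiment.POSITIVE;
    if (neg) return Sentiment.NEGATIVE;
    return Sentiment.NEUTRAL;
  }
}
```

<p>Because the positive and negative scores are kept separate until the last step, a post that scores high on both lands in <em>mixed</em> instead of having the two cancel out.</p>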
<p>We really wanted a <em>mixed</em> category, because in terms of whether it's a post worth reading, someone who is saying both good and bad things about you is even more interesting than positive or negative. Consider the following three clips:</p> 
<ol> 
<li>I love my Kinesis Maxim keyboard, it's the best. My wrists feel great since I've been typing on it.</li> 
<li>Kinesis is stupid, the Maxim has a stupid layout. I had one for a while but I threw it out.</li> 
<li>I like my Kinesis Maxim, but the left <em>alt</em> key is too small and too far to the left.</li> 
</ol> 
<p>Sure, the first one is what you hope everyone is saying, but reading it doesn't provide much value. The second one at least is an opportunity for damage control, but the third one is the real gold. In a system based on just a range from negative through neutral to positive, the positive and negative would cancel out and this kind of thing would get lumped into the neutral bucket.</p>
<p>This kind of statistical system isn't any good without good data, so we used an approach that gives us lots of good data quickly and cheaply. We sent out thousands of clips to Mechanical Turk, Amazon's "artificial artificial intelligence", where they were scored by ten humans each. The instructions they were given were exactly the definitions I gave above. Those aren't just descriptions of what we think the system produces, those are the starting point. When the results came back, the humans didn't always agree, and some agreed more than others. We threw out the raters who looked like they just didn't understand the problem at all or were clicking randomly since payment was per item. Of the remaining items, we still got disagreements, so we took the majority, so that if five people said <em>positive</em>, three said <em>neutral</em> and two said <em>mixed</em>, we'd use that clip as training data for <em>positive</em>. All of our data was real social media data. We evaluated one off-the-shelf solution which was trained on newspaper data, and when it said that "Comcast sucks!" was neutral, we gave up on that idea.</p>
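<p>The majority step can be sketched like this. The names are hypothetical, and tie-breaking isn't covered above, so this sketch just returns <code>null</code> for a tie and leaves the decision to the caller.</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MajorityVote {
  // Returns the label with the most votes, or null if the top count is tied
  // (or the list is empty).
  static String majority(List<String> labels) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String label : labels) {
      Integer old = counts.get(label);
      counts.put(label, old == null ? 1 : old + 1);
    }
    String best = null;
    int bestCount = 0;
    boolean tied = false;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
        tied = false;
      } else if (e.getValue() == bestCount) {
        tied = true;
      }
    }
    return tied ? null : best;
  }

  public static void main(String[] args) {
    // Five positive, three neutral, two mixed, as in the example above.
    List<String> votes = Arrays.asList(
        "positive", "positive", "positive", "positive", "positive",
        "neutral", "neutral", "neutral", "mixed", "mixed");
    System.out.println(majority(votes)); // positive
  }
}
```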
<p>To evaluate our accuracy, we looked at a whole slew of numbers. We used a technique called k-fold cross validation, which means that we'd hold back some of our human-annotated data to use to evaluate how accurate the system is. A big challenge was that most of the content we got was neutral or positive, not mixed or negative. This makes it hard to use simple accuracy as the only metric. That is, if I have 90 items that should be classified as <em>A</em> and 10 items that should be classified as <em>B</em>, I could be 90% accurate by just saying everything was <em>A</em>. So I looked at the accuracy rates for each of the categories separately, and tried to balance them. Given my example with 90 <em>A</em> and 10 <em>B</em>, if I could get 90% accuracy, I'd really prefer 81 out of 90 <em>A</em>s classified correctly and 9 out of the 10 <em>B</em>s.</p>
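<p>The arithmetic behind that example, as a small sketch (class and method names are made up): overall accuracy is the same 90% in both cases, but the per-category rates are very different.</p>

```java
class AccuracySketch {
  // Fraction of all items classified correctly.
  static double overall(int[] correct, int[] total) {
    int c = 0;
    int t = 0;
    for (int i = 0; i < total.length; i++) {
      c += correct[i];
      t += total[i];
    }
    return (double) c / t;
  }

  // Fraction of one category's items classified correctly.
  static double perCategory(int correct, int total) {
    return (double) correct / total;
  }

  public static void main(String[] args) {
    int[] total = {90, 10}; // 90 items of A, 10 items of B

    // "Everything is A": all 90 As right, all 10 Bs wrong.
    int[] lazy = {90, 0};
    // Balanced: 81 of 90 As right, 9 of 10 Bs right.
    int[] balanced = {81, 9};

    System.out.println(overall(lazy, total));                // 0.9
    System.out.println(perCategory(lazy[1], total[1]));      // 0.0
    System.out.println(overall(balanced, total));            // 0.9
    System.out.println(perCategory(balanced[1], total[1]));  // 0.9
  }
}
```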
<span class="mt-enclosure mt-enclosure-image" style="display: inline;"><img alt="sentiment-breakdown-300x229.jpg" src="http://kdpeterson.net/blog/images/sentiment-breakdown-300x229.jpg" width="300" height="229" class="mt-image-right" style="float: right; margin: 0 0 20px 20px;" /></span> 
<p>Of course, there's no "make mistakes evenly" button to press, but I think we found a combination that gives useful results. You can see in the attached chart of predicted vs. human-annotated sentiment that the errors are evenly spread across the categories. This illustrates what we mean when we say that sentiment, though it is only correct for about 2/3 of the individual items, is <em>directionally</em> accurate. If the system finds 100 articles for a topic, and says 50 of them are <em>positive</em>, a lot of those will be wrong. Maybe you go through them and you see that 10 of them were <em>neutral</em>, four <em>negative</em> and one <em>mixed</em>. But when you go to the other categories, you'll find that the errors mostly balance out. Some of the <em>neutral</em> should have been <em>positive</em>, and so on. So maybe there should have been 52 <em>positive</em>.</p>
<p>There's a strong temptation when building an automated sentiment system to treat <em>neutral</em> as "I'm not sure". Computers make different kinds of mistakes than humans, and when the computer screws up something a human would have no trouble classifying correctly, it erodes confidence. The problem with this approach is that it focuses too much on not being wrong, and not enough on being right. If uncertain posts are rated as <em>neutral</em>, it changes the whole distribution of content. If you look at a topic and 75% of the content is "neutral", how much is really neutral and how much is swept under the rug because it didn't cross a confidence threshold? We treat <em>neutral</em> as just another category. To classify something that should be <em>positive</em>, <em>negative</em>, or <em>mixed</em> as <em>neutral</em> is just as incorrect as vice-versa.</p> 
<p>I hope this has given you some insight into how Biz360's sentiment engine works, and lets you make better sense of the numbers you are seeing, or, if you are still comparing solutions, gives you things to look for and questions to ask. I'll be following this up in the future with another article explaining "entity" or topic-based sentiment.</p> <p>This article was originally posted on&nbsp;<a href="http://blog.biz360.com/2010/03/inside-automated-sentiment-analysis/">Biz360's Blog</a>.</p>]]>
        
    </content>
</entry>

<entry>
    <title>How to roll back a committed change in SVN</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2010/03/how-to-roll-back-a-committed-change-in-svn.html" />
    <id>tag:kdpeterson.net,2010:/blog//2.127</id>

    <published>2010-03-09T17:41:43Z</published>
    <updated>2010-03-09T17:48:50Z</updated>

    <summary>A coworker asked me today how to roll back a change that has been committed to SVN. This isn&apos;t obvious and the top Google searches return irrelevant results. To back out or roll back a change that has already been...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="svn" label="svn" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>A coworker asked me today how to roll back a change that has been committed to SVN. This isn't obvious and the top Google searches return irrelevant results. To back out or roll back a change that has already been committed to the Subversion repository, you first merge your commit in reverse, and then you commit the result. For example, say that in change 2918 you committed some config files that should not be there. Do this:</p>

<pre>% cd config
% svn merge -c -2918 ^/project/trunk/config
% svn ci -m 'revert checkins to config'</pre>

<p>This is covered in more detail in the <a href="http://svnbook.red-bean.com/nightly/en/svn.branchmerge.basicmerging.html#svn.branchmerge.basicmerging.undo">Undoing Changes</a> section of the documentation.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Handy way to monitor multiple HBase logs</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2010/03/handy-way-to-monitor-multiple-hbase-logs.html" />
    <id>tag:kdpeterson.net,2010:/blog//2.126</id>

    <published>2010-03-02T22:35:42Z</published>
    <updated>2010-03-03T01:24:00Z</updated>

    <summary>We&apos;ve been running HBase at Biz360 for about six months, but it worked so smoothly at first that I never did much tuning. I&apos;ve recently increased the volume of data we&apos;re storing by about 300%, and have started running into...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="hbase" label="hbase" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>We've been running HBase at Biz360 for about six months, but it worked so smoothly at first that I never did much tuning. I've recently increased the volume of data we're storing by about 300%, and have started running into some problems like blocks going missing. Today I found a good way to monitor all my servers' logs to get an idea of what's going on, and it's so simple I wanted to share.</p>

<pre>
for x in 01 02 03 04 05 06 07 08 09 10 ; do
    server=prod-hbase$x
    echo === $server ===
    ssh $server tail -n 30 -f '/app/hbase/logs/*region*.log'
done
</pre>

<p>It does a tail -f on each server's log in turn. To move to the next server, just hit ctrl-c.</p>

<p>@squarecog suggests</p>

<pre>
for server in `cat hbase_servers.txt`; do ...
</pre>
]]>
        

    </content>
</entry>

<entry>
    <title>You, right there, go call 911</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2010/01/you-there-call-911.html" />
    <id>tag:kdpeterson.net,2010:/blog//2.125</id>

    <published>2010-01-19T21:53:52Z</published>
    <updated>2010-01-19T22:11:48Z</updated>

    <summary>One item that stuck in my mind during a first aid class during boot camp was how to direct someone to call 911. Wrong: &quot;Somebody call 911!&quot; Right: &quot;You, in the black shirt, go call 911.&quot; If something isn&apos;t the responsibility of a particular person,...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="softwaredevelopment" label="software development" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[One item that stuck in my mind from a first aid class during boot camp was how to direct someone to call 911.<br /><br /><b>Wrong:</b><br />"Somebody call 911!"<br /><br /><b>Right:</b><br />"You, in the black shirt, go call 911."<br /><br />If something isn't the responsibility of a particular person, it will not get done. Everyone assumes someone else will do it.<br /><br />The same thing applies to software, and especially to maintenance issues. If you say "hey, guys, looks like the app is down", it's going to stay that way. An open ticket not assigned to anyone is only useful if it is implicitly assigned to the product manager to review for the next iteration. If the app is down, then the right action is to open a ticket, and say, "hey, the app is down, it looks like the database, Joe, that's your area, here, it's your ticket". If Joe determines that the database is down because someone tripped over the power cord, then Joe can assign the ticket to Larry in ops, but every step of the way, one particular person is responsible for that task.<br /> ]]>
        
    </content>
</entry>

<entry>
    <title>Hadoop Workflow Tools Survey</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.124</id>

    <published>2009-11-19T10:25:36Z</published>
    <updated>2009-11-19T10:29:14Z</updated>

    <summary>Hadoop Map Reduce and HDFS are fairly stable pieces of software. One component that doesn&apos;t have a clear winner yet is higher level job scheduling, also known as workflow scheduling. To put this in context for someone who isn&apos;t familiar...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="hadoop" label="hadoop" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>Hadoop Map Reduce and HDFS are fairly stable pieces of software. One
component that doesn't have a clear winner yet is higher level job
scheduling, also known as workflow scheduling.</p>

<p>To put this in context for someone who isn't familiar with Hadoop, a
single Hadoop job is broken up into many map and reduce tasks. The
scheduler runs on the job tracker and assigns tasks to open slots on
the task trackers on the worker nodes. When we talk about the
scheduler in Hadoop, this is usually what we are talking about. By
default, Hadoop uses a FIFO scheduler, but there are two more advanced
schedulers which are widely used. The Capacity Scheduler is focused on
guaranteeing that various users of a cluster will have access to their
guaranteed number of slots, and the Fair Scheduler is
focused on providing good latency for small jobs while long running
large jobs share the same cluster. These schedulers closely parallel
processor scheduling, with Hadoop jobs corresponding to processes and
the map and reduce tasks corresponding to time slices.</p>

<p>The next level up is workflow scheduling -- starting jobs on a cluster
in the right order and with dependencies. Sometimes a single
map-reduce job is all you need. More frequently, you will have many
jobs with dependencies between them. For example, you might want to
identify the most important words in each document using <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">term
frequency&ndash;inverse document frequency</a>, which requires
first calculating the inverse document frequency then making use of
that while examining the documents again. In this case, a shell script
that runs the first job, waits for it to complete and then starts the
second will work.</p>

<p>Once you go down this path, you start running into
difficulties. Perhaps job C depends on job A and job B, but it's fine
for A and B to run in parallel. If D depends on B and C, and B and C
depend on A, and B fails part way through, how do you recover? It's
not a particularly hard problem, but it's enough of a problem that
we'd like to not reinvent the wheel. After all, while people use
Hadoop for different tasks, this workflow scheduling problem is common
to everyone.</p>
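<p>As a rough illustration of the ordering half of the problem, here is a plain Java sketch (no Hadoop involved, all names hypothetical) that groups jobs into waves, where every job in a wave has all of its dependencies in earlier waves and so could run in parallel with the others in that wave. It does nothing about the harder part, recovering from a failure.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WorkflowWaves {
  // deps maps each job name to the names of the jobs it depends on.
  // Every job must appear as a key, even if it has no dependencies.
  static List<List<String>> waves(Map<String, List<String>> deps) {
    List<List<String>> result = new ArrayList<List<String>>();
    Set<String> done = new HashSet<String>();
    while (done.size() < deps.size()) {
      // Collect every unfinished job whose dependencies are all done.
      List<String> wave = new ArrayList<String>();
      for (Map.Entry<String, List<String>> e : deps.entrySet()) {
        if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
          wave.add(e.getKey());
        }
      }
      if (wave.isEmpty()) {
        throw new IllegalArgumentException("cycle or missing job in dependencies");
      }
      done.addAll(wave);
      result.add(wave);
    }
    return result;
  }

  public static void main(String[] args) {
    // D depends on B and C; B and C depend on A.
    Map<String, List<String>> deps = new LinkedHashMap<String, List<String>>();
    deps.put("A", new ArrayList<String>());
    deps.put("B", Arrays.asList("A"));
    deps.put("C", Arrays.asList("A"));
    deps.put("D", Arrays.asList("B", "C"));
    System.out.println(waves(deps)); // [[A], [B, C], [D]]
  }
}
```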

<p>I recently sent out a poll to the Hadoop mailing list to see how
people are solving this problem. </p>
]]>
        <![CDATA[<p>The first question was what are you using to manage your jobs.</p>

<ul>
<li>Five people are using shell scripts.</li>
<li>Three people are using a homegrown system.</li>
<li>Five people are using a higher level abstraction like Pig, Hive, or Cascading.</li>
<li>One person reported using Oozie.</li>
<li>One person reported using Opswise.</li>
<li>One person is using Amazon Elastic Map Reduce.</li>
</ul>

<p>The next question mostly applied to those using shell scripts or a
homegrown system and asked how these systems interacted with Hadoop.</p>

<ul>
<li>Four people are using in-house systems built on top of JobControl.</li>
<li>One person is using an in-house system that uses Job.</li>
<li>One person is directly submitting the job as XML.</li>
</ul>

<p>I also asked whether people were happy with the tools they are using
(whether homegrown or off the shelf).</p>

<ul>
<li>4 people are very happy with their system and would recommend it to others.</li>
<li>5 people have some headaches but aren't actively looking to replace it.</li>
<li>7 are not happy with their system and would like to replace it.</li>
</ul>

<p>For those who want to look at the raw data (as sparse as it is), I've
posted it to a <a href="http://spreadsheets.google.com/ccc?key=0AkSgUqAZJOp-dFctWE9QdWhoc1M4TlZSQTNlNXdYVHc&amp;hl=en">Google document</a>. What was most
interesting is that of the people using a homegrown system, only one
said they were at all happy with it, and none would recommend their
system. A majority of those using a higher level abstraction would
recommend their system to others. Before taking the poll, I worried
that I was doing things wrong, that there was some simple clear
solution that everyone was adopting. The opposite was true: any
combination I could come up with, there was someone out there who had
actually done it that way.</p>

<p>There's a continuum from just running a Hadoop job by typing
<code>bin/hadoop jar ...</code> or putting it into crontab up through more
complicated systems like what we have been using at Biz360 that
involve scripts to figure out batch numbers and parameters and then
start Java processes that may run multiple jobs using JobControl. The
only person using a homegrown system who said it was acceptable is
using something based on JobControl. JobControl is included with
Hadoop and simply helps to manage dependencies between Jobs. Rather
than keeping track on your own that you can start job A and job B, and
need to wait for them both to finish before starting C, you can add
them both to a JobControl and run it. Dependencies can only be between
jobs &ndash; you can't have a task to move files around depend on jobs
or a job depend on whether a directory is empty or anything like
that. This is client side, so the process that started the JobControl
will need to keep a thread running. You can detect errors, and it will
stop running jobs when a dependency can't be satisfied, but there's no
way to recover from errors. If you want to retry jobs, you need to
handle that yourself.</p>
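<p>The "stops running jobs when a dependency can't be satisfied" behavior can be pictured with a plain Java sketch. This is not the JobControl API, just a hypothetical helper that, given a dependency map and the name of a failed job, works out which jobs can no longer run.</p>

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FailureSketch {
  // deps maps each job to the jobs it depends on. Returns every job that
  // depends, directly or indirectly, on the failed job.
  static Set<String> blockedBy(String failed, Map<String, List<String>> deps) {
    Set<String> blocked = new TreeSet<String>();
    boolean changed = true;
    // Repeat until a full pass adds nothing new, so map order doesn't matter.
    while (changed) {
      changed = false;
      for (Map.Entry<String, List<String>> e : deps.entrySet()) {
        if (blocked.contains(e.getKey())) {
          continue;
        }
        for (String dep : e.getValue()) {
          if (dep.equals(failed) || blocked.contains(dep)) {
            blocked.add(e.getKey());
            changed = true;
            break;
          }
        }
      }
    }
    return blocked;
  }
}
```

<p>With the earlier example where D depends on B and C, and B and C depend on A, a failure in B blocks only D. A and C are unaffected, so in principle a retry would only need to rerun B and then D, which is exactly the bookkeeping you are left to do yourself.</p>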

<p>Another popular option, and the one that seems to have the most happy
users, is a higher level abstraction that runs on top of Hadoop like
<a href="http://www.cascading.org/">Cascading</a>, <a href="http://hadoop.apache.org/pig/">Pig</a> or <a href="http://hadoop.apache.org/hive/">Hive</a>. These share many common features.</p>

<ul>
<li>Flow of data expressed using concepts more expressive than
Map-Reduce, such as filters and grouping operators.</li>
<li>Job optimizer step to translate the higher level data flow into
discrete Hadoop Map-Reduce jobs.</li>
<li>A scheduling engine that can run the entire workflow.</li>
<li>Extension points to insert operations that aren't covered by the
system's primitives.</li>
</ul>

<p>There are significant differences. Hive uses SQL to express the
workflow, Pig has its own language called Pig Latin, while Cascading
workflows are written in Java or Groovy. User defined functions in Pig are very
much an extension to Pig, compared to Cascading where it's possible to
create a Flow directly from a Hadoop JobConf. Hive specifically
targets integration with SQL based tools. All of these to some extent
insulate the user from the Hadoop concept of jobs, replacing it with
something else. For our purposes at Biz360, the ability to plug
existing Hadoop jobs unchanged into a Cascade would make Cascading the best
choice if we wanted to go this route. Other users may find the simple
query languages of Pig or Hive compelling.</p>

<p>One more specialized tool is <a href="http://code.google.com/p/hamake/">Hamake</a>, named for and inspired by
<code>make</code>. This allows you to express your jobs as dependencies on data
so that you can run only those portions of a complex pipeline that are
not up to date. I should note that Cascading also has this functionality.</p>

<p>None of these really suit our purposes though. I'd prefer to stick to
writing Hadoop map-reduce jobs, not Pig Latin or Cascading
Flows. Understanding where my data goes when I have a Hadoop Reducer
writing to a SolrOutputFormat that writes to an instance of Solr
running on the same node as the Reducer is right at the edge of what I
can keep track of. If I introduce another layer of indirection, I
would get hopelessly confused.</p>

<p>There are also tools for scheduling map-reduce jobs and assembling
them into workflows. Amazon's Elastic Map Reduce has one such tool,
but it is tied to their particular service. Cloudera Desktop offers
some basic job scheduling, but doesn't yet do much for workflow
scheduling. I'm not sure what Opswise offers as far as Hadoop
scheduling goes; I'd never heard of it before someone mentioned it in
the poll.</p>

<p>So I'm going to list what I think are the killer features to see in
a Hadoop workflow scheduling system:</p>

<ul>
<li>Schedule both map reduce jobs and other actions like copying a file
from the local filesystem, or testing to ensure that a directory has
60 files in it.</li>
<li>Express a directed acyclic graph of dependencies between jobs and
actions. (Loops would be nice, but I don't need them.)</li>
<li>Full access to set arbitrary input formats, output formats, mapper,
reducer, and combiner classes.</li>
<li>Ability to drop into Java code when needed with some sort of "postconfigure
class". I'm thinking of setting up a scanner for HBase
TableInputFormat here.</li>
<li>Run as a server-side process. It should be possible for clients to
submit entire workflows, and those workflows are then detached from
the clients.</li>
<li>Ability to stop and restart a workflow part way through.</li>
<li>Ability to rerun a workflow which had a single part fail.</li>
<li>Ability to persist status between service restarts.</li>
<li>Scheduled jobs.</li>
</ul>
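
<p>To make the dependency-graph and restart items concrete, here is a minimal sketch in plain Ruby. It is not any real scheduler: <code>dependency_order</code>, <code>run_workflow</code> and the step names are made up for illustration, and the <code>done</code> list stands in for whatever status a real service would persist between restarts.</p>

<pre class="brush:ruby">
# Return the steps ordered so that every step comes after its dependencies.
# deps maps each step name to the list of steps it depends on (hypothetical format).
def dependency_order(deps)
  ordered = []
  visiting = []
  visit = lambda do |step|
    next if ordered.include?(step)
    raise ArgumentError, "cycle at #{step}" if visiting.include?(step)
    visiting.push(step)
    deps.fetch(step).each { |dep| visit.call(dep) }
    visiting.delete(step)
    ordered.push(step)
  end
  deps.each_key { |step| visit.call(step) }
  ordered
end

# Run steps in dependency order, skipping any already listed in done.
# The block does the actual work and returns true on success.
# Returns the name of the step that failed, or nil if everything ran.
def run_workflow(deps, done)
  dependency_order(deps).each do |step|
    next if done.include?(step)
    return step unless yield(step)
    done.push(step)
  end
  nil
end

deps = { extract: [], parse: [:extract], dedupe: [:extract], index: [:parse, :dedupe] }
done = []

# First attempt: pretend the dedupe step fails.
failed = run_workflow(deps, done) { |step| step != :dedupe }

# Second attempt: everything succeeds, and finished steps are not rerun.
rerun = []
run_workflow(deps, done) { |step| rerun.push(step); true }
</pre>

<p>Because finished steps are recorded in <code>done</code>, the second call only runs <code>dedupe</code> and <code>index</code>. That is the "rerun a workflow which had a single part fail" behavior I'm after, minus the server, the persistence, and the actual Hadoop jobs.</p>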

<p>Most of what I've listed above is available in <a href="http://issues.apache.org/jira/browse/HADOOP-5303">Oozie</a>, the Hadoop
Workflow System. It allows you to express a workflow as a DAG using
XML, and it supports Java map-reduce jobs, streaming jobs, and even
Pig. You can have nodes that are file system actions or that call
outside shell scripts. Everything is persisted in MySQL and it has a
UI showing what stage each running workflow is at. On paper, Oozie
looks like the holy grail. The downside is that it isn't at the level
of polish of the other projects I mentioned. I haven't attempted to
set it up because I can't see that it has been updated for Hadoop
0.20. The last update to the Jira was in June, and it hasn't yet been
committed, even though it's just a contrib package.</p>

<p>But never fear, the latest post on Yahoo's Hadoop blog is about hiring
for the Hadoop team, and I notice that they have an opening for
"Senior Engineer for Oozie" and say that "the Oozie team is rapidly
growing". At least this means if I set it up and learn to use it, I
can trust that it isn't going to be dying off any time soon.</p>

<p>Even if nothing comes of Oozie, Cloudera is also working in this
direction. Cloudera Desktop currently has a job designer that allows
you to create and save parameterized single map-reduce jobs, for
example, to set all of the options except the input and output
directories. It doesn't yet have any workflow tools, but I'm told they
are in the works.</p>
]]>
    </content>
</entry>

<entry>
    <title>Posner explains CYA security theater</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/10/posner-explains-cya-security-theater.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.122</id>

    <published>2009-10-22T15:25:53Z</published>
    <updated>2009-10-22T19:22:12Z</updated>

    <summary>It&apos;s obvious to any rational outside observer that US terrorism policy mostly revolves around making sure people think politicians are &quot;doing something&quot;, regardless of whether something needs to be done, or whether what they&apos;re doing is the right thing. Explaining...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Politics" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="terrorism" label="terrorism" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>It's obvious to any rational outside observer that US terrorism policy mostly revolves around making sure people think politicians are "doing something", regardless of whether something needs to be done, or whether what they're doing is the right thing. Explaining the work for which Williamson won the Nobel last week, Judge Posner writes:</p>

<blockquote>
[FBI criminal-investigation functions] lend themselves to what are called "high-powered" incentives, which are systems of compensation and promotion that are based on objective performance criteria. In the case of criminal investigation these are number of arrests weighted by convictions and sentence. Intelligence work does not lend itself to such performance criteria, because the effect of surveillance and other intelligence activities in preventing terrorism or subversion is usually very difficult to assess. Hence motivation takes the form of creating a "high commitment" environment in which the organization's leaders try to elicit good performance by getting staff to internalize the organization's goals. The problem is that the absence of objective criteria of performance opens the door to "influence activities" by which members of the organization jockey for advancement.<br/><br/>
If both types of task are combined in the same organization--those that can be directed by high-powered incentives and those that require high commitment as their motivator, the best employees will tend to gravitate toward the first type of task because they will be confident that they will do well if their performance is judged according to objective criteria. They will be much less certain how well they will do in a job in which influence activities play a large role in determining success.
</blockquote>

<p>To summarize the summary, the best and the brightest will be drawn to organizations that have objective measures of success and, even more so, to the roles within a given organization that have them. Those who aren't very good, and especially the political hacks who shamelessly talk about how the threat level is Orange today, so put on extra sunscreen, will be drawn to roles without objective measures of success, where climbing the career ladder is based on criteria other than doing the job better than the next guy.</p>

<p>Check out <a href="http://www.becker-posner-blog.com/archives/2009/10/the_economics_o_10.html">the whole article</a></p>
]]>
        

    </content>
</entry>

<entry>
    <title>Capping simultaneous tasks in Hadoop</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/10/capping-simultaneous-tasks-in-hadoop.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.121</id>

    <published>2009-10-21T07:45:17Z</published>
    <updated>2009-10-21T08:18:23Z</updated>

    <summary> We&apos;ve run into several situations in Hadoop where we want to prevent a job from using more than a certain number of slots. Some of our jobs have external resources that don&apos;t scale. One task needs to talk to...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="hadoop" label="hadoop" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p><span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://kdpeterson.net/blog/assets_c/2009/10/pools-4.html" onclick="window.open('http://kdpeterson.net/blog/assets_c/2009/10/pools-4.html','popup','width=679,height=323,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://kdpeterson.net/blog/assets_c/2009/10/pools-thumb-350x166-4.png" width="350" height="166" alt="Fair Scheduler Pools Screenshot" class="mt-image-right" style="float: right; margin: 0 0 20px 20px;" /></a></span></p>

<p>We've run into several situations in Hadoop where we want to prevent a job from using more than a certain number of slots. Some of our jobs have external resources that don't scale. One task needs to talk to a MySQL database. Another writes to our Solr cluster. We know that beyond a certain point these jobs don't go any faster -- 200 running mappers are no faster than 50. We moved to the fair scheduler partially to alleviate some of these concerns. The idea was that if multiple jobs are running at once, they aren't likely to be the same type of job.</p>

<p>The other day I ran into a problem again and decided to take a look around to see if anyone had done anything in this direction. The first issue was <a href="http://issues.apache.org/jira/browse/HADOOP-5170">HADOOP-5170</a> which ended with a consensus that the functionality should be in the scheduler, not part of Map Reduce proper. <a href="http://issues.apache.org/jira/browse/MAPREDUCE-698">MAPREDUCE-698</a> is to add a per-pool simultaneous tasks cap to the Fair Scheduler, which is a much better idea than to cap it on the job level.</p>

<p>If your jobs rely on external services like a database or web service, you can run those jobs in a particular pool. If you have two jobs in this pool, then they will share the cap, and the load on your database remains constant. Also, the pool can be assigned a set minimum to ensure that you don't have the database sitting there idle now, and then have half your Hadoop cluster sitting idle later while you wait for these jobs to finish.</p>
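
<p>As a rough illustration of what "sharing the cap" means, here is a small Ruby sketch. This is not the Fair Scheduler's actual algorithm; <code>share_pool_cap</code>, the job names, and the numbers are all hypothetical. It hands out slots one at a time in round-robin order and never gives the pool more than its cap or a job more than it asked for.</p>

<pre class="brush:ruby">
# Hand out at most cap slots across the jobs in one pool.
# demands maps a job name to the number of tasks it would like to run.
def share_pool_cap(cap, demands)
  granted = {}
  demands.each_key { |job| granted[job] = 0 }
  remaining = cap
  loop do
    # Jobs that still want more slots than they have been given.
    hungry = demands.keys.reject { |job| granted[job] == demands[job] }
    break if hungry.empty? || remaining.zero?
    hungry.each do |job|
      break if remaining.zero?
      granted[job] += 1
      remaining -= 1
    end
  end
  granted
end

# Two jobs that both write to the same database, in a pool capped at 50 slots.
slots = share_pool_cap(50, { solr_index: 200, mysql_export: 10 })
total = slots.values.inject(0) { |sum, n| sum + n }
</pre>

<p>Here the small job gets the 10 slots it wants, the big one gets the other 40, and the database never sees more than 50 tasks at once no matter how many jobs land in the pool.</p>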

<p>If your jobs have very long-running tasks, like when building a Lucene index in a reducer, you may want to avoid having these jobs grab slots during gaps when there are no jobs running. I see this frequently when one job finishes, and in the time before the dependent job starts up, all the slots have been taken by another job. Without preemption, you can end up increasing latency a lot.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Using HBase TableIndexed from Thrift with unique keys</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/09/using-hbase-tableindexed-from-thrift-with-unique-keys.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.120</id>

    <published>2009-09-30T19:48:55Z</published>
    <updated>2009-10-12T23:13:31Z</updated>

    <summary>HBase is primarily a sorted distributed hash map, but it does support secondary keys through a contrib package called Transactional HBase. The secondary keys are provided by a component called TableIndexed. A good general walkthrough is Secondary indexes in HBase...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="hbase" label="hbase" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="thrift" label="thrift" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>HBase is primarily a sorted distributed hash map, but it does support secondary keys through a contrib package called Transactional HBase. The secondary keys are provided by a component called TableIndexed. A good general walkthrough is <a href="http://rajeev1982.blogspot.com/2009/06/secondary-indexes-in-hbase.html">Secondary indexes in HBase</a> on Rajeev Sharma's blog. My blog post specifically addresses how to use secondary indexes outside of the Java API and how to handle unique keys.</p>

<p>Our scenario is this: my articles table will use a row key that starts with the timestamp. This is a very common scenario in HBase because it is usually the most natural way to access information. I also have, on at least some articles, a secondary key that I want to be able to do lookups by. Let's have a concrete example for further discussion.</p>

<pre>
row key  key:id  content:title
  1234     abc     ...
</pre>

<p>These keys are always unique; this is guaranteed at the application level. I would like to be able to fetch an article using the secondary key. All pretty straightforward, except I want to do this outside of Java using just Thrift, and I would prefer to do it using only get() because that makes it easier to write and easier for anyone coming along to read later. Scanners may be powerful, but it's not intuitive that a <em>get</em> using a unique secondary index would use them.</p>

<p>The first thing we need is an index specification that can create the unique key that we want. By default, HBase provides a SimpleIndexKeyGenerator that creates row keys that start with the value of the secondary key; in our example, that would be abc1234. This supports having multiple rows that match the secondary key, for example, abc5678, but means that you have to use a scanner to get at the information.</p>

<p>I've written a <a href="http://gist.github.com/187532">UniqueIndexKeyGenerator</a> that uses the exact value from the primary table's secondary key as the row key for the index table. That is, for our example, the row key in "articles-index" would be "abc". I've also added a forUniqueIndex static method to IndexSpecification to make it easier to call this from the JRuby shell.</p>
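
<p>The difference between the two key schemes is easier to see with toy data. The following Ruby sketch uses plain hashes in place of HBase tables; <code>simple_index_key</code> and <code>unique_index_key</code> are hypothetical stand-ins for what the two generators produce, not the real classes.</p>

<pre class="brush:ruby">
# Simple key: indexed value followed by the primary row key,
# so several primary rows can share one indexed value.
def simple_index_key(value, row_key)
  value + row_key
end

# Unique key: the indexed value alone, so there is exactly one row per value.
def unique_index_key(value)
  value
end

# Toy index tables mapping index row key to primary row key.
simple_index = {}
simple_index[simple_index_key('abc', '1234')] = '1234'
simple_index[simple_index_key('abc', '5678')] = '5678'

unique_index = {}
unique_index[unique_index_key('abc')] = '1234'

# With the simple key, finding 'abc' means scanning the sorted keys for a prefix.
scanned = simple_index.keys.sort.select { |key| key.start_with?('abc') }
rows_from_scan = scanned.map { |key| simple_index[key] }

# With the unique key, a single exact lookup is enough.
row_from_get = unique_index['abc']
</pre>

<p>The prefix scan is what forces a scanner with the default generator; the exact lookup is what lets the unique version get by with a plain get().</p>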

<pre class="brush:java">
  /**Construct an index spec for a single column that has only unique values.
   * @param indexId the name of the index
   * @param indexedColumn the column to index
   * @return the IndexSpecification
   */
  public static IndexSpecification forUniqueIndex(String indexId, byte[] indexedColumn) {
    return new IndexSpecification(indexId, new byte[][] { indexedColumn },
        null, new UniqueIndexKeyGenerator(indexedColumn));
  }
</pre>

<p>I then add a create_index method directly in the shell. The shell is flexible enough that you have access to things like @configuration, which is used by all the other methods. Also, I was unsure whether I could do an import inside a method, but it worked just fine. It is there with the idea that the method could eventually be added to the shell, and if you tried to call it without the transactional jar in your classpath, it would fail with a class not found error.</p>

<pre class="brush:ruby">
def create_index(table_name, index_name, column)
  import org.apache.hadoop.hbase.client.tableindexed.IndexedTableAdmin
  import org.apache.hadoop.hbase.client.tableindexed.IndexSpecification
  @iadmin ||= IndexedTableAdmin.new(@configuration)
  spec = IndexSpecification.for_unique_index(index_name, column.to_java_bytes)
  @iadmin.addIndex(table_name.to_java_bytes, spec)
end
</pre>

<p>Using this, I can create my tables in the shell easily.</p>

<pre class="brush:ruby">
create 'a', 'key', 'content'
create_index 'a', 'index', 'key:id'
</pre>

<p>This creates a table "a" with two column families, "key" and "content". It creates an index called "index" on the "id" column in the "key" column family.</p>

<p>Accessing this over Thrift is now very simple. You can look at <a href="http://gist.github.com/198256">query.rb</a> but these are the important sections. First, we need to make sure we have Thrift hooked up.</p>

<pre class="brush:ruby">
require 'rubygems'
require 'hbase'

transport = Thrift::BufferedTransport.new(Thrift::Socket.new('127.0.0.1', 9090))
protocol = Thrift::BinaryProtocol.new(transport)
client = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)
transport.open
</pre>

<p>Thrift is required by hbase, so we don't need to require it separately. I've opened a connection to my own machine on port 9090, the default for the Thrift server. I don't know the details of the above; it's just voodoo I picked up on some other site. My main loop lets me query a bunch of secondary keys interactively.</p>

<pre class="brush:ruby">
STDIN.each do |id|
  id.strip!
  row_keys = client.get index, id, '__INDEX__:ROW'
  row_key_cell = row_keys[0]
  if row_key_cell
    row_key = row_key_cell.value
    puts "Found row key #{row_key}"
    value = client.get table, row_key, column
    puts "Found item #{value[0].value}"
  else
    puts "unable to find '#{id}' in index"
  end
end
</pre>

<p>This is the important part. First, we read an id, which is the value we are trying to match to key:id in the table. We then get the contents of "__INDEX__:ROW" from the index table where the row key matches the id. In my case, there's always only one, but the API doesn't know that, so it returns a list of cells. The cells have a column specifier, a timestamp, and what we care about, a value. This value is the row key of the primary table. If we found anything in the secondary index, we then do a get on the primary table.</p>

<p>This is currently describing my work-in-progress. Once I've actually got everything up and running in production under load, I'll create some patches to submit to HBase. I think adding UniqueIndexKeyGenerator greatly simplifies one common use of TableIndexed, and adding support for TableIndexed to the shell makes it easy to manage schema.</p>

<p>The code is now available in <a href="https://issues.apache.org/jira/browse/HBASE-1885">HBASE-1885</a>.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Minimal HBase MapReduce Example for 0.19 and 0.20</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/09/minimal-hbase-mapreduce-example.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.119</id>

    <published>2009-09-04T23:02:00Z</published>
    <updated>2009-10-09T21:01:11Z</updated>

    <summary>HBase includes an example for populating a table from Hadoop map reduce, but it seemed overly complicated. I&apos;m getting started with HBase and this was my starting point. This first one uses the old Hadoop API with everything in the...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="hadoop" label="hadoop" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="hbase" label="hbase" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="java" label="java" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>HBase includes an example for populating a table from Hadoop map reduce, but it seemed overly complicated. I'm getting started with HBase and this was my starting point. This first one uses the old Hadoop API with everything in the mapred package, not mapreduce. It also uses the corresponding API from HBase which is now deprecated.</p>

<pre class="brush:java">
public class PopulateArticlesTable extends Configured
  implements Tool {
  public static class Map extends MapReduceBase
    implements
    Mapper&lt;LongWritable, Text, ImmutableBytesWritable, BatchUpdate&gt; {
    private ImmutableBytesWritable outKey = new ImmutableBytesWritable();

    @Override
    public void map(
      LongWritable offset,
      Text input,
      OutputCollector&lt;ImmutableBytesWritable, BatchUpdate&gt; output,
      Reporter report) throws IOException {
      // whatever format your data is in
      RichArticle art = new RichArticle(input.toString());
      // a good HBase row key, consisting of a timestamp and a unique identifier to prevent collisions. All keys are byte arrays.
      byte[] rowId = art.getRowId();
      outKey.set(rowId);
      // We execute one update for each object we encounter, that update may be composed of multiple operations, in this case, two puts
      BatchUpdate update = new BatchUpdate(rowId);
      if (art.getTitle() != null)
        update.put("content:title", Bytes
          .toBytes(art.getTitle()));
      if (art.getBody() != null)
        update.put("content:body", Bytes
          .toBytes(art.getBody()));
      output.collect(outKey, update);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // Standard boilerplate for creating and running a hadoop job
    JobConf job = new JobConf(getConf(), this.getClass());
    String input = args[0];
    job.setJobName("Populate articles table from " + input);
    // Input is just text files in HDFS
    TextInputFormat.setInputPaths(job, new Path(input));
    job.setMapperClass(Map.class);
    job.setNumReduceTasks(0);
    // Output is to the table output format, and we set the table we want
    job.setOutputFormat(TableOutputFormat.class);
    job.set(TableOutputFormat.OUTPUT_TABLE, "articles");
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String args[]) throws Exception {
    int res = ToolRunner.run(new Configuration(),
      new PopulateArticlesTable(), args);
    System.exit(res);
  }
}
</pre>

<p>Next up I've converted everything to use the 0.20 APIs. You'll see that it got shortened too, as I use GenericOptionsParser directly instead of implementing Tool.</p>

<p>For the Hadoop changes, OutputCollector is no more; it has been replaced by Context. The Mapper interface and MapReduceBase have been merged into the Mapper class, which is intended to be extended (without extending it, Mapper is the IdentityMapper). For job control, you no longer use JobClient.runJob, instead calling waitForCompletion on the Job. The configuration has been cleaned up, as you'll notice I have to create a Configuration prior to the Job. One big item is that you need to call setJarByClass manually, which was previously taken care of by creating the JobConf with the class as a parameter. JobConf.setOutputFormat has become Job.setOutputFormatClass.</p>

<p>I'm new to HBase, so I'm not as sure about whether I've done things in the recommended way. The important things to note are that you need to set the table name in the conf before creating the job, and Puts and Deletes are Hadoop Writables.</p>

<pre class="brush:java">
public class PopulateArticlesTable {
  public static class Map extends
    Mapper&lt;LongWritable, Text, NullWritable, Writable&gt; {

    @Override
    protected void map(LongWritable offset, Text input, Context context) throws IOException, InterruptedException {
      // my input is in JSON format, in other applications, you might be splitting a line of text or any of Hadoop's writable formats
      RichArticle art = new RichArticle(input.toString());
      // RichArticles are able to output a good HBase row key, consisting of a timestamp and a unique identifier to prevent collisions. All keys in HBase are byte arrays.
      byte[] rowId = art.getRowId();
      // We output multiple operations for each row
      if (art.getTitle() != null) {
        Put put = new Put(rowId);
        put.add(Bytes.toBytes("content"), Bytes.toBytes("title"), Bytes.toBytes(art.getTitle()));
        context.write(NullWritable.get(), put);
      }
      if (art.getBody() != null) {
        Put put = new Put(rowId);
        put.add(Bytes.toBytes("content"), Bytes.toBytes("body"), Bytes.toBytes(art.getBody()));
        context.write(NullWritable.get(), put);
      }
    }
  }

  public static void main(String args[]) throws Exception {
    Configuration conf = new Configuration();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "articles");
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    String input = otherArgs[0];
    Job job = new Job(conf, "Populate Articles Table with " + input);
    // Input is just text files in HDFS
    FileInputFormat.addInputPath(job, new Path(input));
    job.setJarByClass(PopulateArticlesTable.class);
    job.setMapperClass(Map.class);
    job.setNumReduceTasks(0);
    // Output is to the table output format, and we set the table we want
    job.setOutputFormatClass(TableOutputFormat.class);
    // Exit non-zero if the job fails so callers can detect the failure
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre>

<p>Hopefully the before and after for the APIs is helpful. These have both been tested and work on my system. You'll need to modify them slightly, of course, unless you happen to have a data object called RichArticle that has a string serialization.</p>

<p>Update: I should probably point you at the <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html">official documentation</a> as well.</p>
]]>
        

    </content>
</entry>

<entry>
    <title> Scala Actors at SDForum</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/06/scala-actors-at-sdforum.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.118</id>

    <published>2009-06-25T03:03:49Z</published>
    <updated>2009-06-25T03:58:38Z</updated>

    <summary>I&apos;m attending tonight a presentation on actors and actors in scala, presented by SDForum Software Architecture SIG. Upcoming meeting: July 21, 3rd tuesday, Vijay Patel linked in talk about analytics. Carl Hewitt, Stanford; Robey Pointer, Twitter; Frank Sommers, Artima; Bill...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="scala" label="scala" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>Tonight I'm attending a presentation on actors, and actors in Scala,
presented by the SDForum Software Architecture SIG. Upcoming meeting: July
21, 3rd Tuesday, Vijay Patel (LinkedIn) talk about analytics. Carl
Hewitt, Stanford; Robey Pointer, Twitter; Frank Sommers, Artima; Bill
Venners moderating. Abstract to concrete.</p>

<p>I'll post this as-is and come back and edit it later.</p>

<h2>Carl Hewitt, Stanford, inventor of actor paradigm</h2>

<p>Carl Hewitt: back in the day in 1972 we programmed Smalltalk with a
magnetized needle and a steady hand, uphill in the snow both ways.</p>

<p>Three things: send more messages; create new addresses; decide what
state for next message. Petri nets as a model suffer from being
physically impossible. The three way model has the advantage of being
possible.
The actor model breaks representation as a Turing machine or in lambda
calculus.</p>

<p>Cloud: it's clients, all the way up. (Title for client server
interaction on the cloud: the fog rolls in.) John McCarthy defines
Lisp in terms of Lisp. "How many have seen 'eval'?" 2 hands. Review of
that lecture from 61a. Instead of eval the function, eval as a
message. If I'm X and I get an eval message with an environment. This
is the best way to define concurrent programming languages
currently. Bah, say mathematicians, that's circular! Well, too bad,
concurrency doesn't fit math.</p>

<p>Define theoretical PL ActorScript. XML and JSON instantiations. No
assignments, but not functional. Actors get replaced by their next
version.</p>

<p>Tension between well-targeted ads and user desire for privacy. But
government can't mine your data fast enough. Spooks will need to
reside inside datacenters. Try to move behavioral targeting to
client, store encrypted data on cloud. July 23rd symposium on semantic
integration at Stanford. Info at http://carlhewitt.info</p>

<p>Invented the one minute lecture. Advertisers can deliver a coherent
lecture in 30 seconds. (Fantastic idea -- I should do this) Lots of
questions following his lecture. This guy just drops the bomb in terms
of many ideas at once. Stalinist theory of computation. Lots of
parallelism down the tree. Company model of computation. Different
departments. All of the departments talk to each other without having
to go through the CEO. That's concurrency. Map Reduce does the
parallelism but doesn't do the concurrency.</p>

<h2>Frank Sommers - Actors in Scala</h2>

<ul>
<li>Why Scala makes actors natural</li>
<li>Example</li>
<li>Scaling actors</li>
<li>The future</li>
</ul>

<p>Immutability, OO + functional, pattern matching, easy DSL,
JVM. Mainstream language that lends itself to Actors.</p>

<p>Walks through use of Scala actors showing a chat program. When a user
joins, the receiver of the subscribe message creates a new actor to handle
updating the user. Illustrates the shorthand syntax for actors. Sync,
async and futures messages. Gets derailed by questions from audience
that are too detailed.</p>

<p>Scala actors support working in a distributed fashion, same syntax as
within a single VM. Need to import RemoteActor._, listen on a
port. Sending actor needs to know the "node", tuple of address, port
number, and symbol. Thread-per-actor and Event-driven actor
implementations. Example used thread-per-actor, but better scalability
with event driven actors, execute actors on a thread pool. Wait for
messages without consuming a thread. Can scale to millions of actors
on a single JVM using this system. Able to schedule actor sending
message to another actor within the same thread, effectively
performance of subroutine call.</p>

<p>Missed a bit, I think he's talking about how react cannot return
conventionally. But now the time is displayed in my emacs status bar,
so all is good. Now I just need to display my battery status.</p>

<p>Will be getting continuations in the future. Pluggable schedulers,
better actor isolation using compiler plugin, static checking,
integrating exceptions, actor migration (to different node? I
assume). Tensions whether actors should be more complicated, or if the
actors library should remain very basic. Also question of single
actors library or multiple actors libraries, e.g., Lift uses a simpler
library than Scala actors. Partially pragmatic concerns versus more
pure approach.</p>

<p>Still lots of theoretical problems, but quite usable for any actual
scenario.</p>

<h2>Robey Pointer -- Twitter</h2>

<p>Talk is titled "solving problems with actors". Got started with Actors
writing a chat proxy for cell phones. Long lived connections, lots of
connections, mostly idle. First attempt was with one thread per
session. Very simple, but didn't scale. Went with thread pools and
async IO. More scalable but harder to read. Fatal flaw: blocking on
other services (http). Fix all APIs to be async using hideous
callbacks. If it doesn't fit on a slide, it's not good code.</p>

<p>Actors: each session is an actor. Events are just messages. Can seek
ahead for specific events. Works well with java.nio and Apache
MINA. MINA wraps nio as events, his naggati library translates this
into Scala messages.</p>

<p>Kestrel. Message queue. Memcache protocol as a Mina plugin. Scales
horizontally, no awareness of each other. Stats on one server: 1 month
uptime, 2.4 TB written, 4 billion gets, 1.6 billion sets.</p>

<p>Actors just one of many tools. Used synchronized for some
features. What didn't work: each queue is an actor. Move to queues
using synchronized data. Need to read this code and study it.</p>

<p>Actors are still a little shaky. Actor lifetime issues. Mixing
threads with actors makes it hard to GC.</p>

<p>Lots of exciting stuff.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Make Scala lists work for you</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/06/make-scala-lists-work-for-you.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.117</id>

    <published>2009-06-18T06:15:40Z</published>
    <updated>2009-06-18T10:08:56Z</updated>

    <summary>(Well dammit, I meant to save a draft, and here I&apos;ve accidentally published it and FriendFeed spreads it around like a gossipy neighbor, so I guess I&apos;d better finish it.) I&apos;ve been working on a collaborative filtering system based on...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="dg" label="dg" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="java" label="java" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="scala" label="scala" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p><em>(Well dammit, I meant to save a draft, and here I've accidentally published it and FriendFeed spreads it around like a gossipy neighbor, so I guess I'd better finish it.)</em></p>

<p>I've been working on a collaborative filtering system based on genetic algorithms for <strike>message boards</strike>, <strike>running shoe recommendations</strike>, the Netflix Prize for a while now. The latest iteration is a mix of Java and Scala. I sat down to clean up some of the code tonight, and wanted to rewrite a function that made use of dot product.</p>

<p>Scala, like pretty much every modern programming language, has a REPL, a.k.a. interpreter, making it really easy to work the kinks out of something before getting ant, junit, or an IDE involved. But when I go to paste the methods I need to work with into the REPL, curses, foiled again! That piece of my app is in Java, so I'll have to rewrite it in Scala.</p>

<p>What a wonderful opportunity to talk about how Lists will make all of your wildest dreams come true. Let's see the Java version of unit vector:</p>

<pre class="brush:java">public static double[] unit(double[] vector) {
    double[] unit = new double[vector.length];
    double norm = 0.0;
    for(double v : vector) {
        norm += v * v;
    } 
    norm = Math.sqrt(norm);
    for(int i = 0; i &lt; vector.length; i++) {
        unit[i] = vector[i] / norm;
    }
    return unit;
}</pre>

<p>Wow, that's a lot of typing for such a simple concept. How would I express "divide each component by the magnitude of the vector" in Scala? Well, pretty much exactly like that:</p>

<pre class="brush:scala">def unit(v : List[Double]) = {
  val norm = Math.sqrt((0.0 /: v.map(x =&gt; x * x)) {_ + _})
  v.map(x =&gt; x / norm)
}</pre>

<p>Let's walk through this one. The innermost <code>v.map(x =&gt; x * x)</code> maps the vector to its squares. The <code>/:</code> does a fold left starting with 0.0, and applying <code>{ _ + _ }</code>. Note that <code>/:</code> is a method call on the list returned by the <code>v.map...</code>. This gives the sum of squares; we take the square root to get the magnitude. Our last operation is another map. The tricky part here is that the precedence of the <code>/:</code> operator means you do need parentheses.</p>

<p>Now I have unit vectors, and I need the dot product that operates on unit vectors. If your math is hazy, dot product of unit vectors [a, b, c] and [x, y, z] is a * x + b * y + c * z. Here it is in Java:</p>

<pre class="brush:java">public static double dot(double[] a, double[] b) {
    double result = 0.0;
    for(int i = 0; i &lt; a.length; i++) {
        result += a[i] * b[i];
    }
    return result;
}</pre>

<p>Pretty straightforward, but still imposes the mental effort on you of deciding a name for that variable, and I really hate that. Bottom line? Scala lets me avoid thinking up names for variables:</p>

<pre class="brush:scala">def dot(a : List[Double], b : List[Double]) = {
  ((a zip b).map{case(x,y) =&gt; x * y} :\ 0.0) { _ + _ }
}</pre>

<p>We'll walk through this one. I love <code>zip</code>. I like the Java 5 for-each loop, but it doesn't do me any good when iterating over two lists in parallel. With Scala Lists, you can zip them together and treat them as a single list.</p>

<pre class="brush:scala">scala> List(1, 2, 3) zip List('a, 'b, 'c)
res25: List[(Int, Symbol)] = List((1,'a), (2,'b), (3,'c))</pre>

<p>When I use map, I use pattern matching to get my two items back out. Map expects a function taking a single parameter, and gives it whatever is in the list, which in this case are actually <code>Tuple2</code> objects. Pattern matching allows me to break that tuple apart. I sum the list up, this time using a fold right, which means I need to change the order of the parameters.</p>

<p>So now I'm able to play with things in the REPL, get my replacement code working, and paste it back into my project, and run the junit test to verify that my improvements didn't break anything. At 11pm, between putting the baby to bed and going to bed myself, I don't have much mental energy to hold huge methods in my head. Scala means I don't need to.</p>

<p>The downside is that when I go back to work in the morning, I sit there staring at the screen wondering why Eclipse is not formatting my Java right and displaying little red errors until I realize that I actually need semi-colons and return statements in Java.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Pack multiple small objects in S3 for cost savings</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/06/pack-multiple-small-objects-in-s3-for-cost-savings.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.116</id>

    <published>2009-06-12T07:13:57Z</published>
    <updated>2009-06-17T00:57:56Z</updated>

    <summary>Amazon Web Services offers two forms of data storage. First is S3, a key-value store allowing very large files if needed, but with a pricing model that will cause problems for small files. Second is SimpleDB, a schema-less or column-oriented...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="aws" label="aws" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="communityinsights" label="community-insights" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="java" label="java" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[<p>Amazon Web Services offers two forms of data storage. First is S3, a
key-value store allowing very large files if needed, but with a
pricing model that will cause problems for small files. Second is
SimpleDB, a schema-less or column-oriented database, which allows
storing many small pieces of data, but with limitations that prevent
it scaling to the point that S3 becomes economical. In this post I
describe a system I built to leverage SimpleDB to reduce the costs of
storing small files in S3.</p>

<h2>Context: Biz360 Community Insights</h2>

<p><a href="http://www.biz360.com">Biz360</a> Community Insights is a social media
monitoring and measurement system. We consume various types of social
media -- blogs, forums, microblogs -- perform analysis on each item,
index it, store it, and present it to the user. We process tens of
millions of items every day, ranging from under 1k for a tweet with
metadata and analysis results, up to blog posts that can go above 100k. These items are indexed in Solr, but we don't store
the full text in the index for size reasons.</p>

<h2>The Problem</h2>

<p>The problem is where to store the complete article with metadata when
we are done with it. These average around a few kb. The initial
solution was to store them in S3, but our first month's bill was thousands of dollars. Not for the storage: it was a small amount of data. Not for the
data transfer: everything from EC2 is free. The cost was the $0.01 per
1000 PUT requests. We figured there had to be a way to bring this cost
down.</p>

<p>About the same time, we were having trouble with a choke point in our
application when we did a large map-reduce job to reconcile duplicate
articles which could be sped up with a persistent lookup table of some
kind. So we needed to stop storing all our items in S3, and we needed
a secondary index on at least some of our items. The first thing to
come to mind was SimpleDB. We were already in AWS, it didn't have any
operations headaches, and although the storage cost per GB was higher,
we had now seen that the storage cost wasn't the most significant
factor.</p>

<p>Among the other options we considered was running our own non-relational
database with the top contenders being
<a href="http://couchdb.apache.org/">CouchDB</a> and the schema-less MySQL used
by
<a href="http://bret.appspot.com/entry/how-friendfeed-uses-mysql">FriendFeed</a>. CouchDB
didn't seem all that mature, and we had concerns about the volume of
data we could manage in MySQL and the number of machines we would
need.</p>

<p>SimpleDB turned out to have significant limitations. First, it limits
the size per attribute to 1024 bytes. If we wanted to actually
store our data in SimpleDB, we would have needed some
complicated system to split the large items into 1kb blocks and store
them in several attributes. Secondly, it has a limit of 10GB per
domain. We expected to store orders of magnitude more than this.</p>

<h2>The Solution</h2>

<p>The general solution is that each S3 object stores multiple items, and
two sets of SimpleDB domains provide indexes to those items.</p>
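
<p>To get a feel for why packing matters, here is a rough sketch of the PUT arithmetic. Only the $0.01 per 1000 PUTs price comes from the figures above; the 10 million items per day, the 30-day month, and the class and method names are assumptions made up for illustration, and SimpleDB charges are left out entirely.</p>

<pre class="brush:java">// Illustrative only: item volume and items-per-object are assumed values.
final class PutCostEstimate {
    // S3 PUT price quoted in this post: $0.01 per 1000 requests.
    static final double DOLLARS_PER_1000_PUTS = 0.01;

    /** PUT cost over the given number of days when itemsPerObject items share one S3 object. */
    static double monthlyPutCost(long itemsPerDay, int itemsPerObject, int days) {
        // Round up: a partially filled object still costs one PUT.
        long putsPerDay = (itemsPerDay + itemsPerObject - 1) / itemsPerObject;
        return putsPerDay * days * DOLLARS_PER_1000_PUTS / 1000.0;
    }

    public static void main(String[] args) {
        long itemsPerDay = 10_000_000L; // assumption: low end of "tens of millions"
        System.out.printf("one item per PUT:  $%.2f%n", monthlyPutCost(itemsPerDay, 1, 30));
        System.out.printf("ten items per PUT: $%.2f%n", monthlyPutCost(itemsPerDay, 10, 30));
    }
}</pre>

<p>With those assumed numbers, one item per PUT comes to about $3000 a month, and ten items per object brings that down to about $300.</p>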

<p>All of my objects have a unique item key. In the first set of domains, this
item key will be the "itemName" for SimpleDB. Some of my items have a
secondary key, which I'll call dupeId for this article. In my secondary
SimpleDB set, this dupeId will be the itemName, and the item key will be
one of the stored attributes.</p>

<p>Items are distributed to domains based on their key. I pick a number
of domains that I will partition across. Here you want to estimate the
total volume of data you will be storing in SimpleDB (in my case about
200 bytes per item). Choose a number of domains so that each partition
will store no more than 1GB when you get to maximum capacity.
SimpleDB is actually limited to 10GB per domain, but <strike>I've been told by
Amazon that above 1GB performance starts to degrade. Also,</strike> you will
not be able to change the number of partitions after the fact, so this
gives you sufficient headroom when you realize that you can't actually
delete items. In my case, I'm using 30 domains for my main set of
domains, and 10 domains for my secondary key (which I don't need to
retain for as long). You can have up to 100 domains without having to
contact Amazon. They've been happy to increase our instance limit in the
past, so if you need more, just ask.</p>
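
<p>The sizing rule above is just a division rounded up. A minimal sketch, where the 150 million item ceiling is a made-up number chosen so the result is easy to check, and <code>DomainSizing</code> is a hypothetical helper rather than anything from the system described here:</p>

<pre class="brush:java">// Hypothetical helper: estimates how many SimpleDB domains to partition across.
final class DomainSizing {
    // Decimal gigabyte; close enough for a capacity estimate.
    static final long ONE_GB = 1_000_000_000L;

    /** Smallest number of domains that keeps each one at or under the target size. */
    static int domainsNeeded(long maxItems, long bytesPerItem, long targetBytesPerDomain) {
        long totalBytes = maxItems * bytesPerItem;
        // Integer ceiling division.
        return (int) ((totalBytes + targetBytesPerDomain - 1) / targetBytesPerDomain);
    }

    public static void main(String[] args) {
        // Assumed ceiling of 150 million items at about 200 bytes of metadata each.
        System.out.println(domainsNeeded(150_000_000L, 200L, ONE_GB)); // prints 30
    }
}</pre>

<p>Since the number of partitions can't change later, it is probably safer to feed this the largest item count you can imagine rather than today's volume.</p>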

<p>To store data, I need to store three things:</p>

<ol>
<li>The full item in S3, but remember this is what I want to minimize.</li>
<li>The index from the item key to the S3 location in the primary SimpleDB.</li>
<li>The secondary index from the dupeId to the item key in the
secondary SimpleDB.</li>
</ol>

<p>I will mostly ignore the third step, as it is straightforward and
there is nothing interesting about it.</p>

<h2>Implementation</h2>

<p>I've defined two services, DomainSet and DomainSetS3. My access to
SimpleDB is through <a href="http://code.google.com/p/typica/">Typica</a>. A
domain set is the SimpleDB portion, a domain set with S3 uses a domain
set, a buffer for each partition, and an S3 service. I use
<a href="http://jets3t.s3.amazonaws.com/index.html">JetS3t</a> for S3 access.</p>

<p>A DomainSet contains an array of SimpleDB domains, and I index into
that array using the hash of the item key modulo the number of domains. In my
case, since we will also be using the system from our Rails front end,
I defined my own hash function that I can keep consistent (though I could have just
duplicated Java's hashCode in Ruby). The public interface consists
of only two methods:</p>

<pre class="brush:java">public interface DomainSet {
    public void store(SimpleDbItem item);
    public SimpleDbItem find(String itemName);
}</pre>
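
<p>For the partitioning step, something along these lines would do. This is a sketch under assumptions, not the hash actually used in the system: <code>DomainPartitioner</code>, <code>stableHash</code>, and <code>partitionFor</code> are hypothetical names, and the hash simply spells out the same recurrence as Java's <code>String.hashCode()</code> so that it could be ported to another language.</p>

<pre class="brush:java">// Hypothetical sketch of picking a SimpleDB domain from an item key.
final class DomainPartitioner {
    /** String hash written out explicitly so another language can reproduce it. */
    static int stableHash(String key) {
        int h = 0;
        for (char c : key.toCharArray()) {
            // Same recurrence as String.hashCode(), 32-bit overflow included.
            h = 31 * h + c;
        }
        return h;
    }

    /** Index into the array of domains for this item key. */
    static int partitionFor(String itemKey, int numDomains) {
        // floorMod keeps the index non-negative even when the hash is negative.
        return Math.floorMod(stableHash(itemKey), numDomains);
    }

    public static void main(String[] args) {
        // "item-42" is an arbitrary example key; 30 matches the primary domain count above.
        System.out.println(partitionFor("item-42", 30));
    }
}</pre>

<p>A Ruby port would have to wrap to 32-bit signed integers at each step to match, since Ruby integers don't overflow on their own.</p>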

<p>SimpleDbItem is an object that contains the item name and a list of
attributes. Under the covers, this gets translated to and from the
Typica objects, but doesn't expose the complexity of the SimpleDB
API. Since I <a href="http://www.google.com/search?q=prefer+composition+to+inheritance">prefer composition to
inheritance</a>,
DomainSetS3 uses this interface, but does not extend from it. This
does add a few lines of code (I have to store numDomains and
maxFailures and use both for instantiating my DomainSet), but it
avoids having a meaningless single argument
<code>store(SimpleDbItem)</code> method. The interface is:</p>

<pre class="brush:java">public interface DomainSetS3&lt;T&gt; {
    public void store(SimpleDbItem metadata, T contents);
    public void flush();
    public SimpleDbItem find(String itemName);
    public T loadContents(SimpleDbItem item);
}</pre>

<p>In this case, I've let the abstraction leak a bit -- the only way to
load the contents of an item is to find it in SimpleDB, and then do a
separate request to load it. That's what it's doing under the covers,
of course, and if I combined both of these into one method (perhaps
returning a SimpleDbItemWithData&lt;T&gt;), later down the line I'd
have to provide the simple find for performance reasons. When you want
to store an object, you would create a SimpleDbItem with the metadata
(the key and date are the only values we currently use), and store
it. The DomainSetS3 would add these to the buffer for whatever
partition the item key hashes to until it gets to a predetermined max
buffer size. Depending on your data and access patterns, 5-20 items
per S3 object is about right. When the buffer is full, I create a
HashMap&lt;String, T&gt; and serialize the HashMap into JSON. I create
a UUID and store this JSON object in S3 under that id. I then write my
metadata with the UUID as an additional attribute "s3loc" to SimpleDB.</p>

<p>To retrieve the data from S3, I need the metadata from SimpleDB. I
fetch the object in s3loc from S3, then deserialize it back to
Java. This places limitations on what kind of objects you can store,
but it's not too painful with <a href="http://jackson.codehaus.org/">Jackson</a>
-- the only real restriction is that you know ahead of time what types
of objects you will be deserializing and that they not have any fields
which aren't concrete types. What I get back is the HashMap of String
-&gt; T, but my key here is the same as the item key, which of course
I have.</p>

<p>If I need to do updates, I will follow the same procedure as above,
which results in leaving the original item in the S3 object, but
overwriting the s3loc with the location of the updated item. We're
wasting storage space, but we will always see the correct
version. Without some form of transactions or conditional writes it
would be impossible to guarantee that you can do updates with this
scheme. It's important that anyone building a system on top of AWS
understand the consistency guarantees. Again, this problem (and lots
more!) would go away if Amazon exposed the "semantic reconciliation"
step in the Dynamo system underlying S3. Of course, this step still
exists in S3, but for simplicity the semantics are hard coded to "last
write wins".</p>

<h2>Your mileage may vary</h2>

<p>This system doesn't solve everyone's problem. If you intend to allow
outside access to your data via public S3 buckets or Cloudfront, this
won't work. Additionally, if the data you are storing is too big to
put into SimpleDB, it may be big enough that repeated downloads will
outweigh the cost savings on the PUT side.</p>

<p>If you are doing a large number of updates, and you are storing data
for a long time, you may find that the storage costs start to be
significant. I suppose if you were storing too much obsolete data, you
could iterate through old items, repack these into new S3 objects, and
update SimpleDB, but at this point, your SimpleDB costs will outweigh
your S3 savings.</p>

<h2>A missed opportunity?</h2>

<p>This was an interesting technical problem, finding the best solution
with the pieces at hand, but a voice from my econ classes is
whispering in my head "arbitrage", meaning that Amazon actually has a
perverse pricing system that charges me less money to use more
resources. If I can store items in S3 by laying another system on top
of it, Amazon would certainly be able to do the same, do so more
cheaply, and without the inconsistent interface that you need to use
if you adopt my system. Unless S3 has significantly different
replication and reliability characteristics than SimpleDB, I suspect
that they are overcharging for small PUTs, and this is keeping people
from using the service to the fullest extent.</p>

<p>I've heard from <a href="http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html">Tim
Robertson</a>
that he ran into the same problem with small files and scaled back
what he was storing in S3 to save on
costs. <a href="http://daily.hotpads.com/hotpads_daily/2009/06/hotpads-on-aws.html">HotPads</a>
has also done the calculation and this was a factor in their
reluctance to recommend others adopt Cloudfront as they did.</p>

<h2>SimpleDB: a work in progress</h2>

<p>SimpleDB is a fantastic system, but it's not quite done yet. Amazon
labels it a beta, and it's certainly good enough to use in production,
but there are lots of desirable features that aren't yet there. As I
write this, I have a job that has been running for a day just going
through and deleting old items from my SimpleDB domains -- bulk delete
or delete by query aren't supported yet.</p>

<p>As I learn more about other options, I'm having trouble justifying the
hassle of dealing with things like lack of bulk delete, not being able
to store large values in the table, and the size limits. HBase is
becoming more mature and is now being used to directly serve content
to the web. SimpleDB is likely slightly better value if you are
pinching pennies (for us, SimpleDB is less than 10% of our Hadoop
cluster costs), but if I were doing this again, and I were already
using Hadoop, I would prefer HBase. If you have a smaller system, and
don't want to deal with administration headaches, SimpleDB can still
be a good choice.</p>
]]>
        

    </content>
</entry>

<entry>
    <title>Vote No on everything today</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/05/vote-no-on-everything-today.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.115</id>

    <published>2009-05-19T06:37:29Z</published>
    <updated>2009-05-19T07:07:36Z</updated>

    <summary> I&apos;ve previously written that I don&apos;t like the CA LP&apos;s reasoning on some of tomorrow&apos;s ballot measures. I still think their arguments are only appealing to hard line libertarians, but I&apos;ve been persuaded by other arguments to vote against...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Politics" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[ <p>I've previously written that <a href="/blog/2009/04/ca-state-lp-misguided-on-1d-and-1e.html">I don't like the CA LP's reasoning</a> on some of tomorrow's ballot measures. I still think their arguments are only appealing to hard line libertarians, but I've been persuaded by other arguments to vote against everything tomorrow, and I encourage you to do the same.</p>

<p>I was marginally in favor of the rainy day fund aspects of Prop 1A, but as you can see from the links in the comments on my earlier post, this is a fraud, a repeat of the equally pointless prop 58 that was supposed to solve all our problems only a few years ago.</p>

<p>I argued in favor of 1D and 1E in my earlier post, because they could, in theory, have helped prevent tax hikes. But <a href="http://reason.org/studies/show/1007549.html">the Reason Foundation's analysis</a> has another take on it that I found much more persuasive. Even assuming 1D and 1E can offer short term relief (as I did), they are dangerous because they offer an underhanded technique to grow state spending. First, the taxpayers are duped into supporting an initiative to provide mental health services, or pay for idiotic anti-smoking commercials. Next, once California's public employee unions induce another budget crisis, you cut the programs and transfer the money into the general fund.</p>

<p>I don't think the original supporters of the initiatives that established these programs intended it to be used in this manner, but it would set a dangerous precedent.</p>

<p>I'd even like to see 1F, the "no raises while the budget isn't balanced" measure fail. It would be good to send a message that the con games are over.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Who are these &quot;social media experts&quot;?</title>
    <link rel="alternate" type="text/html" href="http://kdpeterson.net/blog/2009/05/who-are-these-social-media-experts.html" />
    <id>tag:kdpeterson.net,2009:/blog//2.114</id>

    <published>2009-05-14T00:59:02Z</published>
    <updated>2009-05-14T01:27:22Z</updated>

    <summary>I&apos;m following twitter closely tonight to see how the American Idol prediction got picked up. And what I don&apos;t understand is who are all these people who are retweeting it? They don&apos;t seem like real accounts. Or at least, not...</summary>
    <author>
        <name>Kevin Peterson</name>
        
    </author>
    
        <category term="Technology" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="twitter" label="twitter" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-us" xml:base="http://kdpeterson.net/blog/">
        <![CDATA[I'm following twitter closely tonight to see how the <a href="http://snurl.com/idolfinalists">American Idol prediction</a> got picked up. And what I don't understand is who are all these people who are retweeting it? They don't seem like real accounts. Or at least, not accounts that anyone who actually reads twitter would be interested in following. If you search twitter for <a href="http://search.twitter.com/search?q=Biz360+Analysis+of+Social+Media">Biz360 Analysis of Social Media</a> you see hundreds of tweets, all identical. What's even more interesting is if you click through to the people sending these out. I've clicked into about 10, and without exception, they follow this pattern:<br /><br /><ul><li>Have "social media expert" or something similar in their bio.</li><li>Have fewer followers than following</li><li>Are following a huge number of people</li><li>Send out tons of tweets</li><li>Have nothing of value to say</li></ul>Let's take a look at <a href="http://twitter.com/SocialMediaWonk">@SocialMediaWonk</a> and his tweets. As of right now, he is following almost 7000 people. He has over 6000 followers.&nbsp; His first page of tweets starts this morning, so he's sending out over 20 per day. Every single one of his last 20 updates was posting a link similar to ours. No original content, no commentary.<br /><br />So even though a search for "biz360" makes it look like we're all twitter is talking about, when I search for American Idol, I need to go to page 7 before I find a mention of us. Now with SEO gaming, I understand the motivation. What are these people trying to accomplish?<br />]]>
        
    </content>
</entry>

</feed>
