I can solve a similar problem with 4 symbols, but I cannot solve it when there are more than 4 symbols. Here are the probabilities of the symbols:

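[image: table of symbol probabilities; the values used in the answers below are]

$$\begin{array}{rcccccccccccc} X_i:&X_1&X_2&X_3&X_4&X_5&X_6&X_7&X_8&X_9&X_{10}&X_{11}&X_{12}\\ p_i:&0,01&0,1&0,06&0,23&0,05&0,11&0,04&0,03&0,2&0,07&0,02&0,08 \end{array}$$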

I need to construct an optimal code using the Huffman method.


The tree I got looks like this: [image: the asker's tree]


Here is how I reduced the symbols: [image: the asker's reduction steps]

user61810

4 Answers

Just follow the algorithm: combine the least frequent pair, and keep doing so. Here the least frequent are $X_1$ and $X_{11}$; they combine to make a new node $\{X_1,X_{11}\}$ with weight $0,01+0,02=0,03$. Now your list of nodes looks like this, after I reorder it by weight:

$$\begin{array}{rccccccccccc} X_i:&X_8&\{X_1,X_{11}\}&X_7&X_5&X_3&X_{10}&X_{12}&X_2&X_6&X_9&X_4\\ p_i:&0,03&0,03&0,04&0,05&0,06&0,07&0,08&0,1&0,11&0,2&0,23 \end{array}$$

The smallest two are $X_8$ and $\{X_1,X_{11}\}$, so we combine them to make a new node $\{X_1,X_8,X_{11}\}$ of weight $0,03+0,03=0,06$, and we now have the following list:

$$\begin{array}{rcccccccccc} X_i:&X_7&X_5&X_3&\{X_1,X_8,X_{11}\}&X_{10}&X_{12}&X_2&X_6&X_9&X_4\\ p_i:&0,04&0,05&0,06&0,06&0,07&0,08&0,1&0,11&0,2&0,23 \end{array}$$

Now combine $X_7$ and $X_5$ to make a node $\{X_5,X_7\}$ of weight $0,09$, reducing the list to this:

$$\begin{array}{rccccccccc} X_i:&X_3&\{X_1,X_8,X_{11}\}&X_{10}&X_{12}&\{X_5,X_7\}&X_2&X_6&X_9&X_4\\ p_i:&0,06&0,06&0,07&0,08&0,09&0,1&0,11&0,2&0,23 \end{array}$$

The next step combines $X_3$ with $\{X_1,X_8,X_{11}\}$ (weight $0,06+0,06=0,12$), and the one after that combines $X_{10}$ and $X_{12}$ (weight $0,07+0,08=0,15$) to produce the list

$$\begin{array}{rccccccc} X_i:&\{X_5,X_7\}&X_2&X_6&\{X_1,X_3,X_8,X_{11}\}&\{X_{10},X_{12}\}&X_9&X_4\\ p_i:&0,09&0,1&0,11&0,12&0,15&0,2&0,23 \end{array}$$

At this stage your graph looks like this:

       *         *         *         *            *         *        *
      / \       X2        X6        / \          / \       X9       X4
     /   \                         /   \        /   \  
    X5   X7                       *     *      X10  X12
                                 X3    / \        
                                      /   \  
                                     *     *  
                                    / \   X8  
                                   /   \  
                                  *     *  
                                 X1    X11

Can you finish it now? Just keep on doing the same thing until it becomes a single tree. There are seven separate pieces now, so that will take six more steps. Once you have the tree, we can worry about how to get the actual Huffman code from it.
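The merging procedure described above can also be traced mechanically. The following is only a minimal sketch, not part of the original answer: the weights are written in hundredths so that sums and ties are exact, the function name `merge_steps` is made up for illustration, and ties are broken by label, which is an arbitrary choice (a different tie-break can give the other valid tree).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Probabilities from the question, in hundredths so that sums and ties are exact.
my %weight = (
    X1 => 1,  X2  => 10, X3  => 6, X4  => 23,
    X5 => 5,  X6  => 11, X7  => 4, X8  => 3,
    X9 => 20, X10 => 7,  X11 => 2, X12 => 8,
);

# Merge the two lightest nodes until one remains. Returns one
# [left, right, combined weight] record per merge. Ties are broken by
# label, which is an arbitrary choice of this sketch.
sub merge_steps {
    my (%w) = @_;
    my @nodes = map { +{ name => $_, w => $w{$_} } } keys %w;
    my @steps;
    while (@nodes > 1) {
        @nodes = sort { $a->{w} <=> $b->{w} or $a->{name} cmp $b->{name} } @nodes;
        my ($x, $y) = splice(@nodes, 0, 2);
        my $sum = $x->{w} + $y->{w};
        push @steps, [ $x->{name}, $y->{name}, $sum ];
        push @nodes, { name => '{' . $x->{name} . ',' . $y->{name} . '}', w => $sum };
    }
    return @steps;
}

my @steps = merge_steps(%weight);
printf "%2d. %s + %s -> %d\n", $_ + 1, @{ $steps[$_] } for 0 .. $#steps;
```

Starting from the twelve original symbols this prints eleven merges, the first being $X_1$ with $X_{11}$ (weight 3) and the last producing the root of weight 100.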

Added: There are two possible trees; here’s a rough sketch of one of them.

[image: rough sketch of one of the two possible trees]

Brian M. Scott
  • Could you please look at my tree and tell me if it is the right one? – user61810 Mar 25 '13 at 04:56
  • @user61810: I can see one problem with it even before checking any of the details: you have a three-way branch at the top, which is impossible. The final tree must be a binary tree: every branch is a two-way branch. Looking more closely, I see that you combined $\{X_5,X_7\}$ with $X_2$, which is fine, and $\{X_1,X_3,X_8,X_{11}\}$ with $X_6$, which is also fine. Those sets have weights $0,19$ and $0,23$, so the smallest weights now are this $0,19$ and $\{X_{10},X_{12}\}$ at $0,15$. It looks like you combined them correctly at the right-hand side of your tree. Then I think you combined $X_9$ ... – Brian M. Scott Mar 25 '13 at 18:58
  • ... and $X_4$, which is legitimate. (You could instead have combined $X_9$ with the other node of weight $0,23$.) At that point you had a node of weight $66$, combining $X_1,X_3,X_4,X_6,X_8,X_9$, and $X_{11}$, and another of weight $34$ combining the other five nodes. Those two should be side by side, daughters of a single node of weight $100$ up at the very top. The five labelled nodes on the righthand side of your tree are all one level too high: the path from the root to $X_5$, for instance, should have $4$ edges, not $3$. – Brian M. Scott Mar 25 '13 at 19:04
  • I still don't understand how to draw the tree. I needed 10 steps, but the final tree has 6 generations. Could you please look at the tree below posted by Marko Riedel? It does not seem right. Why is $X_4$ on the left? – user61810 Apr 01 '13 at 01:02
  • I’ve added a sketch of one of the two possible trees. Since you started with $12$ symbols, you should have required $11$ steps; the reason you required $10$ is that you made the mistake of combining three at once at your last step. You must always combine just two at a time. Finally, the number of steps does not tell you how many levels you will get. As you see, I have $7$ levels; the other possible tree also has $7$. – Brian M. Scott Apr 01 '13 at 01:30
  • I think in your diagram you have two instances of $X_9$; one of these should be $X_7$. – Marko Riedel Apr 01 '13 at 03:22
  • @Marko: Thanks; the one on the right should be $X_7$. Fixed. – Brian M. Scott Apr 01 '13 at 03:28
  • @Scott: I'm glad to see that your tree agrees with mine and I hope this will help the OP to get a better understanding. Except your placement of $X4$ and $X9$ is slightly different. Are you sure you used a priority queue for the intermediate results (trees)? Note: offline in fifteen minutes. – Marko Riedel Apr 01 '13 at 03:37
  • @Marko: You have the other tree. At one point in the construction there are two nodes of weight $23$, and we chose different ones to combine with $X_9$. – Brian M. Scott Apr 01 '13 at 03:43

This algorithm is very simple, as the other posts point out. Doing your example on paper takes almost as long as writing a program, so here is the program.

First, some sample runs including your example.

$ ./huffman.pl 0.1 0.1 0.1 0.4 0.3
X01 0.100000 1110
X02 0.100000 1111
X03 0.100000 110
X04 0.400000 0
X05 0.300000 10

$ ./huffman.pl 0.01 0.1 0.06 0.23 0.05 0.11 0.04 0.03 0.2 0.07 0.02 0.08
X01 0.010000 011110
X02 0.100000 1111
X03 0.060000 0110
X04 0.230000 10
X05 0.050000 11101
X06 0.110000 010
X07 0.040000 11100
X08 0.030000 01110
X09 0.200000 00
X10 0.070000 1100
X11 0.020000 011111
X12 0.080000 1101

$ ./huffman.pl 15 7 6 6 5
scaling by a factor of 39 at ./huffman.pl line 35.
X01 0.384615 0
X02 0.179487 111
X03 0.153846 101
X04 0.153846 110
X05 0.128205 100

The algorithm follows, implemented in Perl. There is not much to say about it: start with a forest of singleton trees and repeatedly merge the two trees with the smallest sums, recording the cumulative sum in the new root. Then traverse the resulting tree: the path from the root to a leaf (0 for a left branch, 1 for a right branch) gives the Huffman code of that leaf.

#! /usr/bin/perl -w
# Print a Huffman code for the probabilities (or counts) given as arguments.

sub buildcode {
    my ($cref, $pref, $t) = @_;

    if(exists($t->{label})){
      $cref->{$t->{label}} = $pref;
      return;
    }

    buildcode($cref, $pref . '0', $t->{left});
    buildcode($cref, $pref . '1', $t->{right});
}


MAIN: {
    my @freq = @ARGV;

    die "need at least one symbol "
      if scalar(@freq) == 0;

    my $n = scalar(@freq);

    my $total = 0;
    for(my $pos=0; $pos<$n; $pos++){
      my $val = $freq[$pos];
      die "not a decimal number: $val"
          if $val !~ /^\d+(\.\d*)?$/;

      $total += $freq[$pos];
    }

    if(abs(1-$total) > 1e-12){
      warn "scaling by a factor of $total";

      for(my $pos=0; $pos<$n; $pos++){
          $freq[$pos] /= $total;
      }
    }

    my @pool;

    for(my $pos=0; $pos<$n; $pos++){
      push @pool, 
      { sum => $freq[$pos], label => "X" . ($pos+1), };
    }

    @pool = sort { $a->{sum} <=> $b->{sum} } @pool;

    while(scalar(@pool) >= 2){
      my ($ma, $mb);

      $ma = shift @pool; $mb = shift @pool;

      my $node = {
          sum => $ma->{sum} + $mb->{sum},
          left => $ma, right => $mb
      };

      my $pos;
      for($pos = 0; $pos<scalar(@pool); $pos++){
          last if $node->{sum} < $pool[$pos]->{sum};
      }
      splice @pool, $pos, 0, $node;
    }

    my $code = {};
    buildcode $code, '', $pool[0];

    for(my $pos=0; $pos<$n; $pos++){
      printf "X%02d %05f %s\n", $pos+1,
      $freq[$pos], $code->{'X' . ($pos+1)};
    }

    1;
}
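
As a quick check of a code table such as the ones printed in the sample runs, one can encode a short message and decode it again. The following is only a sketch under assumptions: the table is copied by hand from the first sample run, and the helper names `encode` and `decode` are made up here and are not part of the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Code table copied from the first sample run above.
my %code = (
    X01 => '1110', X02 => '1111', X03 => '110', X04 => '0', X05 => '10',
);

# Concatenate the codewords of a list of symbols.
sub encode {
    my ($table, @symbols) = @_;
    return join '', map { $table->{$_} } @symbols;
}

# Read bits left to right and emit a symbol as soon as the bits read so far
# form a codeword. This is unambiguous because no codeword is a prefix of
# another one.
sub decode {
    my ($table, $bits) = @_;
    my %symbol_for = reverse %$table;    # codeword => symbol
    my ($buffer, @symbols) = ('');
    for my $bit (split //, $bits) {
        $buffer .= $bit;
        if (exists $symbol_for{$buffer}) {
            push @symbols, $symbol_for{$buffer};
            $buffer = '';
        }
    }
    die "trailing bits do not form a codeword: $buffer" if length $buffer;
    return @symbols;
}

my $bits = encode(\%code, qw(X04 X05 X03 X01 X04));
print "$bits\n";                                  # 01011011100
print join(' ', decode(\%code, $bits)), "\n";     # X04 X05 X03 X01 X04
```

The round trip works because every symbol sits at a leaf of the tree, so decoding never has to look ahead.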
Marko Riedel

In response to the request for a drawing of the tree rather than a list, here is some Perl code that outputs an ASCII tree. I hope it is of use to those studying Huffman codes and makes it easier to understand what is going on.

Here are some examples:

$ ./huffman-tree.pl 0.1 0.1 0.1 0.4 0.3
+-0--X004 0.400000 0
|
+-1--+-0--X005 0.300000 10
     |
     +-1--+-0--X003 0.100000 110
          |
          +-1--+-0--X001 0.100000 1110
               |
               +-1--X002 0.100000 1111

$ ./huffman-tree.pl 0.01 0.1 0.06 0.23 0.05 0.11 0.04 0.03 0.2 0.07 0.02 0.08
+-0--+-0--X009 0.200000 00
|    |
|    +-1--+-0--X006 0.110000 010
|         |
|         +-1--+-0--X003 0.060000 0110
|              |
|              +-1--+-0--X008 0.030000 01110
|                   |
|                   +-1--+-0--X001 0.010000 011110
|                        |
|                        +-1--X011 0.020000 011111
|
+-1--+-0--X004 0.230000 10
     |
     +-1--+-0--+-0--X010 0.070000 1100
          |    |
          |    +-1--X012 0.080000 1101
          |
          +-1--+-0--+-0--X007 0.040000 11100
               |    |
               |    +-1--X005 0.050000 11101
               |
               +-1--X002 0.100000 1111

$ ./huffman-tree.pl 15 7 6 6 5
scaling by a factor of 39 at ./huffman-tree.pl line 87.
+-0--X001 0.384615 0
|
+-1--+-0--+-0--X005 0.128205 100
     |    |
     |    +-1--X003 0.153846 101
     |
     +-1--+-0--X004 0.153846 110
          |
          +-1--X002 0.179487 111

This is the code.

#! /usr/bin/perl -w
# Draw the Huffman tree for the given probabilities (or counts) as ASCII art.

sub max {
    my ($a, $b) = @_;

    return ($a<$b ? $b : $a);
}

sub calc_dims {
    my ($path, $t) = @_;

    $t->{path} = $path;

    if(exists($t->{label})){
      $t->{sumstr} = sprintf "%05f", $t->{sum};
      $t->{width} = 2 + length($t->{label}) + length($path)
          + length($t->{sumstr});
      $t->{height} = 1;
      return;
    }

    calc_dims($path . '0', $t->{left});
    calc_dims($path . '1', $t->{right});

    $t->{width} = 
      5 + max($t->{left}{width}, $t->{right}{width});
    $t->{height} = 
      1 + $t->{left}{height} + $t->{right}{height};
}

sub draw_tree {
    my ($b, $x, $y, $t) = @_;

    if(exists($t->{label})){
      my (@letters) = 
          split(//, $t->{label} . ' ' . $t->{sumstr} .
              ' ' . $t->{path});
      for(my $ltr=0; $ltr<scalar(@letters); $ltr++){
          $b->[$y][$x+$ltr] = $letters[$ltr];
      }
      return;
    }

    $b->[$y][$x] = '+';
    $b->[$y][$x+1] = '-';
    $b->[$y][$x+2] = '0';
    $b->[$y][$x+3] = '-';
    $b->[$y][$x+4] = '-';

    draw_tree($b, $x+5, $y, $t->{left});

    my $pos;
    for($pos=1; $pos<=$t->{left}{height}; $pos++){
      $b->[$y+$pos][$x] = '|';
    }

    $y += $pos;

    $b->[$y][$x] = '+';
    $b->[$y][$x+1] = '-';
    $b->[$y][$x+2] = '1';
    $b->[$y][$x+3] = '-';
    $b->[$y][$x+4] = '-';

    draw_tree($b, $x+5, $y, $t->{right});
}

MAIN: {
    my @freq = @ARGV;

    die "need at least one symbol "
      if scalar(@freq) == 0;

    my $n = scalar(@freq);

    my $total = 0;
    for(my $pos=0; $pos<$n; $pos++){
      my $val = $freq[$pos];
      die "not a decimal number: $val"
          if $val !~ /^\d+(\.\d*)?$/;

      $total += $freq[$pos];
    }

    if(abs(1-$total) > 1e-12){
      warn "scaling by a factor of $total";

      for(my $pos=0; $pos<$n; $pos++){
          $freq[$pos] /= $total;
      }
    }

    my @pool;

    for(my $pos=0; $pos<$n; $pos++){
      my $label = sprintf "X%03d", ($pos+1);
      push @pool, 
      { sum => $freq[$pos], label => $label };
    }

    @pool = sort { $a->{sum} <=> $b->{sum} } @pool;

    while(scalar(@pool) >= 2){
      my ($ma, $mb);

      $ma = shift @pool; $mb = shift @pool;

      my $node = {
          sum => $ma->{sum} + $mb->{sum},
          left => $ma, right => $mb
      };

      my $pos;
      for($pos = 0; $pos<scalar(@pool); $pos++){
          last if $node->{sum} < $pool[$pos]->{sum};
      }
      splice @pool, $pos, 0, $node;
    }

    calc_dims('', $pool[0]);

    my $board = [];
    for(my $row=0; $row<$pool[0]->{height}; $row++){
      push @$board, [(' ') x ($pool[0]->{width})];
    }

    draw_tree($board, 0, 0, $pool[0]);

    for(my $row=0; $row<$pool[0]->{height}; $row++){
      for(my $col=0; $col<$pool[0]->{width}; $col++){
          print $board->[$row][$col];
      }
      print "\n";
    }

    1;
}
Marko Riedel
  • Thank you. Unfortunately, I cannot compile this. It gives me an error on line 72. – user61810 Mar 26 '13 at 08:42
  • I just verified the code posted above, and it runs fine on my side. A full debugging session would probably be beyond what is acceptable on stackexchange.com, but if you post a single error message I could look at it. My version of Perl is 5.14. – Marko Riedel Mar 26 '13 at 11:13
  • You should run it from the command line with the frequencies as arguments as shown in the examples. – Marko Riedel Mar 26 '13 at 11:22
  • Could you please post the generated tree instead? All I need is to compare the tree I posted above with the one generated by the program. Thank you. – user61810 Mar 26 '13 at 14:31
  • The tree is shown above. It is the best diagram I was able to produce without making a major coding effort. – Marko Riedel Mar 26 '13 at 18:42
  • But why is $X_4$ on the left-hand side of the tree? It should be on the right together with $X_1, X_3, X_4, X_6, X_8, X_9, X_{11}$. – user61810 Apr 01 '13 at 01:05

In response to the user comment regarding the layout of the binary tree I think a case can be made for the layout to be flipped, with a zero bit going on the left and a one bit going on the right. I have implemented this tree layout below. It is important to remember that Huffman trees produced by the basic algorithm are not unique. Therefore I have also modified the program to print all Huffman trees. Warning: this will produce a lot of output. Try on small sets of frequencies first. We could impose additional constraints to reduce the number of solutions, for example that smaller values always go on the left. The only remaining ambiguity in that case would be when the left is equal to the right.

#! /usr/bin/perl -w
# Draw every Huffman tree (all ways of resolving ties) for the given input.

sub max {
    my ($a, $b) = @_;

    return ($a<$b ? $b : $a);
}

sub numeq {
    my ($a, $b) = @_;

    return (abs($a-$b) < 1e-10 ? 1 : 0);
}

sub alltrees {
    my ($poolref, $solref) = @_;

    if(scalar(@$poolref) == 1){
      push @$solref, $poolref->[0];
      return;
    }

    my $prefix = [shift(@$poolref)];

    my $pair = undef;

    my $val = $prefix->[0]->{sum};
    while(scalar(@$poolref)>0 && numeq($poolref->[0]->{sum}, $val)){
      push @$prefix, shift(@$poolref);
    }

    if(scalar(@$prefix)==1){
      $pair = 1;

      $val = $poolref->[0]->{sum};
      while(scalar(@$poolref)>0 && numeq($poolref->[0]->{sum}, $val)){
          push @$prefix, shift(@$poolref); 
      }
    }

    my $preflen = scalar(@$prefix);

    my $total = $prefix->[0]->{sum} + $prefix->[1]->{sum};

    my $pos;
    for($pos = 0; $pos<scalar(@$poolref); $pos++){
      last if $total < $poolref->[$pos]->{sum};
    }

    for(my $i=0; $i<(defined($pair) ? 1 : $preflen); $i++){
      for(my $j=$i+1; $j<$preflen; $j++){
          my ($ma, $mb) = ($prefix->[$i], $prefix->[$j]);
          my ($node, @newpool);

          $node = {
            sum => $total,
            left => $ma, right => $mb
          };

          @newpool = 
            (@$poolref[0..($pos-1)], 
             $node, 
             @$poolref[$pos..$#$poolref]);

          for(my $k=0; $k<$preflen; $k++){
            unshift(@newpool, $prefix->[$k])
                if $k != $i && $k != $j;
          }

          alltrees(\@newpool, $solref);
      }
    }
}

sub calc_dims {
    my ($path, $t) = @_;

    $t->{path} = $path;

    if(exists($t->{label})){
      $t->{sumstr} = sprintf "%05f", $t->{sum};
      $t->{width} = 2 + length($t->{label}) + length($path)
          + length($t->{sumstr});
      $t->{height} = 1;
      return;
    }

    calc_dims($path . '0', $t->{left});
    calc_dims($path . '1', $t->{right});

    $t->{width} = 
      5 + max($t->{left}{width}, $t->{right}{width});
    $t->{height} = 
      1 + $t->{left}{height} + $t->{right}{height};
}

sub draw_tree {
    my ($b, $x, $y, $t) = @_;

    if(exists($t->{label})){
      my (@letters) = 
          split(//, $t->{label} . ' ' . $t->{sumstr} .
              ' ' . $t->{path});
      for(my $ltr=0; $ltr<scalar(@letters); $ltr++){
          $b->[$y][$x+$ltr] = $letters[$ltr];
      }
      return;
    }

    $b->[$y][$x] = '+';
    $b->[$y][$x+1] = '-';
    $b->[$y][$x+2] = '1';
    $b->[$y][$x+3] = '-';
    $b->[$y][$x+4] = '-';

    draw_tree($b, $x+5, $y, $t->{right});

    my $pos;
    for($pos=1; $pos<=$t->{right}{height}; $pos++){
      $b->[$y+$pos][$x] = '|';
    }

    $y += $pos;

    $b->[$y][$x] = '+';
    $b->[$y][$x+1] = '-';
    $b->[$y][$x+2] = '0';
    $b->[$y][$x+3] = '-';
    $b->[$y][$x+4] = '-';

    draw_tree($b, $x+5, $y, $t->{left});
}

MAIN: {
    my @freq = @ARGV;

    die "need at least one symbol "
      if scalar(@freq) == 0;

    my $n = scalar(@freq);

    my $total = 0;
    for(my $pos=0; $pos<$n; $pos++){
      my $val = $freq[$pos];
      die "not a decimal number: $val"
          if $val !~ /^\d+(\.\d*)?$/;

      $total += $freq[$pos];
    }

    if(abs(1-$total) > 1e-12){
      warn "scaling by a factor of $total";

      for(my $pos=0; $pos<$n; $pos++){
          $freq[$pos] /= $total;
      }
    }

    my @pool;

    for(my $pos=0; $pos<$n; $pos++){
      my $label = sprintf "X%03d", ($pos+1);
      push @pool, 
      { sum => $freq[$pos], label => $label };
    }

    my @allsols;
    @pool = sort { $a->{sum} <=> $b->{sum} } @pool;

    alltrees(\@pool, \@allsols);

    foreach my $sol (@allsols){
      calc_dims('', $sol);

      my $board = [];
      for(my $row=0; $row<$sol->{height}; $row++){
          push @$board, [(' ') x ($sol->{width})];
      }

      draw_tree($board, 0, 0, $sol);

      for(my $row=0; $row<$sol->{height}; $row++){
          for(my $col=0; $col<$sol->{width}; $col++){
            print $board->[$row][$col];
          }
          print "\n";
      }

      print "\n";
    }

    1;
}

This produces the following output.

$ ./huffman-tree-order.pl 0.01 0.1 0.06 0.23 0.05 0.11 0.04 0.03 0.2 0.07 0.02 0.08
+-1--+-1--+-1--+-1--X002 0.100000 1111
|    |    |    |
|    |    |    +-0--+-1--X005 0.050000 11101
|    |    |         |
|    |    |         +-0--X007 0.040000 11100
|    |    |
|    |    +-0--+-1--X012 0.080000 1101
|    |         |
|    |         +-0--X010 0.070000 1100
|    |
|    +-0--X004 0.230000 10
|
+-0--+-1--+-1--+-1--+-1--+-1--X011 0.020000 011111
     |    |    |    |    |
     |    |    |    |    +-0--X001 0.010000 011110
     |    |    |    |
     |    |    |    +-0--X008 0.030000 01110
     |    |    |
     |    |    +-0--X003 0.060000 0110
     |    |
     |    +-0--X006 0.110000 010
     |
     +-0--X009 0.200000 00

+-1--+-1--+-1--+-1--X002 0.100000 1111
|    |    |    |
|    |    |    +-0--+-1--X005 0.050000 11101
|    |    |         |
|    |    |         +-0--X007 0.040000 11100
|    |    |
|    |    +-0--+-1--X012 0.080000 1101
|    |         |
|    |         +-0--X010 0.070000 1100
|    |
|    +-0--+-1--+-1--+-1--+-1--X011 0.020000 101111
|         |    |    |    |
|         |    |    |    +-0--X001 0.010000 101110
|         |    |    |
|         |    |    +-0--X008 0.030000 10110
|         |    |
|         |    +-0--X003 0.060000 1010
|         |
|         +-0--X006 0.110000 100
|
+-0--+-1--X004 0.230000 01
     |
     +-0--X009 0.200000 00

$ ./huffman-tree-order.pl 0.1 0.1 0.1 0.4 0.3
+-1--+-1--+-1--+-1--X002 0.100000 1111
|    |    |    |
|    |    |    +-0--X001 0.100000 1110
|    |    |
|    |    +-0--X003 0.100000 110
|    |
|    +-0--X005 0.300000 10
|
+-0--X004 0.400000 0

+-1--+-1--+-1--+-1--X003 0.100000 1111
|    |    |    |
|    |    |    +-0--X001 0.100000 1110
|    |    |
|    |    +-0--X002 0.100000 110
|    |
|    +-0--X005 0.300000 10
|
+-0--X004 0.400000 0

+-1--+-1--+-1--+-1--X003 0.100000 1111
|    |    |    |
|    |    |    +-0--X002 0.100000 1110
|    |    |
|    |    +-0--X001 0.100000 110
|    |
|    +-0--X005 0.300000 10
|
+-0--X004 0.400000 0

$ ./huffman-tree-order.pl  15 7 6 6 5
scaling by a factor of 39 at ./huffman-tree-order.pl line 153.
+-1--+-1--+-1--X002 0.179487 111
|    |    |
|    |    +-0--X004 0.153846 110
|    |
|    +-0--+-1--X003 0.153846 101
|         |
|         +-0--X005 0.128205 100
|
+-0--X001 0.384615 0

+-1--+-1--+-1--X002 0.179487 111
|    |    |
|    |    +-0--X003 0.153846 110
|    |
|    +-0--+-1--X004 0.153846 101
|         |
|         +-0--X005 0.128205 100
|
+-0--X001 0.384615 0
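
Since the trees printed for one input differ only in how ties were resolved, they should all have the same expected codeword length. The following sketch checks this for the two trees of the 12-symbol example; the codewords are copied by hand from the output above, and the helper name `expected_length` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Probabilities of X001..X012 as passed on the command line above.
my @prob = (0.01, 0.1, 0.06, 0.23, 0.05, 0.11, 0.04, 0.03, 0.2, 0.07, 0.02, 0.08);

# Codewords of X001..X012, copied from the two trees printed above.
my @tree1 = qw(011110 1111 0110 10 11101 010 11100 01110 00 1100 011111 1101);
my @tree2 = qw(101110 1111 1010 01 11101 100 11100 10110 00 1100 101111 1101);

# Expected codeword length: sum over symbols of probability times length.
sub expected_length {
    my ($prob, $codes) = @_;
    my $sum = 0;
    $sum += $prob->[$_] * length($codes->[$_]) for 0 .. $#$prob;
    return $sum;
}

printf "tree 1: %.2f bits\n", expected_length(\@prob, \@tree1);   # tree 1: 3.21 bits
printf "tree 2: %.2f bits\n", expected_length(\@prob, \@tree2);   # tree 2: 3.21 bits
```

Both trees give 3.21 bits per symbol; they merely assign the two nodes of weight 0.23 to different halves of the tree.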
Marko Riedel