
I am writing a TensorFlow program that tries to classify a heavily skewed dataset into one of two categories. One category is represented at roughly 37x the rate of the other:

Category 0: 800
Category 1: 30000

The label for each record is a one-hot vector of length 2, representing the two categories. A plain cross-entropy cost equation doesn't work very well here, since very high accuracy can be attained just by classifying everything as category 1 (always predicting category 1 already scores about 97% accuracy on this data).

I then made a balanced dataset that has the following data:

Category 0: 800
Category 1: 800

This dataset gets about 67% accuracy using softmax, but there is a real-world cost to falsely categorizing a category 1 item as category 0, and that cost is blown way up when the model mistakenly categorizes many more items as category 0.

I wanted to fix this by making the cost equation penalize false categorization into category 0 more heavily than the other errors. However, my cost equation is not producing any different results from the original cross-entropy equation. I think I may have programmed it wrong, but I am unsure where the mistake is.

In theory, the skewed equation adds 1 to the normal cost whenever the label for a record is [0, 1] and the predicted value is [1, 0], which represents a miscategorization of a category 1 item as category 0.
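
For concreteness, here is a minimal NumPy sketch of the per-record penalty I intend (the arrays are illustrative, not from my actual data):

import numpy as np

labels = np.array([[0., 1.],    # category 1
                   [0., 1.],    # category 1
                   [1., 0.]])   # category 0
preds = np.array([[0.9, 0.1],   # category 1 predicted as category 0 -> penalize
                  [0.2, 0.8],   # correct
                  [0.7, 0.3]])  # correct

# 1 for each record labeled category 1 whose predicted probability of
# category 1 is at most 0.5; 0 otherwise.
penalty = ((labels[:, 1] >= 0.5) & (preds[:, 1] <= 0.5)).astype(np.float32)
print(penalty)        # [ 1.  0.  0.]
print(penalty.sum())  # 1.0, added on top of the cross entropy cost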

The skewed cost equation is in the branch where FLAGS.cost == 'skewed':

import tensorflow as tf

# data_length is the number of features per record (defined elsewhere).
x = tf.placeholder(tf.float32, shape=[None, data_length])

W = tf.Variable(tf.zeros([data_length, 2]))
b = tf.Variable(tf.zeros([2]))

# Predicted probabilities for the two categories.
y = tf.nn.softmax(tf.matmul(x, W) + b)

# One-hot labels for the two categories.
y_ = tf.placeholder(tf.float32, shape=[None, 2])

if FLAGS.cost == 'cross_entropy':
    # Standard cross entropy; clipping avoids log(0).
    cost = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y,1e-10,1.0)))
elif FLAGS.cost == 'skewed':
    positive_cutoff = tf.constant([0.5, 0.5])
    desired_tensor = tf.constant([False, True])
    # tf.pack predates TF 1.0 (it was later renamed tf.stack).
    true_tensor = tf.fill(tf.pack([2, tf.shape(y)[0], 2]), True)
    cost = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y,1e-10,1.0)))+tf.to_float(
        tf.reduce_all(
            tf.logical_and(
                tf.pack([
                    tf.logical_and(tf.greater_equal(y_, positive_cutoff), desired_tensor),
                    tf.logical_and(tf.less_equal(y, positive_cutoff), desired_tensor)
                ]),  # shape [2, batch, 2]
                true_tensor  # shape [2, batch, 2]
            )  # shape [2, batch, 2]
        )  # scalar: True only if every element above is True
    )

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
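
For completeness, the training loop looks roughly like this (next_batch, test_xs, and test_ys are illustrative stand-ins for my actual input pipeline):

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())  # pre-TF 1.0 initializer
    for _ in range(1000):
        # next_batch is a hypothetical helper returning (features, one-hot labels).
        batch_xs, batch_ys = next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    print(sess.run(accuracy, feed_dict={x: test_xs, y_: test_ys}))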
user24984

1 Answer


The equation I ended up using is listed in the link below, provided by Emre:

Linear regression with non-symmetric cost function?
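
For anyone hitting the same wall: the skewed term above turns out to be identically zero. desired_tensor forces the category 0 column of both packed tensors to False, and reduce_all then collapses over every axis, so the cast always yields 0.0. And since it is a cast of boolean comparisons, it would carry no gradient even when nonzero, so the optimizer could never respond to it. I haven't copied the exact equation from the link, but the general idea is a differentiable, asymmetric weighting of the loss. A minimal sketch of one common variant, a class-weighted cross entropy (the weight 5.0 is an illustrative value to tune, not taken from the link):

# Per-class weights: mistakes on category 1 records cost 5x as much.
# The 5.0 is an illustrative value, not from the linked answer.
class_weights = tf.constant([1.0, 5.0])
per_example_weight = tf.reduce_sum(y_ * class_weights, 1)  # shape [batch]
per_example_ce = -tf.reduce_sum(
    y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)), 1)       # shape [batch]
cost = tf.reduce_sum(per_example_weight * per_example_ce)

Because the weight multiplies a differentiable cross-entropy term rather than a boolean indicator, gradient descent actually sees the extra penalty.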

user24984