
I have two vectors $ a, b \in \mathbb{R}^{k} $, which can be thought of as feature vectors in machine learning.

By a simple transformation (linear, affine, concatenation, ...), I want to combine them into a vector $ c \in \mathbb{R}^{l} $, $ l < k $, while keeping as much information as possible.

If $ l \geq 2k $, then I think I could just concatenate them. But here $ l < k $, so what should I do?

  • Should I concatenate them, then multiply by a matrix to reduce the dimension to $ l $?
  • Or should I multiply each of them by a matrix to reduce the dimension to $ l/2 $, then concatenate the results?
  • Or should I multiply each of them by a matrix to reduce the dimension to $ l $, then average the results?

Or does it not matter which way I do it?
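
To make the options concrete, here is a rough sketch in numpy (the matrices `W` are only placeholders for whatever linear maps would actually be learned, and the dimensions are illustrative):

```python
import numpy as np

k, l = 128, 32                              # illustrative dimensions, l < k
a, b = np.random.randn(k), np.random.randn(k)

# Option 1: concatenate, then project from 2k down to l
W1 = np.random.randn(l, 2 * k)
c1 = W1 @ np.concatenate([a, b])            # shape (l,)

# Option 2: project each vector to l/2, then concatenate
W2a, W2b = np.random.randn(l // 2, k), np.random.randn(l // 2, k)
c2 = np.concatenate([W2a @ a, W2b @ b])     # shape (l,)

# Option 3: project each vector to l, then average
W3a, W3b = np.random.randn(l, k), np.random.randn(l, k)
c3 = 0.5 * (W3a @ a + W3b @ b)              # shape (l,)
```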

  • How do you define "information"? How do you tell how much or how little information you have kept? – Gerry Myerson Jun 29 '18 at 10:30
  • @GerryMyerson "Information" from a machine learning perspective; I don't know if it can be defined rigorously. It's something like discriminability. For example, concatenation keeps all the information. – THN Jun 29 '18 at 10:33
  • I would suggest selecting any bijection $\Phi: \mathbb{R}\times\mathbb{R}\rightarrow \mathbb{R}$ and defining $c$ by $c_i := \Phi(a_i, b_i)$. (Yes, there are also simple bijections of this type.) – Max Jun 29 '18 at 13:20
  • @Max I'm interested in what form this "bijection" would take. But perhaps in general such a bijection does not exist, because $ l < 2k $. Anyway, I would like to hear any thoughts on this problem. – THN Jul 01 '18 at 07:35
  • https://math.stackexchange.com/questions/183361/examples-of-bijective-map-from-mathbbr3-rightarrow-mathbbr?lq=1 https://math.stackexchange.com/questions/243590/bijection-from-mathbb-r-to-mathbb-rn – Max Jul 07 '18 at 20:11

1 Answer


Learn it! This is what autoencoders do.

Make a neural network with three layers: an input layer of $2k$ neurons, a hidden layer of $l$ neurons, and an output layer of $2k$ neurons. Train this network simply to predict the identity function on your dataset.

The result will be that the neural network learns a 'summary' of the data in $l$ features (the hidden layer) that can be used to reconstruct the input.

After training, simply delete the output layer and use the values of the hidden layer as your features.
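
A minimal sketch of such an autoencoder, assuming PyTorch (the dimensions, activation, learning rate, and training loop are placeholders, not a tuned setup):

```python
import torch
import torch.nn as nn

k, l = 128, 32                      # placeholder dimensions

# Encoder: 2k -> l, decoder: l -> 2k
encoder = nn.Sequential(nn.Linear(2 * k, l), nn.Tanh())
decoder = nn.Linear(l, 2 * k)
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(256, 2 * k)         # placeholder for a batch of concatenated [a, b] pairs

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), x)     # train to reconstruct the input (identity function)
    loss.backward()
    opt.step()

# Keep only the encoder; its output is the l-dimensional combined feature c.
with torch.no_grad():
    c = encoder(x)                  # shape (256, l)
```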


Alternatively, you can go with good old principal component analysis.
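
A sketch of the PCA route, assuming scikit-learn and that each row of `X` is a concatenated $[a, b]$ pair from your dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

k, l = 128, 32
X = np.random.randn(1000, 2 * k)   # placeholder data: each row is a concatenated [a, b] pair
pca = PCA(n_components=l)
C = pca.fit_transform(X)           # C[i] is the l-dimensional combination for row i
```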

orlp
  • Actually, my question is about how to set this up for learning, that is, what the structure of the first layer should be. – THN Jun 29 '18 at 10:35
  • @THN And I'm saying that the reduction from $2k$ features to $l$ features can be learned as well. – orlp Jun 29 '18 at 10:35
  • I see your point. You are talking about finding parameters that can reconstruct the data. But what structure should those parameters have? – THN Jun 29 '18 at 10:37
  • As for the autoencoder structure you describe, it actually does matrix multiplication and then averaging. My concern is which structure is better. – THN Jun 29 '18 at 10:39
  • @THN I don't understand your question. After training the autoencoder you only keep the encoder portion of it and use that as the input for the actual neural network you want to train. – orlp Jun 29 '18 at 10:40
  • My question is more about a theoretical comparison between different ways to combine vectors. There are many choices for setting up the first layer of the autoencoder; which would be best? – THN Jun 29 '18 at 10:44
  • @THN I just meant the usual fully connected hidden layer with $l$ neurons and the usual activation function like ReLU or tanh. If you believe your data is more structured you can add more hidden layers, as long as you keep one hidden layer with $l$ neurons. – orlp Jun 29 '18 at 10:46
  • PCA does not seem suited to my task. Anyway, I am working out the math; they all "seem" equivalent in terms of learnability (thus keeping the same information)... – THN Jun 29 '18 at 10:52
  • @THN PCA and autoencoders with one hidden layer are very similar - they both do a linear component analysis. If you add more hidden layers to an autoencoder it can also handle non-linear relationships. – orlp Jun 29 '18 at 10:55
  • Ah yes, PCA is like a linear autoencoder; I was stuck thinking in terms of two separate input vectors. So maybe we can say something stronger here: PCA is the best for the linear case anyway? – THN Jun 29 '18 at 11:00
  • OK, nice, I think they are all equivalent, at least in the cases I am concerned with. An autoencoder or PCA is indeed the best structure in terms of keeping information while staying simple. Thanks. – THN Jun 29 '18 at 11:25