
I have learned that Keras has functionality to "merge" two models, as in the following:

from keras.models import Sequential
from keras.layers import Dense, Merge

left_branch = Sequential()
left_branch.add(Dense(32, input_dim=784))

right_branch = Sequential()
right_branch.add(Dense(32, input_dim=784))

merged = Merge([left_branch, right_branch], mode='concat')

What is the point of merging NNs, and in which situations is it useful? Is it a kind of ensemble modelling? What is the difference between the several "modes" (concat, avg, dot, etc.) in terms of performance?

Hendrik

1 Answer


It is used for several reasons; basically, it joins multiple networks together. A good example is when you have two types of input, for example tags and an image. You could build a network that, for example, has:

IMAGE -> Conv -> Max Pooling -> Conv -> Max Pooling -> Dense

TAG -> Embedding -> Dense layer

To combine these networks into one prediction and train them together, you could merge the two Dense layers before the final classification, as in the sketch below.
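As a hedged sketch of how that could look (using the Keras 2 functional API; the input shapes, filter counts, and the 1000-tag vocabulary are illustrative assumptions, not anything from the question):

from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                          Dense, Embedding, concatenate)

# Image branch: Conv -> Max Pooling -> Conv -> Max Pooling -> Dense
image_in = Input(shape=(64, 64, 3))
x = Conv2D(32, (3, 3), activation='relu')(image_in)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
image_branch = Dense(64, activation='relu')(x)

# Tag branch: Embedding -> Dense (the 1000-tag vocabulary is an assumption)
tag_in = Input(shape=(1,))
t = Embedding(input_dim=1000, output_dim=16)(tag_in)
t = Flatten()(t)
tag_branch = Dense(64, activation='relu')(t)

# Merge the two Dense branches before the final classification
merged = concatenate([image_branch, tag_branch])
prediction = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[image_in, tag_in], outputs=prediction)
model.compile(optimizer='adam', loss='binary_crossentropy')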

Networks with multiple inputs are the most 'obvious' use case. Here is a picture of a network that combines words with images inside an RNN; the multimodal part is where the two inputs are merged:

[Image: Multimodal Neural Network]

Another example is Google's Inception module, where different convolutions run in parallel on the same input and are concatenated back together before reaching the next layer.
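A rough sketch of such a block (again the functional API; the filter counts and input shape here are made-up illustrations, not Google's actual settings):

from keras.layers import Input, Conv2D, MaxPooling2D, concatenate

inputs = Input(shape=(32, 32, 192))

# Parallel "towers" with different filter sizes all see the same input
tower_1 = Conv2D(64, (1, 1), padding='same', activation='relu')(inputs)
tower_2 = Conv2D(64, (3, 3), padding='same', activation='relu')(inputs)
tower_3 = Conv2D(64, (5, 5), padding='same', activation='relu')(inputs)
tower_4 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(inputs)

# Concatenate the towers along the channel axis before the next layer
inception_block = concatenate([tower_1, tower_2, tower_3, tower_4])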

To feed multiple inputs to Keras you can pass a list of arrays. In the word/image example you would have two lists:

x_input_image = [image1, image2, image3]
x_input_word = ['Feline', 'Dog', 'TV']  # in practice encoded as integer indices for an Embedding layer
y_output = [1, 0, 0]

Then you can fit as follows:

model.fit(x=[x_input_image, x_input_word], y=y_output)
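Prediction follows the same pattern; assuming model is a merged two-input model like the sketch above, you pass one array per input branch, in the same order as the model's inputs:

predictions = model.predict([x_input_image, x_input_word])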
Jan van der Vegt
  • Sorry, I cannot see the point in building separate networks for both the training instances and the labels when it is possible to feed both into a single network in the fitting phase, which does the job anyway. I can see that merging is a possibility, but not its advantage over "non-merging". – Hendrik Aug 16 '16 at 07:59
  • How do you feed them in the fitting phase? The inputs are always separate; you cannot use your convolution layer on your labels, so these layers need to be merged somehow. – Jan van der Vegt Aug 16 '16 at 08:00
  • In Keras, model.fit() accepts both X and y for fitting, and the model in this case can be a "non-merged" model as well. Pretty much like other model types in Sklearn, for example. – Hendrik Aug 16 '16 at 08:13
  • Labels might be a poorly chosen name from my side. Let's say you have a picture and the annotation that goes with that picture, and you want to classify whether that combination is about cats or not; then you have two types of input and one binary output. To get the synergy between them, you will have to merge the layers somewhere. Another example is where you have two pictures, one from the top and one from the bottom, that you have to classify together – Jan van der Vegt Aug 16 '16 at 08:16
  • I've added a picture that hopefully makes things a bit clearer what I meant – Jan van der Vegt Aug 16 '16 at 08:19
  • I think it is worth emphasising that merging is not a "per network" thing, but is essentially combining two layers into a single virtual layer. This enables combinations of architecture components within a network. It is not dependent on having different input types, but does help solve that problem ("I want an RNN because that's best for word sequences, but I also want a CNN because that is best for images - if only there was some way of combining them . . .") – Neil Slater Aug 16 '16 at 08:50
  • How do you evaluate the performance of the member models individually? How do you know which branch is weak if the merged model underperforms? And how do you define the number of input/output layer nodes at the merging phase? This seems independent of the size of the final output layer, doesn't it? – Hendrik Aug 16 '16 at 08:50
  • @Hendrik: There aren't "component models"; there is only one model. It is a complex one, enabled by the layer merging feature. You evaluate it as you do any single model, i.e. with a metric against a hold-out test data set (in the image/words example, with data comprising images, associated partial text, and the next word as the label to predict). If you want, you can inspect the layers within the model to see what they are doing - e.g. the analysis of CNN features can still be applied to the convolutional layers. – Neil Slater Aug 16 '16 at 08:52
  • Yes, it's about merging layers, not networks. Different layers are combined into one layer, and all of it together is one network – Jan van der Vegt Aug 16 '16 at 08:57
  • But why would you do this "image-tag" binary classification in this cumbersome way? It seems to me a two-step process of "conventional" prediction (with any proper model, ANN or not): one step to multiclassify the sentences (to mine the topic) and then a binary classification to decide if the image is related to the topic. What is the additional advantage of these complex networks? More accuracy? – Hendrik Aug 16 '16 at 09:04
  • Yes, better accuracy. There are all kinds of correlations between inputs that belong together that can be captured if you do it within one network. You are throwing away extra information if you don't do it in one network – Jan van der Vegt Aug 16 '16 at 09:05
  • @Hendrik: Specifically in the example, you don't know what the "Image Representation" vector should be in order to be salient to the description. Having this feature being learned, as opposed to being in a constructed pipeline, seems to work well as a strategy (similar arguments hold for why CNNs seem to work better at extracting image features automatically compared to visual bag-of-words or Sobel filters etc - yes you can use those alternatives and performance may be good enough, but the deep CNN will often give better performance, if you have the data) – Neil Slater Aug 16 '16 at 14:08
  • I see, gentlemen. But how do you feed more than one test data set into one (merged) network to fit and predict? I mean, in the example above, one set with the images and one with the annotation texts? In Keras, both fitting and prediction accept only one X data set, whereas in our case we need to feed X1 and X2 as well. – Hendrik Aug 16 '16 at 14:31
  • I'll update my answer – Jan van der Vegt Aug 16 '16 at 14:59
  • Thank you, that is a very helpful addition with the multiple input syntax! In prediction it is supposed to be the following, right? model1.predict_classes([X1, X2]) – Hendrik Aug 17 '16 at 07:54
  • Yes, that is right – Jan van der Vegt Aug 17 '16 at 07:55
  • I suspect that this kind of merged-layer model may be used for machine translation, where the respective input layers feed in the source and destination sentences. But what is the combined output in this case? And how could we use the model in deployment, where there is only the source sentence? – Hendrik Mar 03 '17 at 11:00