Recalling the architecture we just presented, a simplified version of FCN-8s can be implemented as follows (note that the actual model has additional convolutions before each transposed one):
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Conv2D, Input

out_ch = 21  # number of output channels, i.e., classes (e.g., 21 for Pascal VOC)
inputs = Input(shape=(224, 224, 3))
# Building a pretrained VGG-16 feature extractor as encoder:
vgg16 = VGG16(include_top=False, weights='imagenet', input_tensor=inputs)
# We recover the feature maps returned by each of the 3 final blocks:
f3 = vgg16.get_layer('block3_pool').output # shape: (28, 28, 256)
f4 = vgg16.get_layer('block4_pool').output # shape: (14, 14, 512)
f5 = vgg16.get_layer('block5_pool').output # shape: ( 7, 7, 512)
# We replace the VGG dense layers with 1x1 convolutions, mapping each set of
# feature maps to `out_ch` channels before the "decoding" layers are applied:
f3 = Conv2D(filters=out_ch, kernel_size=1, padding='same')(f3)
f4 = Conv2D(filters=out_ch, kernel_size=1, padding='same')(f4)
f5 = Conv2D(filters=out_ch, kernel_size=1, padding='same')(f5)
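The listing above stops before the decoding path. The following is a minimal sketch of that path, using stand-in `Input` tensors in place of the `f3`/`f4`/`f5` score maps so it runs on its own; the transposed-convolution kernel sizes and the `out_ch` value of 21 are illustrative assumptions, not fixed by the architecture:

```python
from tensorflow.keras.layers import Add, Conv2DTranspose, Input
from tensorflow.keras.models import Model

out_ch = 21  # illustrative number of classes (e.g., Pascal VOC)

# Stand-ins for the 1x1-convolved VGG-16 feature maps (shapes as above):
f3 = Input(shape=(28, 28, out_ch))
f4 = Input(shape=(14, 14, out_ch))
f5 = Input(shape=(7, 7, out_ch))

# Upsample f5 x2 and merge it with f4 (first skip connection):
f5_x2 = Conv2DTranspose(filters=out_ch, kernel_size=4,
                        strides=2, padding='same')(f5)
m1 = Add()([f4, f5_x2])
# Upsample the merged maps x2 and merge them with f3 (second skip connection):
m1_x2 = Conv2DTranspose(filters=out_ch, kernel_size=4,
                        strides=2, padding='same')(m1)
m2 = Add()([f3, m1_x2])
# Final x8 upsampling back to the input resolution (224 x 224):
outputs = Conv2DTranspose(filters=out_ch, kernel_size=16,
                          strides=8, padding='same')(m2)

decoder = Model(inputs=[f3, f4, f5], outputs=outputs)
```

With `padding='same'`, each transposed convolution multiplies the spatial dimensions by its stride (7 → 14 → 28 → 224), which is what lets the element-wise `Add` merges line up with the encoder's intermediate maps.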