Hello, I'm trying to build a neural network (NN) that convolves a signal with a room impulse response (RIR) to produce an echo signal, as a proof of concept. I'm using the TIMIT database, which I've pre-processed so that all signals have the same length, by appending zeros to the end of each signal until it matches the longest one.
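To make the target concrete, here is a minimal sketch of the pre-processing and of the reference echo I would like the network to reproduce. It is written in plain Python only for illustration. The helper names `pad_to_length` and `convolve` and the toy signals are assumptions standing in for the real TIMIT utterances and a measured RIR, not my actual pipeline.

```python
def pad_to_length(signal, length):
    """Append zeros to the end of `signal` until it has `length` samples."""
    if len(signal) > length:
        raise ValueError("signal is longer than the target length")
    return list(signal) + [0.0] * (length - len(signal))


def convolve(signal, rir):
    """Full linear convolution; the result has len(signal) + len(rir) - 1 samples."""
    if not signal or not rir:
        raise ValueError("signal and rir must be non-empty")
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(rir):
            out[n + k] += x * h
    return out


# Toy data (assumed): two short "utterances" of different lengths and a
# three-tap RIR with a direct path plus one reflection two samples later.
utterances = [[1.0, 0.5], [0.2, -0.4, 0.1]]
rir = [1.0, 0.0, 0.5]

# Zero-pad every utterance to the length of the longest one.
max_len = max(len(u) for u in utterances)
padded = [pad_to_length(u, max_len) for u in utterances]

# Reference echo signals that the network output should approximate.
echoes = [convolve(u, rir) for u in padded]
```

Note that the full convolution is longer than the input by the RIR length minus one, so the reference has to be truncated or the input padded further if the network's input and output sizes must match.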
So far I've used simple networks of five layers or fewer, ranging from a single convolution layer between an image input layer and a regression output layer, to versions that also include a batchNorm layer and a reluLayer. I can't use more than one convolution layer, or the signals are reduced to zero. In terms of results, with just the convolution layer the signal doesn't appear to change, apart from a drastic reduction in magnitude: I can barely hear it, even with headphones, when I play it back with the sound function. If I include more layers, the signal is distorted and ends up sounding like noise rather than an echo.
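To put a number on "barely audible", one option is to compare the RMS level of the network output with that of the reference echo. The following is only a rough sketch in plain Python: `rms` and `level_difference_db` are hypothetical helper names, and the toy signals are assumed, not real network outputs.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def level_difference_db(output, reference):
    """Level of `output` relative to `reference` in dB (negative means quieter)."""
    out_rms, ref_rms = rms(output), rms(reference)
    if out_rms == 0.0 or ref_rms == 0.0:
        raise ValueError("both signals must contain non-zero samples")
    return 20.0 * math.log10(out_rms / ref_rms)


# Toy example (assumed): the output is the reference scaled down by 100,
# which corresponds to a level difference of about -40 dB.
reference = [1.0, -1.0, 1.0, -1.0]
output = [0.01 * s for s in reference]
difference = level_difference_db(output, reference)
```

A large negative value with an otherwise similar waveform would suggest the output is mostly a scaled copy, whereas a noisy output would also differ in shape and not just in level.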
Could someone help me, please?