As far as I can see, there's no reason you couldn't (for example) feed deepdream's convolutions adjacent sample points from a waveform, rather than the adjacent spatial positions used with image input.
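To make that concrete, here's a rough, untested sketch in PyTorch of what that could look like: gradient ascent on a raw waveform through a stack of 1D convolutions. The tiny untrained network, layer sizes, step size and 16 kHz sample rate are all placeholder assumptions on my part, not anything from deepdream itself; you'd want a pretrained audio model for interesting output.

    # Sketch: deepdream-style gradient ascent on raw audio, using 1D
    # convolutions over adjacent sample points instead of 2D convolutions
    # over adjacent pixels. Untrained stand-in net; swap in real weights.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Stand-in "audio" network: a small stack of 1D convolutions.
    net = nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=9, stride=2, padding=4), nn.ReLU(),
        nn.Conv1d(16, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
        nn.Conv1d(32, 64, kernel_size=9, stride=2, padding=4), nn.ReLU(),
    )
    net.eval()

    # Start from one second of quiet noise at an assumed 16 kHz rate.
    audio = torch.randn(1, 1, 16000) * 0.01
    audio.requires_grad_(True)

    for step in range(100):
        activations = net(audio)
        # Deepdream objective: amplify whatever the layer already
        # responds to in the current input.
        loss = activations.pow(2).mean()
        loss.backward()
        with torch.no_grad():
            # Normalised gradient ascent step on the waveform itself.
            audio += 0.05 * audio.grad / (audio.grad.abs().mean() + 1e-8)
            audio.clamp_(-1.0, 1.0)  # keep samples in a valid range
        audio.grad.zero_()

    # `audio` is now a waveform nudged toward patterns the conv filters
    # "hear"; write it out with e.g. torchaudio or soundfile to listen.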
Given the 'self-similar' nature of deepdream images, listening to this fractal granular synthesis technique might be of interest/inspiration.