The underlying abstraction (essentially what you would be using the first network for) is reducing the state space of the raw input through feature extraction/synthesis and/or dimensionality reduction.
At present, there are few definite rules for doing this: practice is more a question of 'informed trial and error'.
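To make the idea of state-space reduction concrete, here is a minimal sketch of one possible pipeline: grayscale and downsample each frame, then project onto a few principal components. It is only an illustration under stated assumptions. Random arrays stand in for real emulator screens, the 84x84 frame size and 16 components are arbitrary choices, and the helper names (`preprocess`, `fit_pca`, `project`) are hypothetical. PCA is just one candidate technique, not a recommendation.

```python
import numpy as np

def preprocess(frame, factor=2):
    """Grayscale an RGB frame, then downsample it by block-averaging."""
    gray = frame.mean(axis=2)                      # (H, W)
    h, w = gray.shape
    h2, w2 = h // factor, w // factor
    gray = gray[:h2 * factor, :w2 * factor]        # crop to a multiple of factor
    return gray.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def fit_pca(X, k):
    """Return the mean and the top-k principal directions of the rows of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def project(X, mean, components):
    """Map rows of X to their k-dimensional PCA coordinates."""
    return (X - mean) @ components.T

# Assumption: random frames stand in for real screen captures.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(50, 84, 84, 3)).astype(np.float64)

X = np.stack([preprocess(f).ravel() for f in frames])  # one row per frame
mean, comps = fit_pca(X, k=16)
Z = project(X, mean, comps)                            # reduced representation
```

Here each frame goes from 84 x 84 x 3 raw values to 16 numbers. Whether those 16 numbers keep what the downstream learner needs is exactly the part that currently has to be found by experiment.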
If you add some information to your question about what has previously been attempted in this area (e.g. on the ALE platform), it might be possible to offer more specific advice.