
In LSTM text generation, can a small amount of training data be compensated for?

asked 1 week ago · 2 answers · 1.3K views

I'm playing with an LSTM to generate text. In particular, this one:

https://raw.githubusercontent.com/fchollet/keras/master/examples/lstm_text_generation.py

It trains on a fairly large demo corpus of Nietzsche's writings and says:

If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.
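For context, the script boils down to roughly the following (a minimal sketch from memory of the linked example; `corpus.txt` is a placeholder and details may vary by Keras version):

```python
# Sketch of the character-level setup in lstm_text_generation.py
# (sizes follow that script; details may differ by Keras version).
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

text = open("corpus.txt").read().lower()  # placeholder path for the corpus
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}

maxlen, step = 40, 3  # 40-character window, sliding by 3
sentences = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
next_chars = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

# One-hot encode the input windows and the next-character targets.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.float32)
y = np.zeros((len(sentences), len(chars)), dtype=np.float32)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1.0
    y[i, char_indices[next_chars[i]]] = 1.0

model = Sequential([
    LSTM(128, input_shape=(maxlen, len(chars))),
    Dense(len(chars), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
model.fit(x, y, batch_size=128, epochs=60)
```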

That size recommendation raises a couple of questions.

A.) Suppose all I want is a model with a very limited vocabulary, where the generated text should be short sentences following a basic pattern.

E.g.

I like blue sky with white clouds

I like yellow fields with some trees

I like big cities with lots of bars

...

Would it then be reasonable to use a much, much smaller dataset?
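To make that concrete, the kind of corpus I have in mind could even be generated from templates instead of collected (the word lists here are just made up):

```python
# Hypothetical: build a tiny corpus of "I like <adj> <noun> with <extra>"
# sentences by sampling from fixed word lists (all lists are made up).
import random

adjectives = ["blue", "yellow", "big", "quiet", "green"]
nouns = ["sky", "fields", "cities", "lakes", "forests"]
extras = ["white clouds", "some trees", "lots of bars", "tiny boats"]

def make_sentence():
    return (f"I like {random.choice(adjectives)} "
            f"{random.choice(nouns)} with {random.choice(extras)}")

corpus = "\n".join(make_sentence() for _ in range(500))
print(len(corpus), "characters")  # far below the ~100k recommendation
```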

B.) If the dataset really needs to be that big, what if I just repeat the text over and over until it reaches the recommended minimum? Even if that worked, I'd wonder how it differs from simply running more training iterations over the same shorter text.
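To illustrate what I mean, here is a toy comparison of the training windows each option would produce (just a sketch, not the actual script):

```python
# Toy comparison for question B: one pass over a duplicated corpus vs.
# many passes over the original. Both yield the same windows, except
# for the ones that straddle the seams between consecutive copies.
maxlen = 10

def windows(t, step=1):
    # character windows of maxlen inputs plus 1 target character
    return [t[i:i + maxlen + 1] for i in range(0, len(t) - maxlen, step)]

text = "I like blue sky with white clouds\n"

one_pass_duplicated = windows(text * 10)  # duplication idea
ten_passes_original = windows(text) * 10  # plain repeated epochs

# The difference is only the windows crossing a copy boundary.
seam_only = set(one_pass_duplicated) - set(ten_passes_original)
print(len(one_pass_duplicated), len(ten_passes_original), len(seam_only))
```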

Obviously I can explore these two questions myself, and in fact I am experimenting with them. One thing I have already found is that with a shorter text following a basic pattern, I can reach a very low loss (~0.04) quite fast, but the generated text still comes out as gibberish.
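For reference, the script doesn't just take the argmax when generating; it re-weights the softmax output by a temperature and samples a character, roughly like this:

```python
import numpy as np

def sample(preds, temperature=1.0):
    # Re-weight the model's softmax output by a temperature and draw
    # one character index (as in the linked example); low temperature
    # is almost greedy, high temperature is more random.
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))
```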

My naive explanation for the gibberish would be that there are just not enough samples to check whether the output actually makes sense or not. But then again, I wonder whether more iterations or duplicating the content would actually help.

I'm trying to experiment with these questions myself, so please don't think I'm just being lazy and hoping others will do the work. I'm simply looking for more experienced people to give me a better understanding of the mechanics that influence these things.

2 Answers

A.) The algorithm can only learn patterns from the data, so if you want it to produce a specific, restricted form of language, it makes sense to curate your data to contain only sentences of that very structure.

If you tune your hyper-parameters so that the model is less complex than the default, it should work on smaller datasets. But this relies on the assumption that the thing you are trying to learn is itself less complex (as opposed to learning from a large corpus of Nietzsche).
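For instance, you could shrink the default model along these lines (the specific sizes here are illustrative guesses, not tuned values):

```python
# Illustrative only: a smaller model for a smaller, simpler corpus.
from keras.models import Sequential
from keras.layers import LSTM, Dense

maxlen, n_chars = 20, 30  # shorter window, small character set (assumed)

model = Sequential([
    LSTM(32, input_shape=(maxlen, n_chars)),  # 32 units vs. the default 128
    Dense(n_chars, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
```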

B.) What is happening is that you are overfitting the data, so the LSTM isn't generalizing to your intended goal. In essence, overfitting means your model is learning irrelevant details that by chance happen to predict the target in the training data. Duplicating the corpus adds no new information, so it is essentially the same as training more epochs on the original text, and both only make overfitting worse. To detect and alleviate this, use a validation set.
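In Keras that could look like this (a self-contained sketch with synthetic stand-in data; swap in your real one-hot arrays):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping

# Synthetic stand-in data just to make the snippet runnable.
maxlen, n_chars = 20, 30
x = (np.random.rand(500, maxlen, n_chars) > 0.9).astype("float32")
y = np.eye(n_chars, dtype="float32")[np.random.randint(n_chars, size=500)]

model = Sequential([LSTM(32, input_shape=(maxlen, n_chars)),
                    Dense(n_chars, activation="softmax")])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

# Hold out 10% of the data; stop once validation loss stops improving,
# which is the point where the model starts to overfit.
model.fit(x, y, batch_size=128, epochs=60, validation_split=0.1,
          callbacks=[EarlyStopping(monitor="val_loss", patience=3)])
```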

And finally (but very importantly), these machine learning techniques don't learn the way humans do. To understand why, you'd need to work through the math behind neural networks; I can't think of an intuitive shortcut.

