Text generation using a basic RNN architecture - TensorFlow tutorial
import tensorflow as tf
tf.enable_eager_execution()
Data
I downloaded some data from Wikipedia as it is publicly available to all users. I extracted it using a Python tool from GitHub, which is reachable under this link. I only used one data file, since the internet is pretty slow where I live and I couldn't upload the merged file. However, after extracting, there is a neat Unix technique to merge text files:
cat files* >> merged_file
path_to_file = tf.keras.utils.get_file('wiki_00',
'http://olaralex.web.elte.hu/text/wiki_00')
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='latin-1')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))
Length of text: 1040052 characters
This tutorial mainly follows the one on TensorFlow’s site. It is pretty easy to start experimenting with RNNs, but I would recommend starting with this explanatory video from MIT.
Printing the text shows that it contains several different Wikipedia articles.
print(text[:2500])
<doc id="1" url="https://simple.wikipedia.org/wiki?curid=1" title="April">
April
April is the fourth month of the year, and comes between March and May. It is one of four months to have 30 days.
April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.
April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.
April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.
April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each others' last days are exactly 35 weeks (245 days) apart.
In common years, April starts on the same day of the week as October of the previous year, and in leap years, May of the previous year. In common years, April finishes on the same day of the week as July of the previous year, and in leap years, February and October of the previous year. In common years immediately after other common years, April starts on the same day of the week as January of the previous year, and in leap years and years immediately after that, April finishes on the same day of the week as January of the previous year.
In years immediately before common years, April starts on the same day of the week as September and December of the following year, and in years immediately before leap years, June of the following year. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.
April is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.
It is unclear as to where April got its name. A common theory is that it comes from the Latin word "aperire", meaning "to open", referring to flowers opening in spring. Another theory is that the name could come from Aphrodite, the Greek goddess of love. It was originally the second month in the old Roman Calendar, before the start of the new year was put to January 1.
Quite a few festivals are held in this month. In many Southeast Asian cul
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))
171 unique characters
A recurrent neural network works by feeding it a data sequence, one input after the other. By doing this the system is capable of building up memory and can later infer new sequences from the learnt patterns.
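To make that recurrence concrete, here is a minimal NumPy sketch (not the model built later in this post); the matrices W_xh, W_hh and the bias b are hypothetical, untrained placeholders, only there to show how the hidden state carries memory from one input to the next.
import numpy as np
# Toy vanilla-RNN cell: the hidden state h is the network's "memory".
# W_xh, W_hh and b are hypothetical, untrained parameters for illustration only.
hidden_size, input_size = 4, 3
W_xh = np.random.randn(hidden_size, input_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
b = np.zeros(hidden_size)
h = np.zeros(hidden_size)                    # memory carried across steps
for x_t in np.random.randn(10, input_size):  # a toy sequence of 10 inputs
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)   # new state mixes the input with the old state
print(h)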
In order to feed the network, the text first needs to be converted to numerical form, i.e. a sequence of integers created from the string input. For this, a mapping from unique characters to integers is needed, and of course a reverse lookup to turn the numerical data back into human-readable format later on.
import numpy as np
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])
print('{')
for char, _ in zip(char2idx, range(20)):
    print(' {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print(' ...\n}')
{
'\n': 0,
' ' : 1,
'!' : 2,
'"' : 3,
'$' : 4,
'%' : 5,
'&' : 6,
"'" : 7,
'(' : 8,
')' : 9,
'*' : 10,
'+' : 11,
',' : 12,
'-' : 13,
'.' : 14,
'/' : 15,
'0' : 16,
'1' : 17,
'2' : 18,
'3' : 19,
...
}
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))
'<doc id="1" u' ---- characters mapped to int ---- > [28 66 77 65 1 71 66 29 3 17 3 1 83]
TensorFlow Dataset
The advanced TensorFlow data library (tf.data) is a tool to handle data efficiently. It uses the ETL paradigm: Extract, Transform and Load.
I will feed the RNN with 100-character-long sequences for training and create the dataset using the from_tensor_slices
method, which simply converts NumPy arrays or TensorFlow tensors into something the Dataset API can handle.
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//seq_length
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
for i in char_dataset.take(14):
    print(idx2char[i.numpy()])
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:532: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
<
d
o
c
i
d
=
"
1
"
u
r
Creating batches is just as easy.
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))
'<doc id="1" url="https://simple.wikipedia.org/wiki?curid=1" title="April">\nApril\n\nApril is the fourth'
' month of the year, and comes between March and May. It is one of four months to have 30 days.\n\nApril'
' always begins on the same day of week as July, and additionally, January in leap years. April always'
" ends on the same day of the week as December.\n\nApril's flowers are the Sweet Pea and Daisy. Its birt"
'hstone is the diamond. The meaning of the diamond is innocence.\n\nApril comes between March and May, m'
Here comes the transform part of ETL: each 101-character sequence is split into two 100-character sequences, the input (the first 100 characters) and the target (the last 100 characters). They overlap in 99 characters, so at every position the target is the character that follows the input character.
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))
Input data: '<doc id="1" url="https://simple.wikipedia.org/wiki?curid=1" title="April">\nApril\n\nApril is the fourt'
Target data: 'doc id="1" url="https://simple.wikipedia.org/wiki?curid=1" title="April">\nApril\n\nApril is the fourth'
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print(" input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print(" expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
Step 0
input: 28 ('<')
expected output: 66 ('d')
Step 1
input: 66 ('d')
expected output: 77 ('o')
Step 2
input: 77 ('o')
expected output: 65 ('c')
Step 3
input: 65 ('c')
expected output: 1 (' ')
Step 4
input: 1 (' ')
expected output: 71 ('i')
A neat explanation of BUFFER_SIZE and the dataset.shuffle method is given by the Google developers in the code comment below. From the printed dataset it can be seen that it yields pairs of (64, 100) input and (64, 100) target batches, i.e. batch_size × sequence_length.
# Batch size
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE
# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset
<DatasetV1Adapter shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
Here comes the Keras implementation of a common RNN variant, the GRU (gated recurrent unit). Plain RNNs have a well-known issue called the vanishing gradient problem: over many recursive steps, repeated multiplication makes the gradients converge to zero, so learning stalls. The common solution is an LSTM or GRU cell, where the goal is to preserve earlier information by introducing gates that let the system keep its state over longer periods.
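For intuition, here is a rough NumPy sketch of a single GRU step (not the Keras layer used below; the matrices Wz, Wr, Wh are hypothetical and untrained, and biases are omitted). The update gate z decides how much of the old state to keep, which is what lets information, and gradients, survive over longer sequences.
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
# Rough sketch of one GRU step with hypothetical, untrained weights.
units, input_size = 4, 3
Wz = np.random.randn(units, units + input_size) * 0.1  # update gate weights
Wr = np.random.randn(units, units + input_size) * 0.1  # reset gate weights
Wh = np.random.randn(units, units + input_size) * 0.1  # candidate state weights
def gru_step(h_prev, x_t):
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                     # update gate: keep vs. overwrite
    r = sigmoid(Wr @ hx)                     # reset gate: how much history to use
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_cand     # blend the old state with the candidate
h = np.zeros(units)
for x_t in np.random.randn(5, input_size):
    h = gru_step(h, x_t)
print(h)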
if tf.test.is_gpu_available():
    rnn = tf.keras.layers.CuDNNGRU
else:
    import functools
    rnn = functools.partial(
        tf.keras.layers.GRU, recurrent_activation='sigmoid')
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        rnn(rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)
The vocabulary size shows up as the last dimension of the example prediction: each output is a distribution over the finite character set, so the values can be mapped back to characters. If it were not bounded by the vocabulary, the predicted numbers could be anything and could not be decoded.
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
(64, 100, 171) # (batch_size, sequence_length, vocab_size)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (64, None, 256) 43776
_________________________________________________________________
cu_dnngru (CuDNNGRU) (64, None, 1024) 3938304
_________________________________________________________________
dense (Dense) (64, None, 171) 175275
=================================================================
Total params: 4,157,355
Trainable params: 4,157,355
Non-trainable params: 0
_________________________________________________________________
Random sampling is used here: for each position, a character index is drawn from the categorical distribution defined by the model’s output logits.
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices
array([ 36, 31, 35, 134, 111, 20, 42, 83, 79, 85, 77, 102, 164,
92, 64, 6, 26, 130, 102, 72, 5, 141, 112, 154, 6, 15,
25, 87, 166, 68, 14, 2, 130, 82, 36, 60, 32, 97, 167,
158, 40, 39, 125, 35, 121, 127, 158, 45, 158, 93, 95, 67,
23, 24, 4, 79, 168, 69, 92, 151, 123, 137, 149, 164, 108,
80, 36, 47, 46, 75, 148, 35, 60, 81, 1, 69, 89, 31,
129, 105, 35, 113, 99, 34, 164, 22, 19, 15, 36, 63, 140,
34, 4, 159, 47, 139, 39, 119, 32, 166])
The untrained network gives pretty shitty predictions at first.
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))
Input:
't Germany, Austria-Hungary and the Ottoman Empire. Australian soldiers were sent to Gallipoli, in th'
Next Char Predictions:
'E?D\xad\x954Kuqwo\x89á~b&:©\x89j%´\x96Ã&/9yåf.!©tE^A\x84æÉIH¤D\xa0¦ÉNÉ\x80\x82e78$qçg~¾¢°¼á\x92rEPOm»D^s g{?¨\x8eD\x97\x86Cá63/Ea³C$ÊP²H\x9eAå'
Sparse categorical cross-entropy loss is used, since the target at each step is a character index from a fixed vocabulary, i.e. a categorical variable.
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss: ", example_batch_loss.numpy().mean())
Prediction shape: (64, 100, 171) # (batch_size, sequence_length, vocab_size)
scalar_loss: 5.14105
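A quick sanity check (my own note, not part of the original tutorial): an untrained model predicts roughly uniformly over the 171 characters, so the expected cross entropy should be close to -ln(1/171), which is close to the scalar_loss printed above.
import numpy as np
# Uniform guessing over the 171-character vocabulary:
print(-np.log(1.0 / 171))  # ~5.14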
model.compile(
    optimizer=tf.train.AdamOptimizer(),
    loss=loss)
EPOCHS=1
history = model.fit(dataset.repeat(),
epochs=EPOCHS, steps_per_epoch=steps_per_epoch)
162/162 [==============================] - 24s 147ms/step - loss: 1.2676
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty list to store our results
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # use a categorical (multinomial) distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
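As a small side experiment (not part of the original notebook), the temperature parameter can be understood by scaling some logits before the softmax: values below 1 sharpen the distribution towards the most likely character, values above 1 flatten it towards a more random choice.
import numpy as np
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
for temperature in (0.5, 1.0, 2.0):
    print(temperature, softmax(logits / temperature).round(3))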
I create a model with a batch_size of 1 and load it with the trained weights.
model_ = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
weights = model.get_weights()
model_.set_weights(weights)
I start with a common starting string, which can be guessed by looking at the input data.
print(generate_text(model_, start_string=u"<doc id="))
<doc id="366" url="https://simple.wikipedia.org/wiki?curid=561" title="Sab
Mauit
Society has many crulded the Chinost Dissian Sodat Mingate present, a social formal powers are called "busto "difficulture", life "thents: [[Bebrier Revolution, they are complex for written in northeastern) would numbers and politics normally to do ofpirsuage on Fear'. Out of Cornainch 2 and Walwabjece are at this first poper. It is not re-cannot in science. Ar s much wordd's any country. Electronic western Capital amin March 21 new state is named a both to SUring social computers for support forces. In speciess:
The autonous city in creating is named Widdings, and that the numberol other festival.
</doc>
<doc id="515" url="https://simple.wikipedia.org/wiki?curid=100" title="Earlly">
Caladianious piece" of 20Â .
</doc>
<doc id="125" url="https://simple.wikipedia.org/wiki?curid=613" title="Shepeanism">
Alagaiis">
Earth (which stored is October informediate:) differces and one that well the flew feutip of mu
Finally, we can see that the generated text is at least human-readable, but it doesn’t make much sense. I’ll try to upload more data and experiment with many more articles to generate better text. I’ll also look into how to improve this model significantly, and maybe train it a bit longer.