🔥Machine Learning Course🔥
Natural Language Processing
NLP has been a very hot topic recently; according to experts, 2020 was the year of NLP, with a series of important works published that significantly improved machines' ability to read and understand natural language. This article aims to give you a broad overview of natural language processing — with the knowledge in it you can truly go FROM ZERO TO HERO.
Contents
In this article I start with the very basics of RNNs and build up to the most modern techniques available today. The contents include:
- Simple RNN’s (simple recurrent networks)
- Word Embeddings: definition and how to use them
- LSTM’s
- GRU’s
- Bi-Directional RNN’s
- Encoder-Decoder Models (Seq2Seq Models)
- Attention Models (the attention mechanism)
- Transformers - Attention is all you need
- BERT
I structure each topic as follows:
- Basic overview
- In-depth understanding: I will link resources for you to explore on your own.
- Code-Implementation
- Code explanation
This is an article I put my heart into, and I promise you will be able to learn all of these techniques from it.
**This article took a lot of effort; please like and share it if you find it useful**
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU,SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
Using TensorFlow backend.
Configuring TPU’s
TPUs are Google hardware for massively parallel computation, fully optimized for deep learning. This notebook uses TPUs with TensorFlow to build the BERT model.
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)
Running on TPU grpc://10.0.0.2:8470
REPLICAS: 8
train = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
We will drop the other label columns and approach this as a binary classification problem. We will also work with a smaller subset of the dataset (only 12,000 data points) to make the models easier to train.
train.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)
train = train.loc[:12000,:]
train.shape
(12001, 3)
We will check the maximum number of words present in a comment; this will help us with padding later.
train['comment_text'].apply(lambda x:len(str(x).split())).max()
1403
Writing a function for getting auc score for validation
def roc_auc(predictions, target):
    '''
    This method returns the AUC score for the given
    predictions and labels
    '''
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    roc_auc = metrics.auc(fpr, tpr)
    return roc_auc
Data Preparation
xtrain, xvalid, ytrain, yvalid = train_test_split(train.comment_text.values, train.toxic.values,
                                                  stratify=train.toxic.values,
                                                  random_state=42,
                                                  test_size=0.2, shuffle=True)
Before we start
Before we begin, if you are completely new to NLP, please read the following kernels to start your natural language journey with us:
- https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial
- https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
If you want to start with something even more basic, this is a good choice:
- https://www.kaggle.com/tanulsingh077/what-s-cooking
Below are basic resources to get started with the fundamentals of artificial neural networks; they will help you understand the sections that follow:
- https://www.youtube.com/watch?v=aircAruvnKk&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv
- https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=2
- https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=3
- https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=4
To learn how to visualize data, please refer to:
- https://www.kaggle.com/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model
- https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
Simple RNN
Basic Overview
What is an RNN?
A Recurrent Neural Network (RNN) is a type of neural network in which the output of the previous step is fed as input to the next step. In a classical neural network, all inputs and outputs are independent of each other, but when you need to predict the next word in a sentence, the preceding words matter and remembering them is essential. RNNs solve this problem and help the network relate the words in an input sequence to one another.
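To make the idea concrete, here is a minimal numpy sketch of the recurrence a simple RNN cell computes (the shapes and random initialization are purely illustrative, not taken from the model we build below):

```python
import numpy as np

hidden_size, input_size, seq_len = 4, 3, 5
Wxh = 0.1 * np.random.randn(hidden_size, input_size)   # input-to-hidden weights
Whh = 0.1 * np.random.randn(hidden_size, hidden_size)  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                            # the hidden state carries context between steps
for x_t in np.random.randn(seq_len, input_size):     # a toy sequence of 5 "word vectors"
    h = np.tanh(Wxh @ x_t + Whh @ h + b)             # h_t depends on the current input AND h_{t-1}
print(h)                                             # final hidden state summarizing the sequence
```

The key point is that the same weights are reused at every step, and the hidden state is what lets later steps "remember" earlier words.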
Why RNN’s?
https://www.quora.com/Why-do-we-use-an-RNN-instead-of-a-simple-neural-network
In-Depth Understanding
- https://medium.com/mindorks/understanding-the-recurrent-neural-network-44d593f112a2
- https://www.youtube.com/watch?v=2E65LDnM2cA&list=PL1F3ABbhcqa3BBWo170U4Ev2wfsF7FN8l
- https://www.d2l.ai/chapter_recurrent-neural-networks/rnn.html
Code Implementation
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500
token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)
#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)
word_index = token.word_index
%%time
with strategy.scope():
    # A simpleRNN without any pretrained embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        input_length=max_len))
    model.add(SimpleRNN(100))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 1500, 300) 13049100
_________________________________________________________________
simple_rnn_1 (SimpleRNN) (None, 100) 40100
_________________________________________________________________
dense_1 (Dense) (None, 1) 101
=================================================================
Total params: 13,089,301
Trainable params: 13,089,301
Non-trainable params: 0
_________________________________________________________________
CPU times: user 620 ms, sys: 370 ms, total: 990 ms
Wall time: 1.18 s
model.fit(xtrain_pad, ytrain, nb_epoch=5, batch_size=64*strategy.num_replicas_in_sync) #Multiplying by Strategy to run on TPU's
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning:
The `nb_epoch` argument in `fit` has been renamed `epochs`.
Epoch 1/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.3714 - accuracy: 0.8805
Epoch 2/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.2858 - accuracy: 0.9055
Epoch 3/5
9600/9600 [==============================] - 40s 4ms/step - loss: 0.2748 - accuracy: 0.8945
Epoch 4/5
9600/9600 [==============================] - 40s 4ms/step - loss: 0.2416 - accuracy: 0.9053
Epoch 5/5
9600/9600 [==============================] - 39s 4ms/step - loss: 0.2109 - accuracy: 0.9079
<keras.callbacks.callbacks.History at 0x7fae866d75c0>
scores = model.predict(xvalid_pad)
print("Auc: %.2f%%" % (roc_auc(scores,yvalid)))
Auc: 0.69%
scores_model = []
scores_model.append({'Model': 'SimpleRNN','AUC_Score': roc_auc(scores,yvalid)})
Code Explanation
- Tokenization
If you have watched the videos or read the links we suggested, you will have seen that the input to an RNN is a sentence of consecutive words. We represent each word by a one-hot vector of dimension (number of words in the vocabulary) x 1.
What the Keras Tokenizer does is take all the unique words in the corpus, form a dictionary with words as keys and their number of occurrences as values, and then sort that dictionary in descending order of counts. It then assigns the first word the value 1, the second word the value 2, and so on. So suppose the word ‘the’ occurred the most in the corpus; it would be assigned index 1, and the vector representing ‘the’ would be a one-hot vector with a 1 at position 1 and zeros everywhere else.
Try printing the first element of xtrain_seq and you will see that every word is now represented as an integer:
xtrain_seq[:1]
[[664,
65,
7,
19,
2262,
14102,
5,
2262,
20439,
6071,
4,
71,
32,
20440,
6620,
39,
6,
664,
65,
11,
8,
20441,
1502,
38,
6072]]
Now you might be wondering: what is padding, and why is it done?
Here is the answer:
- https://www.quora.com/Which-effect-does-sequence-padding-have-on-the-training-of-a-neural-network
- https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/
- https://www.coursera.org/lecture/natural-language-processing-tensorflow/padding-2Cyzs
Also, people sometimes use special tokens while tokenizing, such as EOS (end of string) and BOS (beginning of string). Here is why that is done:
- https://stackoverflow.com/questions/44579161/why-do-we-do-padding-in-nlp-tasks
The code token.word_index simply gives the vocabulary dictionary that Keras created for us.
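To make this concrete, here is a tiny, self-contained illustration on a made-up two-sentence corpus (not the competition data), using the same `text` and `sequence` modules imported at the top of the notebook. The exact indices shown in the comments depend on word frequencies, so treat them as an example:

```python
toy_token = text.Tokenizer(num_words=None)
toy_token.fit_on_texts(["the cat sat", "the cat ate the fish"])
print(toy_token.word_index)        # e.g. {'the': 1, 'cat': 2, 'sat': 3, 'ate': 4, 'fish': 5}

toy_seq = toy_token.texts_to_sequences(["the cat sat", "the cat ate the fish"])
print(toy_seq)                     # [[1, 2, 3], [1, 2, 4, 1, 5]]

# zero padding (pre-padding by default) makes every sequence the same length
print(sequence.pad_sequences(toy_seq, maxlen=6))
# [[0 0 0 1 2 3]
#  [0 1 2 4 1 5]]
```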
- Building the Neural Network
To understand the input and output shapes of an RNN, please have a look at this very interesting article: https://medium.com/@shivajbd/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
The first line, model = Sequential(), tells Keras that we will build our network sequentially. We then add the Embedding layer: a layer that takes the n-dimensional one-hot vector of every word as input and converts it into a 300-dimensional vector, giving us word embeddings similar to word2vec. We could have used word2vec directly, but the embedding layer learns during training, which can improve the embeddings for our task. Next we add 100 SimpleRNN units without any dropout or regularization. Finally we add a single neuron with a sigmoid activation that takes the output of the 100 recurrent units (note: 100 units, not 100 layers) to predict the result, and we compile the model with the Adam optimizer.
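As a side note, here is a small numpy sketch (with made-up sizes) of what the embedding layer conceptually does: multiplying a one-hot vector by the embedding matrix just selects one of its rows, so the layer is effectively a trainable lookup table from word index to dense vector.

```python
import numpy as np

vocab_size, emb_dim = 5, 3
E = np.random.randn(vocab_size, emb_dim)     # the layer's trainable weight matrix

word_id = 2
one_hot = np.eye(vocab_size)[word_id]        # one-hot vector for word index 2
print(np.allclose(one_hot @ E, E[word_id]))  # True: the matmul is just a row lookup
```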
- Comments on the model
We can see that the model reaches a high training accuracy (around 0.91 in this run) while the validation AUC is only about 0.69, so we are clearly overfitting. But this was the simplest model of all; we can tune many hyperparameters such as the number of RNN units, and add batch normalization, dropout, etc. to get better results. The point is that we got a working model without much effort and have now learned the basics of RNNs. Deep learning really is revolutionary.
Word Embeddings
When we built the RNN model, we used word embeddings. So what is a word embedding, and how do we build one? Here is the answer:
- https://www.coursera.org/learn/nlp-sequence-models/lecture/6Oq70/word-representation
- https://machinelearningmastery.com/what-are-word-embeddings/
The latest approaches to obtaining word embeddings use pretrained GloVe or fastText vectors. Without going into too much detail, I will explain how to create sentence vectors and how we can build a machine learning model on top of them. Since I am a fan of GloVe vectors, word2vec and fastText, in this notebook I will use the GloVe vectors. You can download them from http://www-nlp.stanford.edu/data/glove.840B.300d.zip, or search for GloVe in the Kaggle datasets and add the file to the notebook.
# load the GloVe vectors in a dictionary:
embeddings_index = {}
f = open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8')
for line in tqdm(f):
    values = line.split(' ')
    word = values[0]
    coefs = np.asarray([float(val) for val in values[1:]])
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
2196018it [06:43, 5439.09it/s]
Found 2196017 word vectors.
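As a quick aside, here is a small sketch (a hypothetical helper, not part of the original pipeline) of one common way to turn the loaded GloVe dictionary into a sentence vector: average the vectors of the words that appear in embeddings_index.

```python
def sentence_vector(sentence, embeddings_index, dim=300):
    # average the GloVe vectors of the known words; unknown words are skipped
    vectors = [embeddings_index[w] for w in sentence.split() if w in embeddings_index]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# e.g. sentence_vector("this comment is fine", embeddings_index).shape -> (300,)
```

Such averaged sentence vectors can be fed to any classical ML classifier, though in this notebook we instead feed the per-word vectors to recurrent models through an embedding matrix.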
LSTM’s
Basic Overview
Simple RNN’s were certainly better than classical ML algorithms and gave state-of-the-art results, but they failed to capture the long-term dependencies present in sentences. LSTM’s were therefore introduced (by Hochreiter and Schmidhuber in 1997) to counter these drawbacks.
In Depth Understanding
Why LSTM’s?
- https://www.coursera.org/learn/nlp-sequence-models/lecture/PKMRR/vanishing-gradients-with-rnns
- https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
What are LSTM’s?
- https://www.coursera.org/learn/nlp-sequence-models/lecture/KXoay/long-short-term-memory-lstm
- https://distill.pub/2019/memorization-in-rnns/
- https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
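For intuition before the full model, here is a minimal numpy sketch of what a single LSTM step computes (shapes and parameter packing are illustrative and follow the standard textbook formulation, not Keras internals):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the stacked parameters of the four components: f, i, o, g
    z = W @ x_t + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget / input / output gates in (0, 1)
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # cell state: the long-term memory
    h = o * np.tanh(c)                             # hidden state: what gets exposed at this step
    return h, c
```

The cell state c is what lets information flow across many time steps without vanishing, which is exactly what the simple RNN above struggles with.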
Code Implementation
We have already tokenized and padded our text as input for the LSTM.
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
100%|██████████| 43496/43496 [00:00<00:00, 183357.18it/s]
%%time
with strategy.scope():
    # A simple LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 1500, 300) 13049100
_________________________________________________________________
lstm_1 (LSTM) (None, 100) 160400
_________________________________________________________________
dense_2 (Dense) (None, 1) 101
=================================================================
Total params: 13,209,601
Trainable params: 160,501
Non-trainable params: 13,049,100
_________________________________________________________________
CPU times: user 1.33 s, sys: 1.46 s, total: 2.79 s
Wall time: 3.09 s
model.fit(xtrain_pad, ytrain, nb_epoch=5, batch_size=64*strategy.num_replicas_in_sync)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning:
The `nb_epoch` argument in `fit` has been renamed `epochs`.
Epoch 1/5
9600/9600 [==============================] - 117s 12ms/step - loss: 0.3525 - accuracy: 0.8852
Epoch 2/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.2397 - accuracy: 0.9192
Epoch 3/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1904 - accuracy: 0.9333
Epoch 4/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1659 - accuracy: 0.9394
Epoch 5/5
9600/9600 [==============================] - 114s 12ms/step - loss: 0.1553 - accuracy: 0.9470
<keras.callbacks.callbacks.History at 0x7fae84dac710>
scores = model.predict(xvalid_pad)
print("Auc: %.2f%%" % (roc_auc(scores,yvalid)))
Auc: 0.96%
scores_model.append({'Model': 'LSTM','AUC_Score': roc_auc(scores,yvalid)})
Code Explanation
As a first step we build an embedding matrix for our vocabulary from the pretrained GloVe vectors. When building the embedding layer we then pass this matrix as the layer's weights instead of learning it from our vocabulary, which is why we set trainable=False. The rest of the model is the same as before, except that the SimpleRNN has been replaced by LSTM units.
- Comments on the Model
We now see that the model is not overfitting and achieves an AUC score of 0.96, which is quite commendable; the gap between accuracy and AUC has also narrowed. Note that in this case we used dropout, which helped prevent overfitting.
GRU’s
Basic Overview
Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) was designed to address the vanishing gradient problem. GRU’s are a variation of the LSTM: both are designed along similar lines and, in many cases, produce equally good results. GRU’s are simpler and faster than LSTM’s and usually perform comparably, so there is no clear winner.
In Depth Explanation
- https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
- https://www.coursera.org/learn/nlp-sequence-models/lecture/agZiL/gated-recurrent-unit-gru
- https://www.geeksforgeeks.org/gated-recurrent-unit-networks/
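For intuition, here is a minimal numpy sketch of a single GRU step (shapes are illustrative; note that the exact convention for which gate blends the old state differs slightly between the original paper and some implementations such as Keras — this follows the Cho et al. formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate: how much to refresh the state
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate: how much of the past to use
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_cand             # blend of old state and candidate
```

Compared with the LSTM sketch above, the GRU has no separate cell state and uses two gates instead of three, which is why it is lighter and faster.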
Code Implementation
%%time
with strategy.scope():
    # GRU with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(SpatialDropout1D(0.3))
    model.add(GRU(300))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_3 (Embedding) (None, 1500, 300) 13049100
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 1500, 300) 0
_________________________________________________________________
gru_1 (GRU) (None, 300) 540900
_________________________________________________________________
dense_3 (Dense) (None, 1) 301
=================================================================
Total params: 13,590,301
Trainable params: 541,201
Non-trainable params: 13,049,100
_________________________________________________________________
CPU times: user 1.3 s, sys: 1.29 s, total: 2.59 s
Wall time: 2.79 s
model.fit(xtrain_pad, ytrain, nb_epoch=5, batch_size=64*strategy.num_replicas_in_sync)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning:
The `nb_epoch` argument in `fit` has been renamed `epochs`.
Epoch 1/5
9600/9600 [==============================] - 191s 20ms/step - loss: 0.3272 - accuracy: 0.8933
Epoch 2/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.2015 - accuracy: 0.9334
Epoch 3/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.1540 - accuracy: 0.9483
Epoch 4/5
9600/9600 [==============================] - 189s 20ms/step - loss: 0.1287 - accuracy: 0.9548
Epoch 5/5
9600/9600 [==============================] - 188s 20ms/step - loss: 0.1238 - accuracy: 0.9551
<keras.callbacks.callbacks.History at 0x7fae5b01ed30>
scores = model.predict(xvalid_pad)
print("Auc: %.2f%%" % (roc_auc(scores,yvalid)))
Auc: 0.97%
scores_model.append({'Model': 'GRU','AUC_Score': roc_auc(scores,yvalid)})
scores_model
[{'Model': 'SimpleRNN', 'AUC_Score': 0.6949714081921305},
{'Model': 'LSTM', 'AUC_Score': 0.9598235453841757},
{'Model': 'GRU', 'AUC_Score': 0.9716554069114769}]
Bi-Directional RNN’s
In Depth Explanation
- https://www.coursera.org/learn/nlp-sequence-models/lecture/fyXnn/bidirectional-rnn
- https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66
- https://d2l.ai/chapter_recurrent-modern/bi-rnn.html
Code Implementation
%%time
with strategy.scope():
    # A simple bidirectional LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_4 (Embedding) (None, 1500, 300) 13049100
_________________________________________________________________
bidirectional_1 (Bidirection (None, 600) 1442400
_________________________________________________________________
dense_4 (Dense) (None, 1) 601
=================================================================
Total params: 14,492,101
Trainable params: 1,443,001
Non-trainable params: 13,049,100
_________________________________________________________________
CPU times: user 2.39 s, sys: 1.62 s, total: 4 s
Wall time: 3.41 s
model.fit(xtrain_pad, ytrain, nb_epoch=5, batch_size=64*strategy.num_replicas_in_sync)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning:
The `nb_epoch` argument in `fit` has been renamed `epochs`.
Epoch 1/5
9600/9600 [==============================] - 322s 34ms/step - loss: 0.3171 - accuracy: 0.9009
Epoch 2/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1988 - accuracy: 0.9305
Epoch 3/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1650 - accuracy: 0.9424
Epoch 4/5
9600/9600 [==============================] - 318s 33ms/step - loss: 0.1577 - accuracy: 0.9414
Epoch 5/5
9600/9600 [==============================] - 319s 33ms/step - loss: 0.1540 - accuracy: 0.9459
<keras.callbacks.callbacks.History at 0x7fae5a4ade48>
scores = model.predict(xvalid_pad)
print("Auc: %.2f%%" % (roc_auc(scores,yvalid)))
Auc: 0.97%
scores_model.append({'Model': 'Bi-directional LSTM','AUC_Score': roc_auc(scores,yvalid)})
Code Explanation
The code is the same as before; we have only wrapped the LSTM cells used earlier in a Bidirectional layer, so it is largely self-explanatory. We achieve a similar accuracy and AUC score as before, and we have now covered all the typical RNN architectures.
We are now at the end of part 1 of this notebook, and things are about to get wilder as we enter more complex, state-of-the-art models. If you have followed along from the start, read all the articles and understood everything, these complex models will be fairly easy to grasp. I recommend finishing part 1 before continuing, as the upcoming techniques can be quite overwhelming.
Seq2Seq Model Architecture
Overview
RNN’s come in many types, and different architectures are used for different purposes. Here is a nice video explaining the different types of model architectures: https://www.coursera.org/learn/nlp-sequence-models/lecture/BO8PS/different-types-of-rnns. Seq2Seq is a many-to-many RNN architecture where both the input and the output are sequences (and the input and output sequences may have different lengths). This architecture is used in many applications such as machine translation, text summarization, and question answering.
In Depth Understanding
I will not write a full code implementation for this; instead I will point you to resources where the code has already been implemented and explained far better than I could. A minimal orientation sketch follows the list below.
- https://www.coursera.org/learn/nlp-sequence-models/lecture/HyEui/basic-models —> A basic idea of different Seq2Seq models
- https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html , https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/ —> Basic Encoder-Decoder model and its explanation, respectively
- https://towardsdatascience.com/how-to-implement-seq2seq-lstm-model-in-keras-shortcutnlp-6f355f3e5639 —> A more advanced Seq2Seq model and its explanation
- https://d2l.ai/chapter_recurrent-modern/machine-translation-and-dataset.html , https://d2l.ai/chapter_recurrent-modern/encoder-decoder.html —> Implementation of the Encoder-Decoder model from scratch
- https://www.youtube.com/watch?v=IfsjMg4fLWQ&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=8&t=0s —> Introduction to Seq2Seq by fast.ai
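For orientation only, here is a minimal encoder-decoder sketch in the same Keras style as the rest of this notebook (vocabulary sizes and dimensions are placeholders, and it is no substitute for the tutorials linked above): the encoder compresses the source sequence into its final states, and the decoder generates the target sequence starting from those states.

```python
from keras.layers import Input, LSTM, Dense, Embedding
from keras.models import Model

src_vocab, tgt_vocab, latent_dim = 10000, 10000, 256   # placeholder sizes

# Encoder: read the source sequence, keep only the final hidden/cell states
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generate the target sequence, initialized with the encoder's states
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_out = Dense(tgt_vocab, activation='softmax')(dec_out)

seq2seq = Model([enc_inputs, dec_inputs], dec_out)
seq2seq.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```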
# Visualization of Results obtained from various Deep learning models
results = pd.DataFrame(scores_model).sort_values(by='AUC_Score',ascending=False)
results.style.background_gradient(cmap='Blues')
| | Model | AUC_Score |
|---|---|---|
| 2 | GRU | 0.971655 |
| 3 | Bi-directional LSTM | 0.966693 |
| 1 | LSTM | 0.959824 |
| 0 | SimpleRNN | 0.694971 |
fig = go.Figure(go.Funnelarea(
    text=results.Model,
    values=results.AUC_Score,
    title={"position": "top center", "text": "Funnel-Chart of Model AUC Scores"}
))
fig.show()
Attention Models
This is the most valuable and exciting part of the article. If you understand how an attention block works, understanding transformers and transformer-based architectures such as BERT becomes much easier; if you don't yet, that's fine too — I will share plenty of resources to close that gap:
- https://www.coursera.org/learn/nlp-sequence-models/lecture/RDXpX/attention-model-intuition –> Only watch this video and not the next one
- https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a
- https://towardsdatascience.com/attention-and-its-different-forms-7fc3674d14dc
- https://distill.pub/2016/augmented-rnns/
Code Implementation
- https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/ –> Basic Level
- https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html —> Implementation from Scratch in Pytorch
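To complement the links above, here is a tiny numpy sketch of the core computation in dot-product form (the same building block the Transformer paper calls scaled dot-product attention): the decoder query scores every encoder state, the scores are softmax-normalized into attention weights, and the context vector is the weighted sum of the encoder states. The shapes below are made up for illustration.

```python
import numpy as np

def dot_product_attention(query, keys, values):
    scores = keys @ query / np.sqrt(query.shape[-1])  # one score per encoder step
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                 # softmax -> attention weights
    return weights @ values, weights                  # context vector and the weights

enc_states = np.random.randn(7, 64)   # 7 encoder time steps, hidden size 64
dec_query = np.random.randn(64)       # current decoder state acting as the query
context, attn = dot_product_attention(dec_query, enc_states, enc_states)
print(context.shape, attn.shape)      # (64,) (7,)
```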
Transformers : Attention is all you need
We have finally reached the end of the learning curve and are about to start learning the technology that changed NLP completely and underlies today's state-of-the-art NLP techniques. Transformers were introduced in the paper "Attention Is All You Need" by Google. If you have understood attention models, this will be very easy. Here are transformers fully explained:
- http://jalammar.github.io/illustrated-transformer/
Code Implementation
- http://nlp.seas.harvard.edu/2018/04/03/attention.html —> This presents the code implementation of the architecture presented in the paper by Google
BERT and Its Implementation on this Competition
I am sure the following material will help you better understand BERT, the most popular NLP architecture today:
- http://jalammar.github.io/illustrated-bert/ —> In Depth Understanding of BERT
After going through the article above, I am sure you now understand transformers. They are used in the following two ways:
1) Using a pre-trained model without retraining it
- EG: http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/ —> Using Pre-trained BERT without Tuning
2) Fine-tuning a pre-trained model for a smaller downstream problem
- EG: https://www.youtube.com/watch?v=hinZO–TEk4&t=2933s —> Tuning BERT for your task
We will use the first example as the base for our implementation of the BERT model using Hugging Face and Keras, but unlike that example we will also fine-tune the model for our task.
Acknowledgements : https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
Steps to implement:
- Data preparation: tokenization and encoding of the data
- Configuring the TPU
- Building the model and the network
- Training the model and getting the results
# Loading Dependencies
import os
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from kaggle_datasets import KaggleDatasets
import transformers
from tokenizers import BertWordPieceTokenizer
# LOADING THE DATA
train1 = pd.read_csv("/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")
valid = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')
Encoder for the data. To understand what encode_batch does, read the Hugging Face tokenizer documentation here: https://huggingface.co/transformers/main_classes/tokenizer.html
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encoder for encoding the text into sequence of integers for BERT Input
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []

    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)
#IMP DATA FOR CONFIG
AUTO = tf.data.experimental.AUTOTUNE
# Configuration
EPOCHS = 3
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192
Tokenization
For details, please refer to the Hugging Face documentation again.
# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer
HBox(children=(FloatProgress(value=0.0, description='Downloading', max=995526.0, style=ProgressStyle(descripti…
Tokenizer(vocabulary_size=119547, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=False, wordpieces_prefix=##)
x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)
y_train = train1.toxic.values
y_valid = valid.toxic.values
100%|██████████| 874/874 [00:35<00:00, 24.35it/s]
100%|██████████| 32/32 [00:01<00:00, 20.87it/s]
100%|██████████| 250/250 [00:11<00:00, 22.06it/s]
train_dataset = (
tf.data.Dataset
.from_tensor_slices((x_train, y_train))
.repeat()
.shuffle(2048)
.batch(BATCH_SIZE)
.prefetch(AUTO)
)
valid_dataset = (
tf.data.Dataset
.from_tensor_slices((x_valid, y_valid))
.batch(BATCH_SIZE)
.cache()
.prefetch(AUTO)
)
test_dataset = (
tf.data.Dataset
.from_tensor_slices(x_test)
.batch(BATCH_SIZE)
)
def build_model(transformer, max_len=512):
    """
    function for training the BERT model
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)

    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model
Starting Training
%%time
with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)

    model.summary()
HBox(children=(FloatProgress(value=0.0, description='Downloading', max=618.0, style=ProgressStyle(description_…
HBox(children=(FloatProgress(value=0.0, description='Downloading', max=910749124.0, style=ProgressStyle(descri…
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_word_ids (InputLayer) [(None, 192)] 0
_________________________________________________________________
tf_distil_bert_model (TFDist ((None, 192, 768),) 134734080
_________________________________________________________________
tf_op_layer_strided_slice (T [(None, 768)] 0
_________________________________________________________________
dense (Dense) (None, 1) 769
=================================================================
Total params: 134,734,849
Trainable params: 134,734,849
Non-trainable params: 0
_________________________________________________________________
CPU times: user 34.4 s, sys: 13.3 s, total: 47.7 s
Wall time: 50.8 s
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
train_dataset,
steps_per_epoch=n_steps,
validation_data=valid_dataset,
epochs=EPOCHS
)
Train for 1746 steps, validate for 63 steps
Epoch 1/3
1746/1746 [==============================] - 255s 146ms/step - loss: 0.1221 - accuracy: 0.9517 - val_loss: 0.4484 - val_accuracy: 0.8479
Epoch 2/3
1746/1746 [==============================] - 198s 114ms/step - loss: 0.0908 - accuracy: 0.9634 - val_loss: 0.4769 - val_accuracy: 0.8491
Epoch 3/3
1746/1746 [==============================] - 198s 113ms/step - loss: 0.0775 - accuracy: 0.9680 - val_loss: 0.5522 - val_accuracy: 0.8500
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
valid_dataset.repeat(),
steps_per_epoch=n_steps,
epochs=EPOCHS*2
)
Train for 62 steps
Epoch 1/6
62/62 [==============================] - 18s 291ms/step - loss: 0.3244 - accuracy: 0.8613
Epoch 2/6
62/62 [==============================] - 25s 401ms/step - loss: 0.2354 - accuracy: 0.8955
Epoch 3/6
62/62 [==============================] - 7s 110ms/step - loss: 0.1718 - accuracy: 0.9252
Epoch 4/6
62/62 [==============================] - 7s 111ms/step - loss: 0.1210 - accuracy: 0.9492
Epoch 5/6
62/62 [==============================] - 7s 114ms/step - loss: 0.0798 - accuracy: 0.9686
Epoch 6/6
62/62 [==============================] - 7s 110ms/step - loss: 0.0765 - accuracy: 0.9696
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)
499/499 [==============================] - 41s 82ms/step
End Notes
Some useful reference materials:
1) Books
- https://d2l.ai/
- Jason Brownlee’s Books
2) Courses
- https://www.coursera.org/learn/nlp-sequence-models/home/welcome
- Fast.ai NLP Course
3) Blogs and websites
- Machine Learning Mastery
- https://distill.pub/
- http://jalammar.github.io/
**This notebook is a small effort to give back to the community; if it helped you in any way, please show some love by upvoting it**
Course instructors:
- 💡Trần Đức Mạnh