Improving Performance of LSTM on text classification (3-class problem)
My problem is a 3-class sentiment analysis classification problem with 4000 reviews of roughly 500 words each on average. The sentiment distribution of the dataset is 1800 negative, 1700 neutral, and 500 positive. I am trying the following LSTM, but while searching for how to improve performance by changing the parameters I did not find any specific rules on how to choose them; most of the answers I found were "it depends on the problem". As I am a newbie in deep learning, I don't really understand where to start. My model achieves around 63% accuracy, tested with k=5 cross-validation. Thank you in advance. This is the code I have so far:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import ModelCheckpoint

# Load the reviews and their polarity labels
data = pd.read_csv("nopreall.csv", header=0, encoding='UTF-8')
X = data['text']
Y = data['polarity']

# Hold out 20% of the data for the final evaluation
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

batch_size = 64
epochs = 5
max_len = 500      # maximum review length (in tokens) after padding/truncation
max_words = 5000   # vocabulary size kept by the tokenizer

# Fit the tokenizer on the training texts only, then convert both splits to integer sequences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(x_train)
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

# Pad/truncate every sequence to max_len
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
y_train = np.array(y_train)
y_test = np.array(y_test)

# Create the model: embedding -> single LSTM layer -> 3-way softmax
embedding_vector_length = 64
model = Sequential()
model.add(Embedding(max_words, embedding_vector_length, input_length=max_len))
model.add(LSTM(100))
model.add(Dense(3, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Checkpoint the weights with the best validation accuracy seen during training
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
model.fit(x_train, y_train, validation_split=0.1, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)

# Load the best checkpoint and evaluate on the held-out test set
print("Loading Best Model Overall")
model.load_weights("weights.best.hdf5")
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))
machine-learning deep-learning lstm sentiment-analysis text-classification
A length of 500 is too big for an LSTM network. You have a large number of weights and very little data to train them. I think you are not using any pre-trained word embeddings, which increases the number of variables to train. You should try a relevant pre-trained embedding. I would also suggest not using an LSTM at all, since the sequence length is so long and the data is quite limited.
– Biswadip Mandal
Nov 23 '18 at 7:38
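A minimal sketch of the pre-trained embedding idea mentioned above, wired into the existing model. It assumes word vectors are available as a plain-text file with one word and its vector per line (the GloVe/fastText download format); the file name and vector dimension here are hypothetical placeholders, not something from the thread:

import numpy as np
from keras.layers import Embedding

embedding_dim = 300                      # must match the pre-trained vectors (assumption)
embeddings_index = {}
with open("pretrained_vectors.txt", encoding="utf-8") as f:   # hypothetical vectors file
    for line in f:
        values = line.rstrip().split(" ")
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype="float32")

# Build a (max_words, embedding_dim) matrix aligned with the tokenizer's word index
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

# Frozen embedding layer initialised with the pre-trained vectors;
# this would replace the randomly initialised Embedding layer in the question's model
embedding_layer = Embedding(max_words, embedding_dim,
                            weights=[embedding_matrix],
                            input_length=max_len,
                            trainable=False)

Freezing the layer (trainable=False) keeps the number of trainable parameters small, which is the point of the suggestion when only ~4000 labelled reviews are available.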
The length of 500 for each review is one I picked myself; I could pick a smaller length as well, so should I reduce it? Also, are pre-trained word embeddings language-dependent? The language of the reviews is not English.
– hiphopdance 444
Nov 23 '18 at 10:18
Yes, pre-trained word embeddings are language-dependent. You can search and see if any exist for your language. The maximum length should be chosen carefully: large enough that you don't lose much information, and small enough that the LSTM network stays a reasonable size.
– Biswadip Mandal
Nov 23 '18 at 11:24
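One common way to pick that trade-off is to look at the distribution of tokenized review lengths and cap max_len at a high percentile rather than the maximum. A small sketch, assuming the tokenizer from the question has already been fitted and data['text'] still holds the raw reviews:

import numpy as np

# Lengths of the tokenized reviews, measured before padding/truncation
lengths = [len(seq) for seq in tokenizer.texts_to_sequences(data['text'])]

# Inspect a few percentiles, then cap at one that covers most reviews
print(np.percentile(lengths, [50, 90, 95, 99]))
max_len = int(np.percentile(lengths, 95))   # e.g. keep ~95% of reviews untruncated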
I really appreciate your help, thank you. This is the distribution of the lengths of my articles: for 2605 articles I have a mean length of 335 and a standard deviation of 348. Most of my articles (around 2000), though, are in the 0-500 range. What length do you suggest I start with?
– hiphopdance 444
Nov 23 '18 at 12:36
Even 300 is too big a length for an LSTM. Though theoretically LSTMs are supposed to handle very long sequences, that doesn't really happen in practical scenarios. I am very keen to look at your data. You have a few options here. You could use two different networks: one to encode each sentence, and a second one that is fed the sentence encodings sequentially. Go through this blog for more details: explosion.ai/blog/deep-learning-formula-nlp
– Biswadip Mandal
Nov 23 '18 at 12:54
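A minimal sketch of that two-level idea: a sentence-level encoder wrapped in TimeDistributed, followed by a document-level LSTM over the sentence encodings. The sentence/document limits and layer sizes below are illustrative assumptions, not values from the thread, and each review would have to be pre-split into sentences and padded to shape (max_sents, max_sent_len):

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed

max_sents = 20        # max sentences per review (assumption)
max_sent_len = 30     # max tokens per sentence (assumption)
max_words = 5000
embedding_dim = 64

# Sentence-level encoder: one LSTM turns a sentence into a fixed-size vector
sent_input = Input(shape=(max_sent_len,), dtype='int32')
x = Embedding(max_words, embedding_dim)(sent_input)
x = LSTM(64)(x)
sentence_encoder = Model(sent_input, x)

# Document-level model: encode each sentence, then run an LSTM over the sentence vectors
doc_input = Input(shape=(max_sents, max_sent_len), dtype='int32')
y = TimeDistributed(sentence_encoder)(doc_input)
y = LSTM(64)(y)
output = Dense(3, activation='softmax')(y)

model = Model(doc_input, output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

The benefit is that each LSTM only ever sees short sequences (30 tokens per sentence, 20 sentences per document) instead of a single 300-500 token sequence.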