Improving Performance of LSTM on text classification (3-class problem)
My problem is a 3-class sentiment analysis classification problem with 4000 reviews of around 500 words each on average. The sentiment distribution of the dataset is 1800 negative, 1700 neutral and 500 positive. I am trying the following LSTM, but while searching for how to improve performance by changing the parameters I didn't find any specific rules on how to choose them; most of the answers I found were "it depends on the problem". As I am a newbie in deep learning, I don't really understand where to start. My model achieves around 63% accuracy, tested with k=5 cross-validation. Thank you in advance. This is the code I have so far:



# imports needed to run the snippet
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import ModelCheckpoint

data = pd.read_csv("nopreall.csv", header=0, encoding='UTF-8')
X = data['text']
Y = data['polarity']

# split train/test data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

batch_size = 64
epochs = 5
max_len = 500      # reviews are padded/truncated to this many tokens
max_words = 5000   # vocabulary size kept by the tokenizer

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(x_train)

x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

# create the model
embedding_vector_length = 64
model = Sequential()
model.add(Embedding(max_words, embedding_vector_length, input_length=max_len))
model.add(LSTM(100))
model.add(Dense(3, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# checkpoint the weights with the best validation accuracy
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

model.fit(x_train, y_train, validation_split=0.1, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)

# load the best saved weights
print("Loading Best Model Overall")
model.load_weights("weights.best.hdf5")
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# final evaluation of the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))

machine-learning deep-learning lstm sentiment-analysis text-classification

asked Nov 22 '18 at 23:59 – hiphopdance 444

  • A length of 500 is too big for an LSTM network. You have a large number of weights and very little data to train them. I think you are not using any pre-trained word embeddings, which increases the number of variables to train; you should try some relevant pre-trained embeddings (see the embedding sketch after this thread). I would suggest not using an LSTM at all, as the sequence length is too big and the data is quite small.

    – Biswadip Mandal
    Nov 23 '18 at 7:38

  • The length of 500 for each review is one I picked myself; I could pick a smaller length as well, so should I reduce it? I also have a question: are pre-trained word embeddings language-dependent? The language of the reviews is not English.

    – hiphopdance 444
    Nov 23 '18 at 10:18

  • Yes, pre-trained word embeddings are language-dependent. You can google and see if you can find any for your language. The maximum length should be chosen carefully: large enough that you don't lose any information, yet small enough that you end up with a reasonable LSTM network (see the sequence-length sketch after this thread).

    – Biswadip Mandal
    Nov 23 '18 at 11:24

  • I really appreciate your help, thank you. Here is the distribution of the lengths of my articles: for 2605 articles I have a mean length of 335 and a standard deviation of 348. Most of my articles (around 2000), though, are in the 0-500 word range. What length do you suggest I start with?

    – hiphopdance 444
    Nov 23 '18 at 12:36

  • Even 300 is too big a length for an LSTM. Though theoretically LSTMs are supposed to handle very long sequences, it doesn't really happen in practical scenarios. I am very keen to look at your data. You have a few options here: you can use two different networks, one that encodes each sentence and a second one that is fed the sentence encodings sequentially (see the hierarchical sketch after this thread). Go through this blog for more details: explosion.ai/blog/deep-learning-formula-nlp

    – Biswadip Mandal
    Nov 23 '18 at 12:54
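
Since the first comment suggests pre-trained word embeddings, here is a minimal sketch of how that could be wired into the code above. The file name "embeddings.vec", the fastText text format, and the 300-dimensional size are assumptions for illustration, not part of the original post; fastText does publish such vectors for many languages.

# Hedged sketch: load pre-trained word vectors into the Embedding layer.
# "embeddings.vec" is a hypothetical file in fastText text format
# (one word followed by its vector per line).
import numpy as np
from keras.layers import Embedding

embedding_dim = 300  # must match the dimensionality of the downloaded vectors

embeddings_index = {}
with open("embeddings.vec", encoding="utf-8") as f:
    for line in f:
        values = line.rstrip().split(" ")
        if len(values) <= 2:   # skip the fastText header line
            continue
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# build the weight matrix from the tokenizer's vocabulary (row 0 stays zero for padding)
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

# drop-in replacement for the Embedding layer in the model above,
# frozen so the small dataset does not overwrite the pre-trained vectors
embedding_layer = Embedding(max_words, embedding_dim,
                            weights=[embedding_matrix],
                            input_length=max_len,
                            trainable=False)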


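Because the thread debates which maximum length to use, a small sketch of deriving max_len from the length distribution instead of guessing; the 90th percentile is an arbitrary illustrative cutoff, not a recommendation from the comments.

# Hedged sketch: choose max_len from the data rather than fixing it at 500.
import numpy as np

lengths = np.array([len(text.split()) for text in X])   # X holds the raw review strings
print("mean length: %.1f, std: %.1f" % (lengths.mean(), lengths.std()))

max_len = int(np.percentile(lengths, 90))  # covers ~90% of reviews; longer ones get truncated
print("chosen max_len:", max_len)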


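The last comment proposes two networks, one per sentence and a second one over the sequence of sentences. A rough sketch of that idea in Keras follows; the sentence splitting and per-sentence padding are not shown, and max_sentences / max_sent_len are illustrative values only.

# Hedged sketch: a sentence-level encoder wrapped in TimeDistributed, feeding
# a document-level LSTM, as suggested in the comments. Inputs are expected as
# integer id arrays of shape (num_reviews, max_sentences, max_sent_len).
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed

max_sentences = 20   # sentences kept per review (illustrative)
max_sent_len = 30    # tokens kept per sentence (illustrative)

# network 1: encode a single sentence into a fixed-size vector
sent_input = Input(shape=(max_sent_len,), dtype="int32")
sent_emb = Embedding(max_words, 64)(sent_input)
sent_vec = LSTM(64)(sent_emb)
sentence_encoder = Model(sent_input, sent_vec)

# network 2: apply the encoder to every sentence, then run an LSTM over the
# resulting sequence of sentence vectors
doc_input = Input(shape=(max_sentences, max_sent_len), dtype="int32")
doc_sents = TimeDistributed(sentence_encoder)(doc_input)
doc_vec = LSTM(64)(doc_sents)
out = Dense(3, activation="softmax")(doc_vec)

hier_model = Model(doc_input, out)
hier_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])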
