Improving Performance of LSTM on text classification (3-class problem)

My problem is a 3-class sentiment analysis classification task with 4,000 reviews of around 500 words each on average. The sentiment distribution is 1,800 negative, 1,700 neutral and 500 positive. I am trying the following LSTM, but while searching for ways to improve performance by tuning the parameters I did not find any specific rules on how to choose them; most answers were "it depends on the problem". As a newcomer to deep learning, I don't really know where to start. My model achieves around 63% accuracy, tested with k=5 cross-validation. Thank you in advance. This is the code I have so far:



import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import ModelCheckpoint

data = pd.read_csv("nopreall.csv", header=0, encoding='UTF-8')
X = data['text']
Y = data['polarity']

x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=0) #split train/test data

batch_size = 64
epochs=5
max_len = 500
max_words=5000

tokenizer = Tokenizer(max_words)
tokenizer.fit_on_texts(x_train)

x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

# pad_sequences accepts the lists of word ids directly; converting the ragged
# lists with np.array first is unnecessary (and errors on recent NumPy versions)
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

y_train = np.array(y_train)
y_test = np.array(y_test)  # convert labels to arrays (the original `temp` variable was never used)

# create the model
embedding_vector_length = 64
model = Sequential()
model.add(Embedding(max_words, embedding_vector_length, input_length=max_len))
model.add(LSTM(100))
model.add(Dense(3, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

model.fit(x_train, y_train, validation_split=0.1, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)

#load the saved model
print("Loading Best Model Overall")
model.load_weights("weights.best.hdf5")
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

#Final evaluation of the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))









machine-learning deep-learning lstm sentiment-analysis text-classification

asked Nov 22 '18 at 23:59 by hiphopdance 444

  • A length of 500 is too big for an LSTM network. You have a large number of weights and very little data to train them. I think you are not using any pre-trained word embedding, which increases the number of variables to train. You should try a relevant pre-trained embedding. I would also suggest not using an LSTM here, since the sequence length is large and the dataset is quite small.

    – Biswadip Mandal
    Nov 23 '18 at 7:38
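
    In Keras, plugging a pre-trained embedding into the existing tokenizer could look roughly like the sketch below. The file name "embeddings.vec" and the 300-dimension size are assumptions; fastText, for example, publishes word vectors in this plain-text format for many languages.

    emb_dim = 300
    embeddings_index = {}
    with open('embeddings.vec', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) <= 2:            # skip a possible "count dim" header line
                continue
            embeddings_index[parts[0]] = np.asarray(parts[1:], dtype='float32')

    # rows are indexed by the tokenizer's word ids (index 0 is reserved for padding)
    embedding_matrix = np.zeros((max_words, emb_dim))
    for word, i in tokenizer.word_index.items():
        if i < max_words and word in embeddings_index:
            embedding_matrix[i] = embeddings_index[word]

    model.add(Embedding(max_words, emb_dim,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))   # keep the pre-trained vectors fixed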













  • The length of 500 for each review is one I picked myself; I could pick a smaller length as well, so should I reduce it? I also have a question: are pre-trained word embeddings language-dependent? The language of the reviews is not English.

    – hiphopdance 444
    Nov 23 '18 at 10:18











  • Yes, pre-trained word embeddings are language-dependent. You can search and see whether any exist for your language. The maximum length should be chosen carefully: large enough that you don't lose information, but small enough that the LSTM network stays a reasonable size.

    – Biswadip Mandal
    Nov 23 '18 at 11:24
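
    One way to choose max_len "carefully" is to look at the token-length distribution after tokenisation and cover, say, 90-95% of the reviews rather than the longest one. A small sketch using the data already loaded above:

    # lengths of the integer sequences produced by the fitted tokenizer
    seq_lengths = [len(s) for s in tokenizer.texts_to_sequences(X)]
    print(np.percentile(seq_lengths, [50, 90, 95]))   # median and upper percentiles

    max_len = int(np.percentile(seq_lengths, 90))     # e.g. cover ~90% of reviews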











  • I really appreciate your help, thank you. This is the distribution of the lengths of my articles: for 2,605 articles I have a mean length of 335 and a standard deviation of 348. Most of my articles (around 2,000), though, are in the 0-500 length range. What length do you suggest I start with?

    – hiphopdance 444
    Nov 23 '18 at 12:36








  • Even 300 is too big a length for an LSTM. Although LSTMs are in theory supposed to handle very long sequences, it doesn't really work out in practical scenarios. I am very keen to look at your data. You have a few options here. You can use two different networks: one that encodes each sentence, and a second one that is fed the sentence representations sequentially. Go through this blog for more details: explosion.ai/blog/deep-learning-formula-nlp

    – Biswadip Mandal
    Nov 23 '18 at 12:54
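
    One possible reading of the "two networks" idea, as a rough sketch only: a small encoder turns each sentence into a vector, and a second LSTM reads the sequence of sentence vectors. The limits max_sents and max_sent_len are made-up values, and the input would first have to be reshaped into a (documents, sentences, tokens) array.

    from keras.models import Model
    from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed

    max_sents = 20        # assumed cap on sentences per review
    max_sent_len = 40     # assumed cap on tokens per sentence

    # sentence-level encoder: token ids -> one vector per sentence
    sent_input = Input(shape=(max_sent_len,))
    sent_vec = LSTM(64)(Embedding(max_words, 64)(sent_input))
    sentence_encoder = Model(sent_input, sent_vec)

    # document-level model: sequence of sentence vectors -> 3-way softmax
    doc_input = Input(shape=(max_sents, max_sent_len))
    doc_vec = LSTM(64)(TimeDistributed(sentence_encoder)(doc_input))
    output = Dense(3, activation='softmax')(doc_vec)

    doc_model = Model(doc_input, output)
    doc_model.compile(loss='sparse_categorical_crossentropy',
                      optimizer='adam', metrics=['accuracy'])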