How to avoid loading a large file into a python script repeatedly?
I've written a python script to take a large file (a matrix ~50k rows X ~500 cols) and use it as a dataset to train a random forest model.



My script has two functions, one to load the dataset and the other to train the random forest model using said data. These both work fine, but loading the file takes ~45 seconds, and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file-loading code:



import io
import numpy as np

def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)

    return (train_id_list, train_val_array)


This returns a list with col. 9 as the labels and a numpy array with cols. 12-end as the data to train the random forest.



I am going to train many different forms of my model with the same data, so I just want to load the file one time and have it available to feed into my random forest function. I want the file to be an object, I think (I am fairly new to python).
  • I believe that if you run in the python console, you could load the file once and then load other files / call functions separately, without having to reload the file. – user985366, Jun 23 '15 at 22:47

  • You should have a look at the pandas library for data handling. Manipulating data with it is child's play, and you will be able to grasp it fairly quickly if you have used R before. Specifically, look at the read_xxx functions in the documentation, which let you load different file formats into a dataframe. – Lakshay Garg, Jun 23 '15 at 22:53
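Following that suggestion, a minimal sketch of a pandas-based loader (the function name is hypothetical; it assumes a tab-separated file with a header row, and mirrors the column positions used in the question's code):

```python
# Hypothetical pandas-based loader: read the tab-separated training
# matrix in one call, then split off label and feature columns by
# position, mirroring load_train_data in the question.
import pandas as pd

def load_train_data_pandas(train_file):
    df = pd.read_csv(train_file, sep="\t")
    train_ids = df.iloc[:, 9].tolist()                  # column 9: labels
    train_vals = df.iloc[:, 12:].to_numpy(dtype=float)  # columns 12..: features
    return train_ids, train_vals
```

This does not by itself avoid reloading, but pandas' C parser is typically much faster than a per-line python loop on a 50k x 500 matrix.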
python object large-file-upload
asked Jun 23 '15 at 22:42 by Brandon Kieft
4 Answers
If I understand you correctly, the data set does not change, but the model parameters do change, and you are changing the parameters after each run.

I would put the file-loading code in one file and run it in the python interpreter. The file will then be loaded and kept in memory under whatever variable you use.

Then you can import another file with your model code and run it with the training data as an argument.

If all your model changes can be expressed as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.

If you need to change the model code between runs, save it under a new filename, import that one, and run again with the same source data.

If you don't want to save each model modification with a new filename, you might be able to use the reload functionality depending on python version, but it is not recommended (see Proper way to reload a python module from the console).

– user985366, answered Jun 23 '15 at 22:55 (edited May 23 '17 at 12:14)
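As a sketch of that workflow (the module and function names are hypothetical, and scikit-learn's RandomForestClassifier is one plausible choice for the model):

```python
# rf_model.py -- hypothetical module holding only the training code.
# Keeping it separate from the data-loading step means it can be
# imported in the interpreter without touching the already-loaded data.
from sklearn.ensemble import RandomForestClassifier

def train(ids, values, n_estimators=100, max_depth=None):
    """Train one random forest variant on the already-loaded data."""
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth)
    model.fit(values, ids)
    return model
```

In the interpreter you would then load once (`ids, vals = load_train_data("train.txt")`), `import rf_model`, and call `rf_model.train(ids, vals, n_estimators=50)` as many times as needed without reloading.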
The simplest way would be to cache the results, like so:

    _train_data_cache = {}
    def load_cached_train_data(train_file):
        if train_file not in _train_data_cache:
            _train_data_cache[train_file] = load_train_data(train_file)
        return _train_data_cache[train_file]

– Malvolio, answered Jun 23 '15 at 22:47
Try to learn about Python data serialization. You would basically be storing the large file as a python-specific, serialized binary object using python's marshal module. This would drastically speed up IO of the file. See these benchmarks for performance variations. However, if these random forest models are all trained at the same time, then you could just train them against the dataset you already have in memory and release the training data after completion.

– umbreon222, answered Jun 23 '15 at 22:51
Load your data in ipython.

    my_data = open("data.txt")

Write your code in a python script, say example.py, which uses this data. At the top of example.py add these lines:

    import sys

    args = sys.argv
    data = args[1]
    ...

Now run the script in ipython:

    %run example.py $my_data

Now, when running your python script, you don't need to load the data multiple times.

– Mehdi Rostami, answered Nov 22 '18 at 23:37 (edited Nov 23 '18 at 3:08 by AS Mackay)