Efficiently find an integer not in a set of size 40, 400, or 4000












22














Related to the classic problem find an integer not among four billion given ones but not exactly the same.



To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].



Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?



If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.



I am more in favor of simple, non-random approaches. Is there any?



Thank you!



Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.










share|improve this question




















  • 2




    sorting the vector can be done in O(n log(n)), not sure if you can get anything more efficient
    – user463035818
    Dec 18 at 14:41








  • 2




    Is the vector sorted? For the set it is trivial since it is sorted.
    – NathanOliver
    Dec 18 at 14:41






  • 1




    @user463035818 if you can sort a vector in O(log(n)), you should become very rich indeed.
    – Walter
    Dec 18 at 14:43






  • 1




    @Walter meh it was a typo, no millions for me unfortunately
    – user463035818
    Dec 18 at 14:44






  • 1




    Once sorted, you do: Last + 1?
    – JVApen
    Dec 18 at 14:47
















22














Related to the classic problem find an integer not among four billion given ones but not exactly the same.



To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].



Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?



If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.



I am more in favor of simple, non-random approaches. Is there any?



Thank you!



Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.










share|improve this question




















  • 2




    sorting the vector can be done in O(n log(n)), not sure if you can get anything more efficient
    – user463035818
    Dec 18 at 14:41








  • 2




    Is the vector sorted? For the set it is trivial since it is sorted.
    – NathanOliver
    Dec 18 at 14:41






  • 1




    @user463035818 if you can sort a vector in O(log(n)), you should become very rich indeed.
    – Walter
    Dec 18 at 14:43






  • 1




    @Walter meh it was a typo, no millions for me unfortunately
    – user463035818
    Dec 18 at 14:44






  • 1




    Once sorted, you do: Last + 1?
    – JVApen
    Dec 18 at 14:47














22












22








22


7





Related to the classic problem find an integer not among four billion given ones but not exactly the same.



To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].



Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?



If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.



I am more in favor of simple, non-random approaches. Is there any?



Thank you!



Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.










share|improve this question















Related to the classic problem find an integer not among four billion given ones but not exactly the same.



To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].



Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?



If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.



I am more in favor of simple, non-random approaches. Is there any?



Thank you!



Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.







c++ algorithm






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 18 at 15:17

























asked Dec 18 at 14:37









fleix

40610




40610








  • 2




    sorting the vector can be done in O(n log(n)), not sure if you can get anything more efficient
    – user463035818
    Dec 18 at 14:41








  • 2




    Is the vector sorted? For the set it is trivial since it is sorted.
    – NathanOliver
    Dec 18 at 14:41






  • 1




    @user463035818 if you can sort a vector in O(log(n)), you should become very rich indeed.
    – Walter
    Dec 18 at 14:43






  • 1




    @Walter meh it was a typo, no millions for me unfortunately
    – user463035818
    Dec 18 at 14:44






  • 1




    Once sorted, you do: Last + 1?
    – JVApen
    Dec 18 at 14:47














  • 2




    sorting the vector can be done in O(n log(n)), not sure if you can get anything more efficient
    – user463035818
    Dec 18 at 14:41








  • 2




    Is the vector sorted? For the set it is trivial since it is sorted.
    – NathanOliver
    Dec 18 at 14:41






  • 1




    @user463035818 if you can sort a vector in O(log(n)), you should become very rich indeed.
    – Walter
    Dec 18 at 14:43






  • 1




    @Walter meh it was a typo, no millions for me unfortunately
    – user463035818
    Dec 18 at 14:44






  • 1




    Once sorted, you do: Last + 1?
    – JVApen
    Dec 18 at 14:47








2




2




sorting the vector can be done in O(n log(n)), not sure if you can get anything more efficient
– user463035818
Dec 18 at 14:41






sorting the vector can be done in O(n log(n)), not sure if you can get anything more efficient
– user463035818
Dec 18 at 14:41






2




2




Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 at 14:41




Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 at 14:41




1




1




@user463035818 if you can sort a vector in O(log(n)), you should become very rich indeed.
– Walter
Dec 18 at 14:43




@user463035818 if you can sort a vector in O(log(n)), you should become very rich indeed.
– Walter
Dec 18 at 14:43




1




1




@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 at 14:44




@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 at 14:44




1




1




Once sorted, you do: Last + 1?
– JVApen
Dec 18 at 14:47




Once sorted, you do: Last + 1?
– JVApen
Dec 18 at 14:47












6 Answers
6






active

oldest

votes


















27














You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.



 int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}


Random methods can be more space efficient, but may require more passes over the data in the worst case.






share|improve this answer



















  • 2




    I like this. It avoids sorting and searching.
    – fleix
    Dec 18 at 15:50






  • 3




    This is the standard algorithm for finding the smallest integer not in a given set
    – DreamConspiracy
    Dec 18 at 22:20










  • @DreamConspiracy That makes sense. I didn't know that, though.
    – Walter
    Dec 19 at 0:39






  • 1




    Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a 1 so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known to int, which may not be worth it for large N. With large enough N, you'd want to use a bitmap instead of a char array.)
    – Peter Cordes
    Dec 19 at 3:59












  • The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized strlen. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
    – Peter Cordes
    Dec 19 at 4:03



















7














Random methods are indeed very efficient here.



If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.



For each i in the range, we just set x[array[i] modulo m] = 1.



Then a simple O(n) search in x will provide a value which is not in array



Note: the modulo operation is not exactly the "%" operation



Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation






share|improve this answer































    5














    You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)



    size_t smallest_unused(std::vector<unsigned>& data) {
    size_t N = data.size(), scan = 0;
    while (scan < N) {
    auto other = data[scan];
    if (other < scan && data[other] != other) {
    data[scan] = data[other];
    data[other] = other;
    }
    else
    ++scan;
    }
    for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
    return scan;
    }


    The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.



    That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.



    To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).





    Notes:




    1. I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.






    share|improve this answer























    • To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
      – Taemyr
      Dec 19 at 14:13










    • @taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
      – rici
      Dec 19 at 14:18





















    3














    Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).






    share|improve this answer

















    • 2




      He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
      – dquijada
      Dec 18 at 14:54






    • 1




      @dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
      – Acorn
      Dec 18 at 15:15








    • 1




      @Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
      – Bakuriu
      Dec 18 at 23:31






    • 1




      @Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
      – Acorn
      Dec 18 at 23:35








    • 3




      Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
      – Acorn
      Dec 18 at 23:42





















    1














    Step 1: Sort the vector.



    That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.



    Step 2: Find the first int not in the vector.



    Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:



    Pseudocode:



    SIZE = 40|400|4000 // The one you are using
    for (int i = 0; i < SIZE; i++) {
    if (array[i] != INT_MIN + i)
    return INT_MIN + i;


    The solution would be O(n log(n) + n) meaning: O(n log(n))





    Edit: just read your edit asking for something better than O(n log(n)), sorry.






    share|improve this answer























    • Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem array[i] would make any sense when i starts from INT_MIN.
      – fleix
      Dec 18 at 15:10












    • You are right, @fleix, it's fixed now
      – dquijada
      Dec 18 at 15:14








    • 1




      You can do the search using binary search in O(log N).
      – rici
      Dec 18 at 20:25






    • 1




      @rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in O(n / max_n) time on average, but worst case n (no gaps until the end.)
      – Peter Cordes
      Dec 18 at 23:50






    • 2




      @peter: the same way you do a linear search, compare i + min with vec[i]. If they are unequal, there is a missing value before i, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
      – rici
      Dec 19 at 2:05





















    0














    For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.



    The space complexity of this approach would be O(1).





    If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:



    int find_lowest_not_contained(const std::unordered_set<int>& set) {
    for (auto i = std::numeric_limits<int>::min(); ; ++i) {
    auto it = set.find(i); // search in set
    if (it == set.end()) // integer not in set?
    return *it;
    }
    }


    Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:



    int find_greatest_not_contained(const std::unordered_set<int>& set) {
    for (auto i = std::numeric_limits<int>::max(); ; --i) {
    auto it = set.find(i); // search in set
    if (it == set.end()) // integer not in set?
    return *it;
    }
    }




    Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).



    Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.






    share|improve this answer























      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53835354%2fefficiently-find-an-integer-not-in-a-set-of-size-40-400-or-4000%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      6 Answers
      6






      active

      oldest

      votes








      6 Answers
      6






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      27














      You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.



       int find_not_contained(container<int> const&data)
      {
      const int N=data.size();
      std::vector<char> known(N+1, 0); // one more candidates than data
      for(int i=0; i< N; ++i)
      if(data[i]>=0 && data[i]<=N)
      known[data[i]]=1;
      for(int i=0; i<=N; ++i)
      if(!known[i])
      return i;
      assert(false); // should never be reached.
      }


      Random methods can be more space efficient, but may require more passes over the data in the worst case.






      share|improve this answer



















      • 2




        I like this. It avoids sorting and searching.
        – fleix
        Dec 18 at 15:50






      • 3




        This is the standard algorithm for finding the smallest integer not in a given set
        – DreamConspiracy
        Dec 18 at 22:20










      • @DreamConspiracy That makes sense. I didn't know that, though.
        – Walter
        Dec 19 at 0:39






      • 1




        Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a 1 so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known to int, which may not be worth it for large N. With large enough N, you'd want to use a bitmap instead of a char array.)
        – Peter Cordes
        Dec 19 at 3:59












      • The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized strlen. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
        – Peter Cordes
        Dec 19 at 4:03
















      27














      You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.



       int find_not_contained(container<int> const&data)
      {
      const int N=data.size();
      std::vector<char> known(N+1, 0); // one more candidates than data
      for(int i=0; i< N; ++i)
      if(data[i]>=0 && data[i]<=N)
      known[data[i]]=1;
      for(int i=0; i<=N; ++i)
      if(!known[i])
      return i;
      assert(false); // should never be reached.
      }


      Random methods can be more space efficient, but may require more passes over the data in the worst case.






      share|improve this answer



















      • 2




        I like this. It avoids sorting and searching.
        – fleix
        Dec 18 at 15:50






      • 3




        This is the standard algorithm for finding the smallest integer not in a given set
        – DreamConspiracy
        Dec 18 at 22:20










      • @DreamConspiracy That makes sense. I didn't know that, though.
        – Walter
        Dec 19 at 0:39






      • 1




        Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a 1 so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known to int, which may not be worth it for large N. With large enough N, you'd want to use a bitmap instead of a char array.)
        – Peter Cordes
        Dec 19 at 3:59












      • The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized strlen. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
        – Peter Cordes
        Dec 19 at 4:03














      27












      27








      27






      You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.



       int find_not_contained(container<int> const&data)
      {
      const int N=data.size();
      std::vector<char> known(N+1, 0); // one more candidates than data
      for(int i=0; i< N; ++i)
      if(data[i]>=0 && data[i]<=N)
      known[data[i]]=1;
      for(int i=0; i<=N; ++i)
      if(!known[i])
      return i;
      assert(false); // should never be reached.
      }


      Random methods can be more space efficient, but may require more passes over the data in the worst case.






      share|improve this answer














      You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.



       int find_not_contained(container<int> const&data)
      {
      const int N=data.size();
      std::vector<char> known(N+1, 0); // one more candidates than data
      for(int i=0; i< N; ++i)
      if(data[i]>=0 && data[i]<=N)
      known[data[i]]=1;
      for(int i=0; i<=N; ++i)
      if(!known[i])
      return i;
      assert(false); // should never be reached.
      }


      Random methods can be more space efficient, but may require more passes over the data in the worst case.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Dec 18 at 15:34

























      answered Dec 18 at 15:15









      Walter

      27k1375151




      27k1375151








      • 2




        I like this. It avoids sorting and searching.
        – fleix
        Dec 18 at 15:50






      • 3




        This is the standard algorithm for finding the smallest integer not in a given set
        – DreamConspiracy
        Dec 18 at 22:20










      • @DreamConspiracy That makes sense. I didn't know that, though.
        – Walter
        Dec 19 at 0:39






      • 1




        Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a 1 so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known to int, which may not be worth it for large N. With large enough N, you'd want to use a bitmap instead of a char array.)
        – Peter Cordes
        Dec 19 at 3:59












      • The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized strlen. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
        – Peter Cordes
        Dec 19 at 4:03














      • 2




        I like this. It avoids sorting and searching.
        – fleix
        Dec 18 at 15:50






      • 3




        This is the standard algorithm for finding the smallest integer not in a given set
        – DreamConspiracy
        Dec 18 at 22:20










      • @DreamConspiracy That makes sense. I didn't know that, though.
        – Walter
        Dec 19 at 0:39






      • 1




        Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a 1 so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known to int, which may not be worth it for large N. With large enough N, you'd want to use a bitmap instead of a char array.)
        – Peter Cordes
        Dec 19 at 3:59












      • The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized strlen. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
        – Peter Cordes
        Dec 19 at 4:03








      2




      2




      I like this. It avoids sorting and searching.
      – fleix
      Dec 18 at 15:50




      I like this. It avoids sorting and searching.
      – fleix
      Dec 18 at 15:50




      3




      3




      This is the standard algorithm for finding the smallest integer not in a given set
      – DreamConspiracy
      Dec 18 at 22:20




      This is the standard algorithm for finding the smallest integer not in a given set
      – DreamConspiracy
      Dec 18 at 22:20












      @DreamConspiracy That makes sense. I didn't know that, though.
      – Walter
      Dec 19 at 0:39




      @DreamConspiracy That makes sense. I didn't know that, though.
      – Walter
      Dec 19 at 0:39




      1




      1




      Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a 1 so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known to int, which may not be worth it for large N. With large enough N, you'd want to use a bitmap instead of a char array.)
      – Peter Cordes
      Dec 19 at 3:59






      Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a 1 so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known to int, which may not be worth it for large N. With large enough N, you'd want to use a bitmap instead of a char array.)
      – Peter Cordes
      Dec 19 at 3:59














      The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized strlen. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
      – Peter Cordes
      Dec 19 at 4:03




      The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized strlen. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
      – Peter Cordes
      Dec 19 at 4:03













      7














      Random methods are indeed very efficient here.



      If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.



      For each i in the range, we just set x[array[i] modulo m] = 1.



      Then a simple O(n) search in x will provide a value which is not in array



      Note: the modulo operation is not exactly the "%" operation



      Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation






      share|improve this answer




























        7














        Random methods are indeed very efficient here.



        If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.



        For each i in the range, we just set x[array[i] modulo m] = 1.



        Then a simple O(n) search in x will provide a value which is not in array



        Note: the modulo operation is not exactly the "%" operation



        Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation






        share|improve this answer


























          7












          7








          7






          Random methods are indeed very efficient here.



          If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.



          For each i in the range, we just set x[array[i] modulo m] = 1.



          Then a simple O(n) search in x will provide a value which is not in array



          Note: the modulo operation is not exactly the "%" operation



          Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation






          share|improve this answer














          Random methods are indeed very efficient here.



          If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.



          For each i in the range, we just set x[array[i] modulo m] = 1.



          Then a simple O(n) search in x will provide a value which is not in array



          Note: the modulo operation is not exactly the "%" operation



          Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Dec 18 at 15:30

























          answered Dec 18 at 15:12









          Damien

          59415




          59415























              5














              You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)



              size_t smallest_unused(std::vector<unsigned>& data) {
              size_t N = data.size(), scan = 0;
              while (scan < N) {
              auto other = data[scan];
              if (other < scan && data[other] != other) {
              data[scan] = data[other];
              data[other] = other;
              }
              else
              ++scan;
              }
              for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
              return scan;
              }


              The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.



              That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.



              To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).





              Notes:




              1. I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.






              share|improve this answer























              • To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
                – Taemyr
                Dec 19 at 14:13










              • @taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
                – rici
                Dec 19 at 14:18


















              5














              You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)



              size_t smallest_unused(std::vector<unsigned>& data) {
              size_t N = data.size(), scan = 0;
              while (scan < N) {
              auto other = data[scan];
              if (other < scan && data[other] != other) {
              data[scan] = data[other];
              data[other] = other;
              }
              else
              ++scan;
              }
              for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
              return scan;
              }


              The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.



              That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.



              To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).





              Notes:




              1. I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.






              share|improve this answer























              • To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
                – Taemyr
                Dec 19 at 14:13










              • @taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
                – rici
                Dec 19 at 14:18
















              5












              5








              5






              You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)



              size_t smallest_unused(std::vector<unsigned>& data) {
              size_t N = data.size(), scan = 0;
              while (scan < N) {
              auto other = data[scan];
              if (other < scan && data[other] != other) {
              data[scan] = data[other];
              data[other] = other;
              }
              else
              ++scan;
              }
              for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
              return scan;
              }


              The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.



              That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.



              To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).





              Notes:




              1. I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.






              share|improve this answer














              You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)



              size_t smallest_unused(std::vector<unsigned>& data) {
              size_t N = data.size(), scan = 0;
              while (scan < N) {
              auto other = data[scan];
              if (other < scan && data[other] != other) {
              data[scan] = data[other];
              data[other] = other;
              }
              else
              ++scan;
              }
              for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
              return scan;
              }


              The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.



              That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.



              To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).





              Notes:




              1. I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Dec 19 at 2:18

























              answered Dec 18 at 20:13









              rici

              151k19131196




              151k19131196












              • To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
                – Taemyr
                Dec 19 at 14:13










              • @taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
                – rici
                Dec 19 at 14:18




















              • To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
                – Taemyr
                Dec 19 at 14:13










              • @taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
                – rici
                Dec 19 at 14:18


















              To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
              – Taemyr
              Dec 19 at 14:13




              To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
              – Taemyr
              Dec 19 at 14:13












              @taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
              – rici
              Dec 19 at 14:18






              @taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
              – rici
              Dec 19 at 14:18













              3














              Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).






              share|improve this answer

















              • 2




                He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
                – dquijada
                Dec 18 at 14:54






              • 1




                @dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
                – Acorn
                Dec 18 at 15:15








              • 1




                @Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
                – Bakuriu
                Dec 18 at 23:31






              • 1




                @Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
                – Acorn
                Dec 18 at 23:35








              • 3




                Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
                – Acorn
                Dec 18 at 23:42


















              3














              Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).






              share|improve this answer

















              • 2




                He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
                – dquijada
                Dec 18 at 14:54






              • 1




                @dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
                – Acorn
                Dec 18 at 15:15








              • 1




                @Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
                – Bakuriu
                Dec 18 at 23:31






              • 1




                @Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
                – Acorn
                Dec 18 at 23:35








              • 3




                Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
                – Acorn
                Dec 18 at 23:42
















              3












              3








              3






              Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).






              share|improve this answer












              Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Dec 18 at 14:52









              Alexey Birukov

              465213




              465213








              • 2




                He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
                – dquijada
                Dec 18 at 14:54






              • 1




                @dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
                – Acorn
                Dec 18 at 15:15








              • 1




                @Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
                – Bakuriu
                Dec 18 at 23:31






              • 1




                @Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
                – Acorn
                Dec 18 at 23:35








              • 3




                Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
                – Acorn
                Dec 18 at 23:42
















              • 2




                He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
                – dquijada
                Dec 18 at 14:54






              • 1




                @dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
                – Acorn
                Dec 18 at 15:15








              • 1




                @Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
                – Bakuriu
                Dec 18 at 23:31






              • 1




                @Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
                – Acorn
                Dec 18 at 23:35








              • 3




                Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
                – Acorn
                Dec 18 at 23:42










              2




              2




              He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
              – dquijada
              Dec 18 at 14:54




              He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
              – dquijada
              Dec 18 at 14:54




              1




              1




              @dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
              – Acorn
              Dec 18 at 15:15






              @dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
              – Acorn
              Dec 18 at 15:15






              1




              1




              @Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
              – Bakuriu
              Dec 18 at 23:31




              @Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
              – Bakuriu
              Dec 18 at 23:31




              1




              1




              @Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
              – Acorn
              Dec 18 at 23:35






              @Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
              – Acorn
              Dec 18 at 23:35






              3




              3




              Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
              – Acorn
              Dec 18 at 23:42






              Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
              – Acorn
              Dec 18 at 23:42













              1














              Step 1: Sort the vector.



              That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.



              Step 2: Find the first int not in the vector.



              Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:



              Pseudocode:



              SIZE = 40|400|4000 // The one you are using
              for (int i = 0; i < SIZE; i++) {
              if (array[i] != INT_MIN + i)
              return INT_MIN + i;


              The solution would be O(n log(n) + n) meaning: O(n log(n))





              Edit: just read your edit asking for something better than O(n log(n)), sorry.






              share|improve this answer























              • Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem array[i] would make any sense when i starts from INT_MIN.
                – fleix
                Dec 18 at 15:10












              • You are right, @fleix, it's fixed now
                – dquijada
                Dec 18 at 15:14








              • 1




                You can do the search using binary search in O(log N).
                – rici
                Dec 18 at 20:25






              • 1




                @rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in O(n / max_n) time on average, but worst case n (no gaps until the end.)
                – Peter Cordes
                Dec 18 at 23:50






              • 2




                @peter: the same way you do a linear search, compare i + min with vec[i]. If they are unequal, there is a missing value before i, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
                – rici
                Dec 19 at 2:05


















              1














              Step 1: Sort the vector.



              That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.



              Step 2: Find the first int not in the vector.



              Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:



              Pseudocode:



              SIZE = 40|400|4000 // The one you are using
              for (int i = 0; i < SIZE; i++) {
              if (array[i] != INT_MIN + i)
              return INT_MIN + i;


              The solution would be O(n log(n) + n) meaning: O(n log(n))





              Edit: just read your edit asking for something better than O(n log(n)), sorry.






              share|improve this answer























              • Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem array[i] would make any sense when i starts from INT_MIN.
                – fleix
                Dec 18 at 15:10












              • You are right, @fleix, it's fixed now
                – dquijada
                Dec 18 at 15:14








              • 1




                You can do the search using binary search in O(log N).
                – rici
                Dec 18 at 20:25






              • 1




                @rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in O(n / max_n) time on average, but worst case n (no gaps until the end.)
                – Peter Cordes
                Dec 18 at 23:50






              • 2




                @peter: the same way you do a linear search, compare i + min with vec[i]. If they are unequal, there is a missing value before i, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
                – rici
                Dec 19 at 2:05
















              1












              1








              1






              Step 1: Sort the vector.



              That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.



              Step 2: Find the first int not in the vector.



              Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:



              Pseudocode:



              SIZE = 40|400|4000 // The one you are using
              for (int i = 0; i < SIZE; i++) {
              if (array[i] != INT_MIN + i)
              return INT_MIN + i;


              The solution would be O(n log(n) + n) meaning: O(n log(n))





              Edit: just read your edit asking for something better than O(n log(n)), sorry.






              share|improve this answer














              Step 1: Sort the vector.



              That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.



              Step 2: Find the first int not in the vector.



              Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:



              Pseudocode:



              SIZE = 40|400|4000 // The one you are using
              for (int i = 0; i < SIZE; i++) {
              if (array[i] != INT_MIN + i)
              return INT_MIN + i;


              The solution would be O(n log(n) + n) meaning: O(n log(n))





              Edit: just read your edit asking for something better than O(n log(n)), sorry.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Dec 18 at 15:14

























              answered Dec 18 at 14:52









              dquijada

              865820




              865820












              • Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem array[i] would make any sense when i starts from INT_MIN.
                – fleix
                Dec 18 at 15:10












              • You are right, @fleix, it's fixed now
                – dquijada
                Dec 18 at 15:14








              • 1




                You can do the search using binary search in O(log N).
                – rici
                Dec 18 at 20:25






              • 1




                @rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in O(n / max_n) time on average, but worst case n (no gaps until the end.)
                – Peter Cordes
                Dec 18 at 23:50






              • 2




                @peter: the same way you do a linear search, compare i + min with vec[i]. If they are unequal, there is a missing value before i, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
                – rici
                Dec 19 at 2:05




















              • Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem array[i] would make any sense when i starts from INT_MIN.
                – fleix
                Dec 18 at 15:10












              • You are right, @fleix, it's fixed now
                – dquijada
                Dec 18 at 15:14








              • 1




                You can do the search using binary search in O(log N).
                – rici
                Dec 18 at 20:25






              • 1




                @rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in O(n / max_n) time on average, but worst case n (no gaps until the end.)
                – Peter Cordes
                Dec 18 at 23:50






              • 2




                @peter: the same way you do a linear search, compare i + min with vec[i]. If they are unequal, there is a missing value before i, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
                – rici
                Dec 19 at 2:05


















              Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem array[i] would make any sense when i starts from INT_MIN.
              – fleix
              Dec 18 at 15:10






              Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem array[i] would make any sense when i starts from INT_MIN.
              – fleix
              Dec 18 at 15:10














              You are right, @fleix, it's fixed now
              – dquijada
              Dec 18 at 15:14






              You are right, @fleix, it's fixed now
              – dquijada
              Dec 18 at 15:14






              1




              1




              You can do the search using binary search in O(log N).
              – rici
              Dec 18 at 20:25




              You can do the search using binary search in O(log N).
              – rici
              Dec 18 at 20:25




              1




              1




              @rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in O(n / max_n) time on average, but worst case n (no gaps until the end.)
              – Peter Cordes
              Dec 18 at 23:50




              @rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in O(n / max_n) time on average, but worst case n (no gaps until the end.)
              – Peter Cordes
              Dec 18 at 23:50




              2




              2




              @peter: the same way you do a linear search, compare i + min with vec[i]. If they are unequal, there is a missing value before i, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
              – rici
              Dec 19 at 2:05






              @peter: the same way you do a linear search, compare i + min with vec[i]. If they are unequal, there is a missing value before i, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
              – rici
              Dec 19 at 2:05













              0














              For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.



              The space complexity of this approach would be O(1).





              If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:



              int find_lowest_not_contained(const std::unordered_set<int>& set) {
              for (auto i = std::numeric_limits<int>::min(); ; ++i) {
              auto it = set.find(i); // search in set
              if (it == set.end()) // integer not in set?
              return *it;
              }
              }


              Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:



              int find_greatest_not_contained(const std::unordered_set<int>& set) {
              for (auto i = std::numeric_limits<int>::max(); ; --i) {
              auto it = set.find(i); // search in set
              if (it == set.end()) // integer not in set?
              return *it;
              }
              }




              Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).



              Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.






              share|improve this answer




























                0














                For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.



                The space complexity of this approach would be O(1).





                If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:



                int find_lowest_not_contained(const std::unordered_set<int>& set) {
                for (auto i = std::numeric_limits<int>::min(); ; ++i) {
                auto it = set.find(i); // search in set
                if (it == set.end()) // integer not in set?
                return *it;
                }
                }


                Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:



                int find_greatest_not_contained(const std::unordered_set<int>& set) {
                for (auto i = std::numeric_limits<int>::max(); ; --i) {
                auto it = set.find(i); // search in set
                if (it == set.end()) // integer not in set?
                return *it;
                }
                }




                Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).



                Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.






                share|improve this answer


























                  0












                  0








                  0






                  For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.



                  The space complexity of this approach would be O(1).





                  If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:



                  int find_lowest_not_contained(const std::unordered_set<int>& set) {
                  for (auto i = std::numeric_limits<int>::min(); ; ++i) {
                  auto it = set.find(i); // search in set
                  if (it == set.end()) // integer not in set?
                  return *it;
                  }
                  }


                  Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:



                  int find_greatest_not_contained(const std::unordered_set<int>& set) {
                  for (auto i = std::numeric_limits<int>::max(); ; --i) {
                  auto it = set.find(i); // search in set
                  if (it == set.end()) // integer not in set?
                  return *it;
                  }
                  }




                  Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).



                  Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.






                  share|improve this answer














                  For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.



                  The space complexity of this approach would be O(1).





                  If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:



                  int find_lowest_not_contained(const std::unordered_set<int>& set) {
                  for (auto i = std::numeric_limits<int>::min(); ; ++i) {
                  auto it = set.find(i); // search in set
                  if (it == set.end()) // integer not in set?
                  return *it;
                  }
                  }


                  Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:



                  int find_greatest_not_contained(const std::unordered_set<int>& set) {
                  for (auto i = std::numeric_limits<int>::max(); ; --i) {
                  auto it = set.find(i); // search in set
                  if (it == set.end()) // integer not in set?
                  return *it;
                  }
                  }




                  Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).



                  Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Dec 19 at 9:50

























                  answered Dec 19 at 9:18









                  ネロク

                  10.5k32140




                  10.5k32140






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.





                      Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                      Please pay close attention to the following guidance:


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53835354%2fefficiently-find-an-integer-not-in-a-set-of-size-40-400-or-4000%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      "Incorrect syntax near the keyword 'ON'. (on update cascade, on delete cascade,)

                      Alcedinidae

                      RAC Tourist Trophy