incremental loading with new + update no delete in redshift columns that are updated at different times












0















I have a the following tables as an example:




  1. Events(id, business_uuid, event_category, event_name, created,
    row_updated)

  2. Bank_accounts(id, business_uuid, bank_name, Account_no, etc, created, row_updated)

  3. orders (id, uuid, business_uuid, qty, item, created, row_updated)


where created and row_updated are timestamps. for each business I want to capture the 1st time something happened for that business in each table. I am on redshift. I have my query like:



with universal as(
select business_uuid from events
union
select business_uuid from bank_accounts
union
select business_uuid from orders
)
, ev as(
select business_uuid
, min(case when name = 'created_account' then created end) as created_account
, min(case when name = 'enter_business_detail' then created end) as enter_business_detail
, min(case when name = 'accepted_terms' then created end) as accepted_terms
from events
group by 1
)
, bank as(
select business_uuid
, min(row_updated) as entered_banking_details
from Bank_accounts
group by 1
)
, od as(
select business_uuid
, min(created) as first_order
from orders
group by 1
)
select u.business_uuid
natural left join ev
natural left join bank
natural left join od



giving me an output of:
result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order)



the way these events occur is that they may get filled at different times, if I am pulling the data every 10 minutes, some columns will be null and later be filled in, I am dealing with a large table and I cant do full loads each time, I would like to do incremental loads (new + update) and hopefully no delete. I introduced a column last_update here result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order, last update)



and I rewrote the query above as:

with universal as(
select business_uuid from events
union
select business_uuid from bank_accounts
union
select business_uuid from orders
)
, ev as(
select business_uuid
, min(case when name = 'created_account' then created end) as created_account
, min(case when name = 'enter_business_detail' then created end) as enter_business_detail
, min(case when name = 'accepted_terms' then created end) as accepted_terms
from events
where row_updated > (select max(last_updated) from result)
group by 1
)
, bank as(
select business_uuid
, min(row_updated) as entered_banking_details
from Bank_accounts
where row_updated > (select max(last_updated) from result)
group by 1
)
, od as(
select business_uuid
, min(created) as first_order
from orders
where row_updated > (select max(last_updated) from result)
group by 1
)
select u.business_uuid
natural left join ev
natural left join bank
natural left join od

It does not work out when I compare my full load plus the attempted incremental load, there is slight difference in row counts.



if you find the above in clear manner, I would like a suggestion on how to best solve it.



for anything further info, I use Apache-Airflow as my ETL tool to execute my sql statements










share|improve this question



























    0















    I have a the following tables as an example:




    1. Events(id, business_uuid, event_category, event_name, created,
      row_updated)

    2. Bank_accounts(id, business_uuid, bank_name, Account_no, etc, created, row_updated)

    3. orders (id, uuid, business_uuid, qty, item, created, row_updated)


    where created and row_updated are timestamps. for each business I want to capture the 1st time something happened for that business in each table. I am on redshift. I have my query like:



    with universal as(
    select business_uuid from events
    union
    select business_uuid from bank_accounts
    union
    select business_uuid from orders
    )
    , ev as(
    select business_uuid
    , min(case when name = 'created_account' then created end) as created_account
    , min(case when name = 'enter_business_detail' then created end) as enter_business_detail
    , min(case when name = 'accepted_terms' then created end) as accepted_terms
    from events
    group by 1
    )
    , bank as(
    select business_uuid
    , min(row_updated) as entered_banking_details
    from Bank_accounts
    group by 1
    )
    , od as(
    select business_uuid
    , min(created) as first_order
    from orders
    group by 1
    )
    select u.business_uuid
    natural left join ev
    natural left join bank
    natural left join od



    giving me an output of:
    result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order)



    the way these events occur is that they may get filled at different times, if I am pulling the data every 10 minutes, some columns will be null and later be filled in, I am dealing with a large table and I cant do full loads each time, I would like to do incremental loads (new + update) and hopefully no delete. I introduced a column last_update here result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order, last update)



    and I rewrote the query above as:

    with universal as(
    select business_uuid from events
    union
    select business_uuid from bank_accounts
    union
    select business_uuid from orders
    )
    , ev as(
    select business_uuid
    , min(case when name = 'created_account' then created end) as created_account
    , min(case when name = 'enter_business_detail' then created end) as enter_business_detail
    , min(case when name = 'accepted_terms' then created end) as accepted_terms
    from events
    where row_updated > (select max(last_updated) from result)
    group by 1
    )
    , bank as(
    select business_uuid
    , min(row_updated) as entered_banking_details
    from Bank_accounts
    where row_updated > (select max(last_updated) from result)
    group by 1
    )
    , od as(
    select business_uuid
    , min(created) as first_order
    from orders
    where row_updated > (select max(last_updated) from result)
    group by 1
    )
    select u.business_uuid
    natural left join ev
    natural left join bank
    natural left join od

    It does not work out when I compare my full load plus the attempted incremental load, there is slight difference in row counts.



    if you find the above in clear manner, I would like a suggestion on how to best solve it.



    for anything further info, I use Apache-Airflow as my ETL tool to execute my sql statements










    share|improve this question

























      0












      0








      0








      I have a the following tables as an example:




      1. Events(id, business_uuid, event_category, event_name, created,
        row_updated)

      2. Bank_accounts(id, business_uuid, bank_name, Account_no, etc, created, row_updated)

      3. orders (id, uuid, business_uuid, qty, item, created, row_updated)


      where created and row_updated are timestamps. for each business I want to capture the 1st time something happened for that business in each table. I am on redshift. I have my query like:



      with universal as(
      select business_uuid from events
      union
      select business_uuid from bank_accounts
      union
      select business_uuid from orders
      )
      , ev as(
      select business_uuid
      , min(case when name = 'created_account' then created end) as created_account
      , min(case when name = 'enter_business_detail' then created end) as enter_business_detail
      , min(case when name = 'accepted_terms' then created end) as accepted_terms
      from events
      group by 1
      )
      , bank as(
      select business_uuid
      , min(row_updated) as entered_banking_details
      from Bank_accounts
      group by 1
      )
      , od as(
      select business_uuid
      , min(created) as first_order
      from orders
      group by 1
      )
      select u.business_uuid
      natural left join ev
      natural left join bank
      natural left join od



      giving me an output of:
      result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order)



      the way these events occur is that they may get filled at different times, if I am pulling the data every 10 minutes, some columns will be null and later be filled in, I am dealing with a large table and I cant do full loads each time, I would like to do incremental loads (new + update) and hopefully no delete. I introduced a column last_update here result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order, last update)



      and I rewrote the query above as:

      with universal as(
      select business_uuid from events
      union
      select business_uuid from bank_accounts
      union
      select business_uuid from orders
      )
      , ev as(
      select business_uuid
      , min(case when name = 'created_account' then created end) as created_account
      , min(case when name = 'enter_business_detail' then created end) as enter_business_detail
      , min(case when name = 'accepted_terms' then created end) as accepted_terms
      from events
      where row_updated > (select max(last_updated) from result)
      group by 1
      )
      , bank as(
      select business_uuid
      , min(row_updated) as entered_banking_details
      from Bank_accounts
      where row_updated > (select max(last_updated) from result)
      group by 1
      )
      , od as(
      select business_uuid
      , min(created) as first_order
      from orders
      where row_updated > (select max(last_updated) from result)
      group by 1
      )
      select u.business_uuid
      natural left join ev
      natural left join bank
      natural left join od

      It does not work out when I compare my full load plus the attempted incremental load, there is slight difference in row counts.



      if you find the above in clear manner, I would like a suggestion on how to best solve it.



      for anything further info, I use Apache-Airflow as my ETL tool to execute my sql statements










      share|improve this question














      I have a the following tables as an example:




      1. Events(id, business_uuid, event_category, event_name, created,
        row_updated)

      2. Bank_accounts(id, business_uuid, bank_name, Account_no, etc, created, row_updated)

      3. orders (id, uuid, business_uuid, qty, item, created, row_updated)


      where created and row_updated are timestamps. for each business I want to capture the 1st time something happened for that business in each table. I am on redshift. I have my query like:



      with universal as(
      select business_uuid from events
      union
      select business_uuid from bank_accounts
      union
      select business_uuid from orders
      )
      , ev as(
      select business_uuid
      , min(case when name = 'created_account' then created end) as created_account
      , min(case when name = 'enter_business_detail' then created end) as enter_business_detail
      , min(case when name = 'accepted_terms' then created end) as accepted_terms
      from events
      group by 1
      )
      , bank as(
      select business_uuid
      , min(row_updated) as entered_banking_details
      from Bank_accounts
      group by 1
      )
      , od as(
      select business_uuid
      , min(created) as first_order
      from orders
      group by 1
      )
      select u.business_uuid
      natural left join ev
      natural left join bank
      natural left join od



      giving me an output of:
      result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order)



      the way these events occur is that they may get filled at different times, if I am pulling the data every 10 minutes, some columns will be null and later be filled in, I am dealing with a large table and I cant do full loads each time, I would like to do incremental loads (new + update) and hopefully no delete. I introduced a column last_update here result(business_uuid, created_account,enter_business_detail, accepted_terms, entered_banking_details,first_order, last update)



      and I rewrote the query above as:

      with universal as(
      select business_uuid from events
      union
      select business_uuid from bank_accounts
      union
      select business_uuid from orders
      )
      , ev as(
      select business_uuid
      , min(case when name = 'created_account' then created end) as created_account
      , min(case when name = 'enter_business_detail' then created end) as enter_business_detail
      , min(case when name = 'accepted_terms' then created end) as accepted_terms
      from events
      where row_updated > (select max(last_updated) from result)
      group by 1
      )
      , bank as(
      select business_uuid
      , min(row_updated) as entered_banking_details
      from Bank_accounts
      where row_updated > (select max(last_updated) from result)
      group by 1
      )
      , od as(
      select business_uuid
      , min(created) as first_order
      from orders
      where row_updated > (select max(last_updated) from result)
      group by 1
      )
      select u.business_uuid
      natural left join ev
      natural left join bank
      natural left join od

      It does not work out when I compare my full load plus the attempted incremental load, there is slight difference in row counts.



      if you find the above in clear manner, I would like a suggestion on how to best solve it.



      for anything further info, I use Apache-Airflow as my ETL tool to execute my sql statements







      sql amazon-redshift incremental-build






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 21 '18 at 12:36









      PlengoPlengo

      479




      479
























          0






          active

          oldest

          votes











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53412185%2fincremental-loading-with-new-update-no-delete-in-redshift-columns-that-are-upd%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          0






          active

          oldest

          votes








          0






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes
















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53412185%2fincremental-loading-with-new-update-no-delete-in-redshift-columns-that-are-upd%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          "Incorrect syntax near the keyword 'ON'. (on update cascade, on delete cascade,)

          Alcedinidae

          Origin of the phrase “under your belt”?