What is a scalable way to simulate HASHBYTES using a SQL CLR scalar function?












11















As part of our ETL process, we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.



The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.



Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent MAXDOP 8 queries from 1 - 12. Testing with MAXDOP 1 showed the same scalability bottleneck.



As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:




  • The function must be able to participate in parallel queries

  • The function must be deterministic

  • The function must take an input of an NVARCHAR or VARBINARY string (all relevant columns are concatenated together)

  • The typical input size of the string will be 100 - 20000 characters in length. 20000 is not a max

  • The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.

  • The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)


For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).



What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case:



DROP TABLE IF EXISTS #CHANGED_IDS;

SELECT stg.ID INTO #CHANGED_IDS
FROM (
SELECT ID,
CAST( HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)))
AS BINARY(32)) HASH1
FROM HB_TBL WITH (TABLOCK)
) stg
INNER JOIN (
SELECT ID,
CAST(HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)) )
AS BINARY(32)) HASH1
FROM HB_TBL_2 WITH (TABLOCK)
) rpt ON rpt.ID = stg.ID
WHERE rpt.HASH1 <> stg.HASH1
OPTION (MAXDOP 8);


To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post results with HASHBYTES on Monday:



CREATE TABLE dbo.HASH_ME (
ID BIGINT NOT NULL,
FK1 BIGINT NOT NULL,
FK2 BIGINT NOT NULL,
FK3 BIGINT NOT NULL,
FK4 BIGINT NOT NULL,
FK5 BIGINT NOT NULL,
FK6 BIGINT NOT NULL,
FK7 BIGINT NOT NULL,
FK8 BIGINT NOT NULL,
FK9 BIGINT NOT NULL,
FK10 BIGINT NOT NULL,
FK11 BIGINT NOT NULL,
FK12 BIGINT NOT NULL,
FK13 BIGINT NOT NULL,
FK14 BIGINT NOT NULL,
FK15 BIGINT NOT NULL,
STR1 NVARCHAR(500) NOT NULL,
STR2 NVARCHAR(500) NOT NULL,
STR3 NVARCHAR(500) NOT NULL,
STR4 NVARCHAR(500) NOT NULL,
STR5 NVARCHAR(2000) NOT NULL,
COMP1 TINYINT NOT NULL,
COMP2 TINYINT NOT NULL,
COMP3 TINYINT NOT NULL,
COMP4 TINYINT NOT NULL,
COMP5 TINYINT NOT NULL
);

INSERT INTO dbo.HASH_ME WITH (TABLOCK)
SELECT RN,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 1000),
0,1,0,1,0
FROM (
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

SELECT MAX(HASHBYTES('SHA2_256',
CAST(N'' AS NVARCHAR(MAX)) + N'|' +
CAST(FK1 AS NVARCHAR(19)) + N'|' +
CAST(FK2 AS NVARCHAR(19)) + N'|' +
CAST(FK3 AS NVARCHAR(19)) + N'|' +
CAST(FK4 AS NVARCHAR(19)) + N'|' +
CAST(FK5 AS NVARCHAR(19)) + N'|' +
CAST(FK6 AS NVARCHAR(19)) + N'|' +
CAST(FK7 AS NVARCHAR(19)) + N'|' +
CAST(FK8 AS NVARCHAR(19)) + N'|' +
CAST(FK9 AS NVARCHAR(19)) + N'|' +
CAST(FK10 AS NVARCHAR(19)) + N'|' +
CAST(FK11 AS NVARCHAR(19)) + N'|' +
CAST(FK12 AS NVARCHAR(19)) + N'|' +
CAST(FK13 AS NVARCHAR(19)) + N'|' +
CAST(FK14 AS NVARCHAR(19)) + N'|' +
CAST(FK15 AS NVARCHAR(19)) + N'|' +
CAST(STR1 AS NVARCHAR(500)) + N'|' +
CAST(STR2 AS NVARCHAR(500)) + N'|' +
CAST(STR3 AS NVARCHAR(500)) + N'|' +
CAST(STR4 AS NVARCHAR(500)) + N'|' +
CAST(STR5 AS NVARCHAR(2000)) + N'|' +
CAST(COMP1 AS NVARCHAR(1)) + N'|' +
CAST(COMP2 AS NVARCHAR(1)) + N'|' +
CAST(COMP3 AS NVARCHAR(1)) + N'|' +
CAST(COMP4 AS NVARCHAR(1)) + N'|' +
CAST(COMP5 AS NVARCHAR(1)) )
)
FROM dbo.HASH_ME
OPTION (MAXDOP 1);









share|improve this question

























  • Would adding a Temporal to reporting tables be an option here?

    – dean
    19 hours ago
















11















As part of our ETL process, we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.



The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.



Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent MAXDOP 8 queries from 1 - 12. Testing with MAXDOP 1 showed the same scalability bottleneck.



As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:




  • The function must be able to participate in parallel queries

  • The function must be deterministic

  • The function must take an input of an NVARCHAR or VARBINARY string (all relevant columns are concatenated together)

  • The typical input size of the string will be 100 - 20000 characters in length. 20000 is not a max

  • The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.

  • The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)


For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).



What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case:



DROP TABLE IF EXISTS #CHANGED_IDS;

SELECT stg.ID INTO #CHANGED_IDS
FROM (
SELECT ID,
CAST( HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)))
AS BINARY(32)) HASH1
FROM HB_TBL WITH (TABLOCK)
) stg
INNER JOIN (
SELECT ID,
CAST(HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)) )
AS BINARY(32)) HASH1
FROM HB_TBL_2 WITH (TABLOCK)
) rpt ON rpt.ID = stg.ID
WHERE rpt.HASH1 <> stg.HASH1
OPTION (MAXDOP 8);


To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post results with HASHBYTES on Monday:



CREATE TABLE dbo.HASH_ME (
ID BIGINT NOT NULL,
FK1 BIGINT NOT NULL,
FK2 BIGINT NOT NULL,
FK3 BIGINT NOT NULL,
FK4 BIGINT NOT NULL,
FK5 BIGINT NOT NULL,
FK6 BIGINT NOT NULL,
FK7 BIGINT NOT NULL,
FK8 BIGINT NOT NULL,
FK9 BIGINT NOT NULL,
FK10 BIGINT NOT NULL,
FK11 BIGINT NOT NULL,
FK12 BIGINT NOT NULL,
FK13 BIGINT NOT NULL,
FK14 BIGINT NOT NULL,
FK15 BIGINT NOT NULL,
STR1 NVARCHAR(500) NOT NULL,
STR2 NVARCHAR(500) NOT NULL,
STR3 NVARCHAR(500) NOT NULL,
STR4 NVARCHAR(500) NOT NULL,
STR5 NVARCHAR(2000) NOT NULL,
COMP1 TINYINT NOT NULL,
COMP2 TINYINT NOT NULL,
COMP3 TINYINT NOT NULL,
COMP4 TINYINT NOT NULL,
COMP5 TINYINT NOT NULL
);

INSERT INTO dbo.HASH_ME WITH (TABLOCK)
SELECT RN,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 1000),
0,1,0,1,0
FROM (
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

SELECT MAX(HASHBYTES('SHA2_256',
CAST(N'' AS NVARCHAR(MAX)) + N'|' +
CAST(FK1 AS NVARCHAR(19)) + N'|' +
CAST(FK2 AS NVARCHAR(19)) + N'|' +
CAST(FK3 AS NVARCHAR(19)) + N'|' +
CAST(FK4 AS NVARCHAR(19)) + N'|' +
CAST(FK5 AS NVARCHAR(19)) + N'|' +
CAST(FK6 AS NVARCHAR(19)) + N'|' +
CAST(FK7 AS NVARCHAR(19)) + N'|' +
CAST(FK8 AS NVARCHAR(19)) + N'|' +
CAST(FK9 AS NVARCHAR(19)) + N'|' +
CAST(FK10 AS NVARCHAR(19)) + N'|' +
CAST(FK11 AS NVARCHAR(19)) + N'|' +
CAST(FK12 AS NVARCHAR(19)) + N'|' +
CAST(FK13 AS NVARCHAR(19)) + N'|' +
CAST(FK14 AS NVARCHAR(19)) + N'|' +
CAST(FK15 AS NVARCHAR(19)) + N'|' +
CAST(STR1 AS NVARCHAR(500)) + N'|' +
CAST(STR2 AS NVARCHAR(500)) + N'|' +
CAST(STR3 AS NVARCHAR(500)) + N'|' +
CAST(STR4 AS NVARCHAR(500)) + N'|' +
CAST(STR5 AS NVARCHAR(2000)) + N'|' +
CAST(COMP1 AS NVARCHAR(1)) + N'|' +
CAST(COMP2 AS NVARCHAR(1)) + N'|' +
CAST(COMP3 AS NVARCHAR(1)) + N'|' +
CAST(COMP4 AS NVARCHAR(1)) + N'|' +
CAST(COMP5 AS NVARCHAR(1)) )
)
FROM dbo.HASH_ME
OPTION (MAXDOP 1);









share|improve this question

























  • Would adding a Temporal to reporting tables be an option here?

    – dean
    19 hours ago














11












11








11


2






As part of our ETL process, we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.



The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.



Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent MAXDOP 8 queries from 1 - 12. Testing with MAXDOP 1 showed the same scalability bottleneck.



As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:




  • The function must be able to participate in parallel queries

  • The function must be deterministic

  • The function must take an input of an NVARCHAR or VARBINARY string (all relevant columns are concatenated together)

  • The typical input size of the string will be 100 - 20000 characters in length. 20000 is not a max

  • The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.

  • The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)


For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).



What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case:



DROP TABLE IF EXISTS #CHANGED_IDS;

SELECT stg.ID INTO #CHANGED_IDS
FROM (
SELECT ID,
CAST( HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)))
AS BINARY(32)) HASH1
FROM HB_TBL WITH (TABLOCK)
) stg
INNER JOIN (
SELECT ID,
CAST(HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)) )
AS BINARY(32)) HASH1
FROM HB_TBL_2 WITH (TABLOCK)
) rpt ON rpt.ID = stg.ID
WHERE rpt.HASH1 <> stg.HASH1
OPTION (MAXDOP 8);


To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post results with HASHBYTES on Monday:



CREATE TABLE dbo.HASH_ME (
ID BIGINT NOT NULL,
FK1 BIGINT NOT NULL,
FK2 BIGINT NOT NULL,
FK3 BIGINT NOT NULL,
FK4 BIGINT NOT NULL,
FK5 BIGINT NOT NULL,
FK6 BIGINT NOT NULL,
FK7 BIGINT NOT NULL,
FK8 BIGINT NOT NULL,
FK9 BIGINT NOT NULL,
FK10 BIGINT NOT NULL,
FK11 BIGINT NOT NULL,
FK12 BIGINT NOT NULL,
FK13 BIGINT NOT NULL,
FK14 BIGINT NOT NULL,
FK15 BIGINT NOT NULL,
STR1 NVARCHAR(500) NOT NULL,
STR2 NVARCHAR(500) NOT NULL,
STR3 NVARCHAR(500) NOT NULL,
STR4 NVARCHAR(500) NOT NULL,
STR5 NVARCHAR(2000) NOT NULL,
COMP1 TINYINT NOT NULL,
COMP2 TINYINT NOT NULL,
COMP3 TINYINT NOT NULL,
COMP4 TINYINT NOT NULL,
COMP5 TINYINT NOT NULL
);

INSERT INTO dbo.HASH_ME WITH (TABLOCK)
SELECT RN,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 1000),
0,1,0,1,0
FROM (
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

SELECT MAX(HASHBYTES('SHA2_256',
CAST(N'' AS NVARCHAR(MAX)) + N'|' +
CAST(FK1 AS NVARCHAR(19)) + N'|' +
CAST(FK2 AS NVARCHAR(19)) + N'|' +
CAST(FK3 AS NVARCHAR(19)) + N'|' +
CAST(FK4 AS NVARCHAR(19)) + N'|' +
CAST(FK5 AS NVARCHAR(19)) + N'|' +
CAST(FK6 AS NVARCHAR(19)) + N'|' +
CAST(FK7 AS NVARCHAR(19)) + N'|' +
CAST(FK8 AS NVARCHAR(19)) + N'|' +
CAST(FK9 AS NVARCHAR(19)) + N'|' +
CAST(FK10 AS NVARCHAR(19)) + N'|' +
CAST(FK11 AS NVARCHAR(19)) + N'|' +
CAST(FK12 AS NVARCHAR(19)) + N'|' +
CAST(FK13 AS NVARCHAR(19)) + N'|' +
CAST(FK14 AS NVARCHAR(19)) + N'|' +
CAST(FK15 AS NVARCHAR(19)) + N'|' +
CAST(STR1 AS NVARCHAR(500)) + N'|' +
CAST(STR2 AS NVARCHAR(500)) + N'|' +
CAST(STR3 AS NVARCHAR(500)) + N'|' +
CAST(STR4 AS NVARCHAR(500)) + N'|' +
CAST(STR5 AS NVARCHAR(2000)) + N'|' +
CAST(COMP1 AS NVARCHAR(1)) + N'|' +
CAST(COMP2 AS NVARCHAR(1)) + N'|' +
CAST(COMP3 AS NVARCHAR(1)) + N'|' +
CAST(COMP4 AS NVARCHAR(1)) + N'|' +
CAST(COMP5 AS NVARCHAR(1)) )
)
FROM dbo.HASH_ME
OPTION (MAXDOP 1);









share|improve this question
















As part of our ETL process, we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.



The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.



Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent MAXDOP 8 queries from 1 - 12. Testing with MAXDOP 1 showed the same scalability bottleneck.



As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:




  • The function must be able to participate in parallel queries

  • The function must be deterministic

  • The function must take an input of an NVARCHAR or VARBINARY string (all relevant columns are concatenated together)

  • The typical input size of the string will be 100 - 20000 characters in length. 20000 is not a max

  • The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.

  • The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)


For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).



What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case:



DROP TABLE IF EXISTS #CHANGED_IDS;

SELECT stg.ID INTO #CHANGED_IDS
FROM (
SELECT ID,
CAST( HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)))
AS BINARY(32)) HASH1
FROM HB_TBL WITH (TABLOCK)
) stg
INNER JOIN (
SELECT ID,
CAST(HASHBYTES ('SHA2_256',
CAST(FK1 AS NVARCHAR(19)) +
CAST(FK2 AS NVARCHAR(19)) +
CAST(FK3 AS NVARCHAR(19)) +
CAST(FK4 AS NVARCHAR(19)) +
CAST(FK5 AS NVARCHAR(19)) +
CAST(FK6 AS NVARCHAR(19)) +
CAST(FK7 AS NVARCHAR(19)) +
CAST(FK8 AS NVARCHAR(19)) +
CAST(FK9 AS NVARCHAR(19)) +
CAST(FK10 AS NVARCHAR(19)) +
CAST(FK11 AS NVARCHAR(19)) +
CAST(FK12 AS NVARCHAR(19)) +
CAST(FK13 AS NVARCHAR(19)) +
CAST(FK14 AS NVARCHAR(19)) +
CAST(FK15 AS NVARCHAR(19)) +
CAST(STR1 AS NVARCHAR(500)) +
CAST(STR2 AS NVARCHAR(500)) +
CAST(STR3 AS NVARCHAR(500)) +
CAST(STR4 AS NVARCHAR(500)) +
CAST(STR5 AS NVARCHAR(500)) +
CAST(COMP1 AS NVARCHAR(1)) +
CAST(COMP2 AS NVARCHAR(1)) +
CAST(COMP3 AS NVARCHAR(1)) +
CAST(COMP4 AS NVARCHAR(1)) +
CAST(COMP5 AS NVARCHAR(1)) )
AS BINARY(32)) HASH1
FROM HB_TBL_2 WITH (TABLOCK)
) rpt ON rpt.ID = stg.ID
WHERE rpt.HASH1 <> stg.HASH1
OPTION (MAXDOP 8);


To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post results with HASHBYTES on Monday:



CREATE TABLE dbo.HASH_ME (
ID BIGINT NOT NULL,
FK1 BIGINT NOT NULL,
FK2 BIGINT NOT NULL,
FK3 BIGINT NOT NULL,
FK4 BIGINT NOT NULL,
FK5 BIGINT NOT NULL,
FK6 BIGINT NOT NULL,
FK7 BIGINT NOT NULL,
FK8 BIGINT NOT NULL,
FK9 BIGINT NOT NULL,
FK10 BIGINT NOT NULL,
FK11 BIGINT NOT NULL,
FK12 BIGINT NOT NULL,
FK13 BIGINT NOT NULL,
FK14 BIGINT NOT NULL,
FK15 BIGINT NOT NULL,
STR1 NVARCHAR(500) NOT NULL,
STR2 NVARCHAR(500) NOT NULL,
STR3 NVARCHAR(500) NOT NULL,
STR4 NVARCHAR(500) NOT NULL,
STR5 NVARCHAR(2000) NOT NULL,
COMP1 TINYINT NOT NULL,
COMP2 TINYINT NOT NULL,
COMP3 TINYINT NOT NULL,
COMP4 TINYINT NOT NULL,
COMP5 TINYINT NOT NULL
);

INSERT INTO dbo.HASH_ME WITH (TABLOCK)
SELECT RN,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000, RN % 1000000,
REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 30)
,REPLICATE(CHAR(65 + RN % 10 ), 1000),
0,1,0,1,0
FROM (
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

SELECT MAX(HASHBYTES('SHA2_256',
CAST(N'' AS NVARCHAR(MAX)) + N'|' +
CAST(FK1 AS NVARCHAR(19)) + N'|' +
CAST(FK2 AS NVARCHAR(19)) + N'|' +
CAST(FK3 AS NVARCHAR(19)) + N'|' +
CAST(FK4 AS NVARCHAR(19)) + N'|' +
CAST(FK5 AS NVARCHAR(19)) + N'|' +
CAST(FK6 AS NVARCHAR(19)) + N'|' +
CAST(FK7 AS NVARCHAR(19)) + N'|' +
CAST(FK8 AS NVARCHAR(19)) + N'|' +
CAST(FK9 AS NVARCHAR(19)) + N'|' +
CAST(FK10 AS NVARCHAR(19)) + N'|' +
CAST(FK11 AS NVARCHAR(19)) + N'|' +
CAST(FK12 AS NVARCHAR(19)) + N'|' +
CAST(FK13 AS NVARCHAR(19)) + N'|' +
CAST(FK14 AS NVARCHAR(19)) + N'|' +
CAST(FK15 AS NVARCHAR(19)) + N'|' +
CAST(STR1 AS NVARCHAR(500)) + N'|' +
CAST(STR2 AS NVARCHAR(500)) + N'|' +
CAST(STR3 AS NVARCHAR(500)) + N'|' +
CAST(STR4 AS NVARCHAR(500)) + N'|' +
CAST(STR5 AS NVARCHAR(2000)) + N'|' +
CAST(COMP1 AS NVARCHAR(1)) + N'|' +
CAST(COMP2 AS NVARCHAR(1)) + N'|' +
CAST(COMP3 AS NVARCHAR(1)) + N'|' +
CAST(COMP4 AS NVARCHAR(1)) + N'|' +
CAST(COMP5 AS NVARCHAR(1)) )
)
FROM dbo.HASH_ME
OPTION (MAXDOP 1);






sql-server sql-server-2016 etl sql-clr hashing






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited yesterday







Joe Obbish

















asked yesterday









Joe ObbishJoe Obbish

20.9k32883




20.9k32883













  • Would adding a Temporal to reporting tables be an option here?

    – dean
    19 hours ago



















  • Would adding a Temporal to reporting tables be an option here?

    – dean
    19 hours ago

















Would adding a Temporal to reporting tables be an option here?

– dean
19 hours ago





Would adding a Temporal to reporting tables be an option here?

– dean
19 hours ago










2 Answers
2






active

oldest

votes


















7














I'm not sure if parallelism will be any / significantly better with SQLCLR. However, it is really easy to test since there is a hash function in the Free version of the SQL# SQLCLR library (which I wrote) called Util_HashBinary. Supported algorithms are: MD5, SHA1, SHA256, SHA384, and SHA512.



It takes a VARBINARY(MAX) value as input, so you can either concatenate the string version of each field (as you are currently doing) and then convert to VARBINARY(MAX), or you can go directly to VARBINARY for each column and concatenate the converted values (this might be faster since you aren't dealing with strings or the extra conversion from string to VARBINARY). Below is an example showing both of these options. It also shows the HASHBYTES function so you can see that the values are the same between it and SQL#.Util_HashBinary.



Please note that the hash results when concatenating the VARBINARY values won't match the hash results when concatenating the NVARCHAR values. This is because the binary form of the INT value "1" is 0x00000001, while the UTF-16LE (i.e. NVARCHAR) form of the INT value of "1" (in binary form since that is what a hashing function will operate on) is 0x3100.



SELECT so.[object_id],
SQL#.Util_HashBinary(N'SHA256',
CONVERT(VARBINARY(MAX),
CONCAT(so.[name], so.[schema_id], so.[create_date])
)
) AS [SQLCLR-ConcatStrings],
HASHBYTES(N'SHA2_256',
CONVERT(VARBINARY(MAX),
CONCAT(so.[name], so.[schema_id], so.[create_date])
)
) AS [BuiltIn-ConcatStrings]
FROM sys.objects so;


SELECT so.[object_id],
SQL#.Util_HashBinary(N'SHA256',
CONVERT(VARBINARY(500), so.[name]) +
CONVERT(VARBINARY(500), so.[schema_id]) +
CONVERT(VARBINARY(500), so.[create_date])
) AS [SQLCLR-ConcatVarBinaries],
HASHBYTES(N'SHA2_256',
CONVERT(VARBINARY(500), so.[name]) +
CONVERT(VARBINARY(500), so.[schema_id]) +
CONVERT(VARBINARY(500), so.[create_date])
) AS [BuiltIn-ConcatVarBinaries]
FROM sys.objects so;


Beyond that aspect of the question, there are some additional thoughts that might help this process that are not related to SQLCLR. You mentioned a few things:






  1. we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.




    and:





  2. I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns




    and:




  3. the tables can be updated outside of the ETL process




It sounds like the data in this reporting table is stable for a period of time, and is only modified by this ETL process.



If nothing else modifies this table, then we really don't need a trigger or indexed view after all (I originally thought that you might).



Since you can't modify the schema of the reporting table, would it at least be possible to create a related table to contain the pre-calculated hash (and UTC time of when it was calculated)? This would allow you to have a pre-calculated value to compare against next time, leaving only the incoming value that requires calculating the hash of. This would reduce the number of calls to either HASHBYTES or SQL#.Util_HashBinary by half. You would simply join to this table of hashes during the import process.



You would also create a separate stored procedure that simply refreshes the hashes of this table. It just updates the hashes of any related row that has changed to be current, and updates the timestamp for those modified rows. This proc can/should be executed at the end of any other process that updates this table. It can also be scheduled to run 30 - 60 minutes prior to this ETL starting (depending on how long it takes to execute, and when any of these other processes might run). It can even be executed manually if you ever suspect there might be rows that are out of sync.



ALSO: Paul White, in a comment on this answer, mentioned:




One downside of replacing HASHBYTES with a CLR scalar function - it appears that CLR functions cannot use batch mode whereas HASHBYTES can. That might be important, performance-wise.




So that is something to consider, and clearly requires testing.



ALSO: regardless of SQLCLR vs built-in HASHBYTES, I would still recommend converting directly to VARBINARY as that should be faster. Concatenating strings is just not terribly efficient. And, that's in addition to converting non-string values into strings in the first place, which requires extra effort (I assume the amount of effort varies based on the base type: DATETIME requiring more than BIGINT), whereas converting to VARBINARY simply gives you the underlying value (in most cases).






share|improve this answer


























  • Hi Solomon! Thanks for your answer. Assuming I can get everything installed I'll test this CLR on Monday. Do you mind sharing if your code calls into bcrypt at all? On the other point, due to how the application works I would need to use one of the following solutions: trigger, indexed view, or persisted computed column. If index maintenance could be done in a parallel zone that approach would be much more attractive to me. There are a lot of "local factors" at play which make storing the value difficult.

    – Joe Obbish
    yesterday











  • Hi @JoeObbish . re: using bcrypt, no not unless it is used behind the scenes by the managed .NET classes. Regarding the storage of pre-computed values: can you please clarify whether or not these reporting tables can be updated by anything outside of this ETL process? And if so, would it be a regularly scheduled thing (like another ETL process), or open-ended (more like an OLTP setup)? Also, might be worth testing a combination of both HASHBYTES and SQLCLR together to see if the thread limitation is specific to where the computation is actually happening.

    – Solomon Rutzky
    yesterday











  • Yes, the tables can be updated outside of the ETL process. Usually it would be something scheduled. A combination of HASHBYTES and CLR is an interesting idea, I'll include that as a benchmark. I don't have access to the test server right now so the earliest would be tomorrow.

    – Joe Obbish
    yesterday











  • @JoeObbish I'm in no rush ;-). Also, if the other scenarios that could update the table are controlled scenarios, then I believe my suggestion would still work. I had meant to mention that you can create a separate proc that simply refreshes the hashes of this table if you ever suspect they are out of sync. It just updates the hashes of anything that has changed to be current and updates the timestamp. You can have that proc executed at the end of any other process that updates this table, or can be schedule to run 30 - 60 minutes prior to this ETL starting, or manually, etc.

    – Solomon Rutzky
    yesterday











  • Also, I assume that there are at least 2 reporting tables you need to do this with, given that the schema noted in the question does not add up to 20,000 characters in a single row ;-)

    – Solomon Rutzky
    yesterday



















3














Since you're just looking for changes, you don't need a cryptographic hash function.



You could choose from one of the faster non-cryptographic hashes in the Data.HashFunction library by Brandon Dahler. SpookyHash is a popular choice.



Example implementation



Source Code



using Microsoft.SqlServer.Server;
using System.Data.HashFunction.SpookyHash;
using System.Data.SqlTypes;

public partial class UserDefinedFunctions
{
[SqlFunction
(
DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None,
IsDeterministic = true,
IsPrecise = true
)
]
public static byte SpookyHash
(
[SqlFacet (MaxSize = 8000)]
SqlBinary Input
)
{
ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
return sh.ComputeHash(Input.Value).Hash;
}

[SqlFunction
(
DataAccess = DataAccessKind.None,
IsDeterministic = true,
IsPrecise = true,
SystemDataAccess = SystemDataAccessKind.None
)
]
public static byte SpookyHashLOB
(
[SqlFacet (MaxSize = -1)]
SqlBinary Input
)
{
ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
return sh.ComputeHash(Input.Value).Hash;
}
}


The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version.



Pre-built code



You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:



https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300



T-SQL functions



CREATE FUNCTION dbo.SpookyHash
(
@Input varbinary(8000)
)
RETURNS binary(16)
WITH
RETURNS NULL ON NULL INPUT,
EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
GO
CREATE FUNCTION dbo.SpookyHashLOB
(
@Input varbinary(max)
)
RETURNS binary(16)
WITH
RETURNS NULL ON NULL INPUT,
EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
GO


Usage



An example use given the sample data in the question:



SELECT
HT1.ID
FROM dbo.HB_TBL AS HT1
JOIN dbo.HB_TBL_2 AS HT2
ON HT2.ID = HT1.ID
AND dbo.SpookyHash
(
CONVERT(binary(8), HT2.FK1) + 0x7C +
CONVERT(binary(8), HT2.FK2) + 0x7C +
CONVERT(binary(8), HT2.FK3) + 0x7C +
CONVERT(binary(8), HT2.FK4) + 0x7C +
CONVERT(binary(8), HT2.FK5) + 0x7C +
CONVERT(binary(8), HT2.FK6) + 0x7C +
CONVERT(binary(8), HT2.FK7) + 0x7C +
CONVERT(binary(8), HT2.FK8) + 0x7C +
CONVERT(binary(8), HT2.FK9) + 0x7C +
CONVERT(binary(8), HT2.FK10) + 0x7C +
CONVERT(binary(8), HT2.FK11) + 0x7C +
CONVERT(binary(8), HT2.FK12) + 0x7C +
CONVERT(binary(8), HT2.FK13) + 0x7C +
CONVERT(binary(8), HT2.FK14) + 0x7C +
CONVERT(binary(8), HT2.FK15) + 0x7C +
CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
CONVERT(binary(1), HT2.COMP1) + 0x7C +
CONVERT(binary(1), HT2.COMP2) + 0x7C +
CONVERT(binary(1), HT2.COMP3) + 0x7C +
CONVERT(binary(1), HT2.COMP4) + 0x7C +
CONVERT(binary(1), HT2.COMP5)
)
<> dbo.SpookyHash
(
CONVERT(binary(8), HT1.FK1) + 0x7C +
CONVERT(binary(8), HT1.FK2) + 0x7C +
CONVERT(binary(8), HT1.FK3) + 0x7C +
CONVERT(binary(8), HT1.FK4) + 0x7C +
CONVERT(binary(8), HT1.FK5) + 0x7C +
CONVERT(binary(8), HT1.FK6) + 0x7C +
CONVERT(binary(8), HT1.FK7) + 0x7C +
CONVERT(binary(8), HT1.FK8) + 0x7C +
CONVERT(binary(8), HT1.FK9) + 0x7C +
CONVERT(binary(8), HT1.FK10) + 0x7C +
CONVERT(binary(8), HT1.FK11) + 0x7C +
CONVERT(binary(8), HT1.FK12) + 0x7C +
CONVERT(binary(8), HT1.FK13) + 0x7C +
CONVERT(binary(8), HT1.FK14) + 0x7C +
CONVERT(binary(8), HT1.FK15) + 0x7C +
CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
CONVERT(binary(1), HT1.COMP1) + 0x7C +
CONVERT(binary(1), HT1.COMP2) + 0x7C +
CONVERT(binary(1), HT1.COMP3) + 0x7C +
CONVERT(binary(1), HT1.COMP4) + 0x7C +
CONVERT(binary(1), HT1.COMP5)
);


When using the LOB version, the first parameter should be cast or converted to varbinary(max).



Execution plan



plan






share|improve this answer























    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "182"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f228789%2fwhat-is-a-scalable-way-to-simulate-hashbytes-using-a-sql-clr-scalar-function%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    7














    I'm not sure if parallelism will be any / significantly better with SQLCLR. However, it is really easy to test since there is a hash function in the Free version of the SQL# SQLCLR library (which I wrote) called Util_HashBinary. Supported algorithms are: MD5, SHA1, SHA256, SHA384, and SHA512.



    It takes a VARBINARY(MAX) value as input, so you can either concatenate the string version of each field (as you are currently doing) and then convert to VARBINARY(MAX), or you can go directly to VARBINARY for each column and concatenate the converted values (this might be faster since you aren't dealing with strings or the extra conversion from string to VARBINARY). Below is an example showing both of these options. It also shows the HASHBYTES function so you can see that the values are the same between it and SQL#.Util_HashBinary.



    Please note that the hash results when concatenating the VARBINARY values won't match the hash results when concatenating the NVARCHAR values. This is because the binary form of the INT value "1" is 0x00000001, while the UTF-16LE (i.e. NVARCHAR) form of the INT value of "1" (in binary form since that is what a hashing function will operate on) is 0x3100.



    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [SQLCLR-ConcatStrings],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [BuiltIn-ConcatStrings]
    FROM sys.objects so;


    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [SQLCLR-ConcatVarBinaries],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [BuiltIn-ConcatVarBinaries]
    FROM sys.objects so;


    Beyond that aspect of the question, there are some additional thoughts that might help this process that are not related to SQLCLR. You mentioned a few things:






    1. we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.




      and:





    2. I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns




      and:




    3. the tables can be updated outside of the ETL process




    It sounds like the data in this reporting table is stable for a period of time, and is only modified by this ETL process.



    If nothing else modifies this table, then we really don't need a trigger or indexed view after all (I originally thought that you might).



    Since you can't modify the schema of the reporting table, would it at least be possible to create a related table to contain the pre-calculated hash (and UTC time of when it was calculated)? This would allow you to have a pre-calculated value to compare against next time, leaving only the incoming value that requires calculating the hash of. This would reduce the number of calls to either HASHBYTES or SQL#.Util_HashBinary by half. You would simply join to this table of hashes during the import process.



    You would also create a separate stored procedure that simply refreshes the hashes of this table. It just updates the hashes of any related row that has changed to be current, and updates the timestamp for those modified rows. This proc can/should be executed at the end of any other process that updates this table. It can also be scheduled to run 30 - 60 minutes prior to this ETL starting (depending on how long it takes to execute, and when any of these other processes might run). It can even be executed manually if you ever suspect there might be rows that are out of sync.



    ALSO: Paul White, in a comment on this answer, mentioned:




    One downside of replacing HASHBYTES with a CLR scalar function - it appears that CLR functions cannot use batch mode whereas HASHBYTES can. That might be important, performance-wise.




    So that is something to consider, and clearly requires testing.



    ALSO: regardless of SQLCLR vs built-in HASHBYTES, I would still recommend converting directly to VARBINARY as that should be faster. Concatenating strings is just not terribly efficient. And, that's in addition to converting non-string values into strings in the first place, which requires extra effort (I assume the amount of effort varies based on the base type: DATETIME requiring more than BIGINT), whereas converting to VARBINARY simply gives you the underlying value (in most cases).






    share|improve this answer


























    • Hi Solomon! Thanks for your answer. Assuming I can get everything installed I'll test this CLR on Monday. Do you mind sharing if your code calls into bcrypt at all? On the other point, due to how the application works I would need to use one of the following solutions: trigger, indexed view, or persisted computed column. If index maintenance could be done in a parallel zone that approach would be much more attractive to me. There are a lot of "local factors" at play which make storing the value difficult.

      – Joe Obbish
      yesterday











    • Hi @JoeObbish . re: using bcrypt, no not unless it is used behind the scenes by the managed .NET classes. Regarding the storage of pre-computed values: can you please clarify whether or not these reporting tables can be updated by anything outside of this ETL process? And if so, would it be a regularly scheduled thing (like another ETL process), or open-ended (more like an OLTP setup)? Also, might be worth testing a combination of both HASHBYTES and SQLCLR together to see if the thread limitation is specific to where the computation is actually happening.

      – Solomon Rutzky
      yesterday











    • Yes, the tables can be updated outside of the ETL process. Usually it would be something scheduled. A combination of HASHBYTES and CLR is an interesting idea, I'll include that as a benchmark. I don't have access to the test server right now so the earliest would be tomorrow.

      – Joe Obbish
      yesterday











    • @JoeObbish I'm in no rush ;-). Also, if the other scenarios that could update the table are controlled scenarios, then I believe my suggestion would still work. I had meant to mention that you can create a separate proc that simply refreshes the hashes of this table if you ever suspect they are out of sync. It just updates the hashes of anything that has changed to be current and updates the timestamp. You can have that proc executed at the end of any other process that updates this table, or can be schedule to run 30 - 60 minutes prior to this ETL starting, or manually, etc.

      – Solomon Rutzky
      yesterday











    • Also, I assume that there are at least 2 reporting tables you need to do this with, given that the schema noted in the question does not add up to 20,000 characters in a single row ;-)

      – Solomon Rutzky
      yesterday
















    7














    I'm not sure if parallelism will be any / significantly better with SQLCLR. However, it is really easy to test since there is a hash function in the Free version of the SQL# SQLCLR library (which I wrote) called Util_HashBinary. Supported algorithms are: MD5, SHA1, SHA256, SHA384, and SHA512.



    It takes a VARBINARY(MAX) value as input, so you can either concatenate the string version of each field (as you are currently doing) and then convert to VARBINARY(MAX), or you can go directly to VARBINARY for each column and concatenate the converted values (this might be faster since you aren't dealing with strings or the extra conversion from string to VARBINARY). Below is an example showing both of these options. It also shows the HASHBYTES function so you can see that the values are the same between it and SQL#.Util_HashBinary.



    Please note that the hash results when concatenating the VARBINARY values won't match the hash results when concatenating the NVARCHAR values. This is because the binary form of the INT value "1" is 0x00000001, while the UTF-16LE (i.e. NVARCHAR) form of the INT value of "1" (in binary form since that is what a hashing function will operate on) is 0x3100.



    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [SQLCLR-ConcatStrings],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [BuiltIn-ConcatStrings]
    FROM sys.objects so;


    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [SQLCLR-ConcatVarBinaries],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [BuiltIn-ConcatVarBinaries]
    FROM sys.objects so;


    Beyond that aspect of the question, there are some additional thoughts that might help this process that are not related to SQLCLR. You mentioned a few things:






    1. we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.




      and:





    2. I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns




      and:




    3. the tables can be updated outside of the ETL process




    It sounds like the data in this reporting table is stable for a period of time, and is only modified by this ETL process.



    If nothing else modifies this table, then we really don't need a trigger or indexed view after all (I originally thought that you might).



    Since you can't modify the schema of the reporting table, would it at least be possible to create a related table to contain the pre-calculated hash (and UTC time of when it was calculated)? This would allow you to have a pre-calculated value to compare against next time, leaving only the incoming value that requires calculating the hash of. This would reduce the number of calls to either HASHBYTES or SQL#.Util_HashBinary by half. You would simply join to this table of hashes during the import process.



    You would also create a separate stored procedure that simply refreshes the hashes of this table. It just updates the hashes of any related row that has changed to be current, and updates the timestamp for those modified rows. This proc can/should be executed at the end of any other process that updates this table. It can also be scheduled to run 30 - 60 minutes prior to this ETL starting (depending on how long it takes to execute, and when any of these other processes might run). It can even be executed manually if you ever suspect there might be rows that are out of sync.



    ALSO: Paul White, in a comment on this answer, mentioned:




    One downside of replacing HASHBYTES with a CLR scalar function - it appears that CLR functions cannot use batch mode whereas HASHBYTES can. That might be important, performance-wise.




    So that is something to consider, and clearly requires testing.



    ALSO: regardless of SQLCLR vs built-in HASHBYTES, I would still recommend converting directly to VARBINARY as that should be faster. Concatenating strings is just not terribly efficient. And, that's in addition to converting non-string values into strings in the first place, which requires extra effort (I assume the amount of effort varies based on the base type: DATETIME requiring more than BIGINT), whereas converting to VARBINARY simply gives you the underlying value (in most cases).






    share|improve this answer


























    • Hi Solomon! Thanks for your answer. Assuming I can get everything installed I'll test this CLR on Monday. Do you mind sharing if your code calls into bcrypt at all? On the other point, due to how the application works I would need to use one of the following solutions: trigger, indexed view, or persisted computed column. If index maintenance could be done in a parallel zone that approach would be much more attractive to me. There are a lot of "local factors" at play which make storing the value difficult.

      – Joe Obbish
      yesterday











    • Hi @JoeObbish . re: using bcrypt, no not unless it is used behind the scenes by the managed .NET classes. Regarding the storage of pre-computed values: can you please clarify whether or not these reporting tables can be updated by anything outside of this ETL process? And if so, would it be a regularly scheduled thing (like another ETL process), or open-ended (more like an OLTP setup)? Also, might be worth testing a combination of both HASHBYTES and SQLCLR together to see if the thread limitation is specific to where the computation is actually happening.

      – Solomon Rutzky
      yesterday











    • Yes, the tables can be updated outside of the ETL process. Usually it would be something scheduled. A combination of HASHBYTES and CLR is an interesting idea, I'll include that as a benchmark. I don't have access to the test server right now so the earliest would be tomorrow.

      – Joe Obbish
      yesterday











    • @JoeObbish I'm in no rush ;-). Also, if the other scenarios that could update the table are controlled scenarios, then I believe my suggestion would still work. I had meant to mention that you can create a separate proc that simply refreshes the hashes of this table if you ever suspect they are out of sync. It just updates the hashes of anything that has changed to be current and updates the timestamp. You can have that proc executed at the end of any other process that updates this table, or can be schedule to run 30 - 60 minutes prior to this ETL starting, or manually, etc.

      – Solomon Rutzky
      yesterday











    • Also, I assume that there are at least 2 reporting tables you need to do this with, given that the schema noted in the question does not add up to 20,000 characters in a single row ;-)

      – Solomon Rutzky
      yesterday














    7












    7








    7







    I'm not sure if parallelism will be any / significantly better with SQLCLR. However, it is really easy to test since there is a hash function in the Free version of the SQL# SQLCLR library (which I wrote) called Util_HashBinary. Supported algorithms are: MD5, SHA1, SHA256, SHA384, and SHA512.



    It takes a VARBINARY(MAX) value as input, so you can either concatenate the string version of each field (as you are currently doing) and then convert to VARBINARY(MAX), or you can go directly to VARBINARY for each column and concatenate the converted values (this might be faster since you aren't dealing with strings or the extra conversion from string to VARBINARY). Below is an example showing both of these options. It also shows the HASHBYTES function so you can see that the values are the same between it and SQL#.Util_HashBinary.



    Please note that the hash results when concatenating the VARBINARY values won't match the hash results when concatenating the NVARCHAR values. This is because the binary form of the INT value "1" is 0x00000001, while the UTF-16LE (i.e. NVARCHAR) form of the INT value of "1" (in binary form since that is what a hashing function will operate on) is 0x3100.



    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [SQLCLR-ConcatStrings],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [BuiltIn-ConcatStrings]
    FROM sys.objects so;


    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [SQLCLR-ConcatVarBinaries],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [BuiltIn-ConcatVarBinaries]
    FROM sys.objects so;


    Beyond that aspect of the question, there are some additional thoughts that might help this process that are not related to SQLCLR. You mentioned a few things:






    1. we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.




      and:





    2. I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns




      and:




    3. the tables can be updated outside of the ETL process




    It sounds like the data in this reporting table is stable for a period of time, and is only modified by this ETL process.



    If nothing else modifies this table, then we really don't need a trigger or indexed view after all (I originally thought that you might).



    Since you can't modify the schema of the reporting table, would it at least be possible to create a related table to contain the pre-calculated hash (and UTC time of when it was calculated)? This would allow you to have a pre-calculated value to compare against next time, leaving only the incoming value that requires calculating the hash of. This would reduce the number of calls to either HASHBYTES or SQL#.Util_HashBinary by half. You would simply join to this table of hashes during the import process.



    You would also create a separate stored procedure that simply refreshes the hashes of this table. It just updates the hashes of any related row that has changed to be current, and updates the timestamp for those modified rows. This proc can/should be executed at the end of any other process that updates this table. It can also be scheduled to run 30 - 60 minutes prior to this ETL starting (depending on how long it takes to execute, and when any of these other processes might run). It can even be executed manually if you ever suspect there might be rows that are out of sync.



    ALSO: Paul White, in a comment on this answer, mentioned:




    One downside of replacing HASHBYTES with a CLR scalar function - it appears that CLR functions cannot use batch mode whereas HASHBYTES can. That might be important, performance-wise.




    So that is something to consider, and clearly requires testing.



    ALSO: regardless of SQLCLR vs built-in HASHBYTES, I would still recommend converting directly to VARBINARY as that should be faster. Concatenating strings is just not terribly efficient. And, that's in addition to converting non-string values into strings in the first place, which requires extra effort (I assume the amount of effort varies based on the base type: DATETIME requiring more than BIGINT), whereas converting to VARBINARY simply gives you the underlying value (in most cases).






    share|improve this answer















    I'm not sure if parallelism will be any / significantly better with SQLCLR. However, it is really easy to test since there is a hash function in the Free version of the SQL# SQLCLR library (which I wrote) called Util_HashBinary. Supported algorithms are: MD5, SHA1, SHA256, SHA384, and SHA512.



    It takes a VARBINARY(MAX) value as input, so you can either concatenate the string version of each field (as you are currently doing) and then convert to VARBINARY(MAX), or you can go directly to VARBINARY for each column and concatenate the converted values (this might be faster since you aren't dealing with strings or the extra conversion from string to VARBINARY). Below is an example showing both of these options. It also shows the HASHBYTES function so you can see that the values are the same between it and SQL#.Util_HashBinary.



    Please note that the hash results when concatenating the VARBINARY values won't match the hash results when concatenating the NVARCHAR values. This is because the binary form of the INT value "1" is 0x00000001, while the UTF-16LE (i.e. NVARCHAR) form of the INT value of "1" (in binary form since that is what a hashing function will operate on) is 0x3100.



    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [SQLCLR-ConcatStrings],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(MAX),
    CONCAT(so.[name], so.[schema_id], so.[create_date])
    )
    ) AS [BuiltIn-ConcatStrings]
    FROM sys.objects so;


    SELECT so.[object_id],
    SQL#.Util_HashBinary(N'SHA256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [SQLCLR-ConcatVarBinaries],
    HASHBYTES(N'SHA2_256',
    CONVERT(VARBINARY(500), so.[name]) +
    CONVERT(VARBINARY(500), so.[schema_id]) +
    CONVERT(VARBINARY(500), so.[create_date])
    ) AS [BuiltIn-ConcatVarBinaries]
    FROM sys.objects so;


    Beyond that aspect of the question, there are some additional thoughts that might help this process that are not related to SQLCLR. You mentioned a few things:






    1. we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.




      and:





    2. I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns




      and:




    3. the tables can be updated outside of the ETL process




    It sounds like the data in this reporting table is stable for a period of time, and is only modified by this ETL process.



    If nothing else modifies this table, then we really don't need a trigger or indexed view after all (I originally thought that you might).



    Since you can't modify the schema of the reporting table, would it at least be possible to create a related table to contain the pre-calculated hash (and UTC time of when it was calculated)? This would allow you to have a pre-calculated value to compare against next time, leaving only the incoming value that requires calculating the hash of. This would reduce the number of calls to either HASHBYTES or SQL#.Util_HashBinary by half. You would simply join to this table of hashes during the import process.



    You would also create a separate stored procedure that simply refreshes the hashes of this table. It just updates the hashes of any related row that has changed to be current, and updates the timestamp for those modified rows. This proc can/should be executed at the end of any other process that updates this table. It can also be scheduled to run 30 - 60 minutes prior to this ETL starting (depending on how long it takes to execute, and when any of these other processes might run). It can even be executed manually if you ever suspect there might be rows that are out of sync.



    ALSO: Paul White, in a comment on this answer, mentioned:




    One downside of replacing HASHBYTES with a CLR scalar function - it appears that CLR functions cannot use batch mode whereas HASHBYTES can. That might be important, performance-wise.




    So that is something to consider, and clearly requires testing.



    ALSO: regardless of SQLCLR vs built-in HASHBYTES, I would still recommend converting directly to VARBINARY as that should be faster. Concatenating strings is just not terribly efficient. And, that's in addition to converting non-string values into strings in the first place, which requires extra effort (I assume the amount of effort varies based on the base type: DATETIME requiring more than BIGINT), whereas converting to VARBINARY simply gives you the underlying value (in most cases).







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited yesterday

























    answered yesterday









    Solomon RutzkySolomon Rutzky

    48.1k579174




    48.1k579174













    • Hi Solomon! Thanks for your answer. Assuming I can get everything installed I'll test this CLR on Monday. Do you mind sharing if your code calls into bcrypt at all? On the other point, due to how the application works I would need to use one of the following solutions: trigger, indexed view, or persisted computed column. If index maintenance could be done in a parallel zone that approach would be much more attractive to me. There are a lot of "local factors" at play which make storing the value difficult.

      – Joe Obbish
      yesterday











    • Hi @JoeObbish . re: using bcrypt, no not unless it is used behind the scenes by the managed .NET classes. Regarding the storage of pre-computed values: can you please clarify whether or not these reporting tables can be updated by anything outside of this ETL process? And if so, would it be a regularly scheduled thing (like another ETL process), or open-ended (more like an OLTP setup)? Also, might be worth testing a combination of both HASHBYTES and SQLCLR together to see if the thread limitation is specific to where the computation is actually happening.

      – Solomon Rutzky
      yesterday











    • Yes, the tables can be updated outside of the ETL process. Usually it would be something scheduled. A combination of HASHBYTES and CLR is an interesting idea, I'll include that as a benchmark. I don't have access to the test server right now so the earliest would be tomorrow.

      – Joe Obbish
      yesterday











    • @JoeObbish I'm in no rush ;-). Also, if the other scenarios that could update the table are controlled scenarios, then I believe my suggestion would still work. I had meant to mention that you can create a separate proc that simply refreshes the hashes of this table if you ever suspect they are out of sync. It just updates the hashes of anything that has changed to be current and updates the timestamp. You can have that proc executed at the end of any other process that updates this table, or can be schedule to run 30 - 60 minutes prior to this ETL starting, or manually, etc.

      – Solomon Rutzky
      yesterday











    • Also, I assume that there are at least 2 reporting tables you need to do this with, given that the schema noted in the question does not add up to 20,000 characters in a single row ;-)

      – Solomon Rutzky
      yesterday



















    • Hi Solomon! Thanks for your answer. Assuming I can get everything installed I'll test this CLR on Monday. Do you mind sharing if your code calls into bcrypt at all? On the other point, due to how the application works I would need to use one of the following solutions: trigger, indexed view, or persisted computed column. If index maintenance could be done in a parallel zone that approach would be much more attractive to me. There are a lot of "local factors" at play which make storing the value difficult.

      – Joe Obbish
      yesterday











    • Hi @JoeObbish . re: using bcrypt, no not unless it is used behind the scenes by the managed .NET classes. Regarding the storage of pre-computed values: can you please clarify whether or not these reporting tables can be updated by anything outside of this ETL process? And if so, would it be a regularly scheduled thing (like another ETL process), or open-ended (more like an OLTP setup)? Also, might be worth testing a combination of both HASHBYTES and SQLCLR together to see if the thread limitation is specific to where the computation is actually happening.

      – Solomon Rutzky
      yesterday











    • Yes, the tables can be updated outside of the ETL process. Usually it would be something scheduled. A combination of HASHBYTES and CLR is an interesting idea, I'll include that as a benchmark. I don't have access to the test server right now so the earliest would be tomorrow.

      – Joe Obbish
      yesterday











    • @JoeObbish I'm in no rush ;-). Also, if the other scenarios that could update the table are controlled scenarios, then I believe my suggestion would still work. I had meant to mention that you can create a separate proc that simply refreshes the hashes of this table if you ever suspect they are out of sync. It just updates the hashes of anything that has changed to be current and updates the timestamp. You can have that proc executed at the end of any other process that updates this table, or can be schedule to run 30 - 60 minutes prior to this ETL starting, or manually, etc.

      – Solomon Rutzky
      yesterday











    • Also, I assume that there are at least 2 reporting tables you need to do this with, given that the schema noted in the question does not add up to 20,000 characters in a single row ;-)

      – Solomon Rutzky
      yesterday

















    Hi Solomon! Thanks for your answer. Assuming I can get everything installed I'll test this CLR on Monday. Do you mind sharing if your code calls into bcrypt at all? On the other point, due to how the application works I would need to use one of the following solutions: trigger, indexed view, or persisted computed column. If index maintenance could be done in a parallel zone that approach would be much more attractive to me. There are a lot of "local factors" at play which make storing the value difficult.

    – Joe Obbish
    yesterday





    Hi Solomon! Thanks for your answer. Assuming I can get everything installed I'll test this CLR on Monday. Do you mind sharing if your code calls into bcrypt at all? On the other point, due to how the application works I would need to use one of the following solutions: trigger, indexed view, or persisted computed column. If index maintenance could be done in a parallel zone that approach would be much more attractive to me. There are a lot of "local factors" at play which make storing the value difficult.

    – Joe Obbish
    yesterday













    Hi @JoeObbish . re: using bcrypt, no not unless it is used behind the scenes by the managed .NET classes. Regarding the storage of pre-computed values: can you please clarify whether or not these reporting tables can be updated by anything outside of this ETL process? And if so, would it be a regularly scheduled thing (like another ETL process), or open-ended (more like an OLTP setup)? Also, might be worth testing a combination of both HASHBYTES and SQLCLR together to see if the thread limitation is specific to where the computation is actually happening.

    – Solomon Rutzky
    yesterday





    Hi @JoeObbish . re: using bcrypt, no not unless it is used behind the scenes by the managed .NET classes. Regarding the storage of pre-computed values: can you please clarify whether or not these reporting tables can be updated by anything outside of this ETL process? And if so, would it be a regularly scheduled thing (like another ETL process), or open-ended (more like an OLTP setup)? Also, might be worth testing a combination of both HASHBYTES and SQLCLR together to see if the thread limitation is specific to where the computation is actually happening.

    – Solomon Rutzky
    yesterday













    Yes, the tables can be updated outside of the ETL process. Usually it would be something scheduled. A combination of HASHBYTES and CLR is an interesting idea, I'll include that as a benchmark. I don't have access to the test server right now so the earliest would be tomorrow.

    – Joe Obbish
    yesterday





    Yes, the tables can be updated outside of the ETL process. Usually it would be something scheduled. A combination of HASHBYTES and CLR is an interesting idea, I'll include that as a benchmark. I don't have access to the test server right now so the earliest would be tomorrow.

    – Joe Obbish
    yesterday













    @JoeObbish I'm in no rush ;-). Also, if the other scenarios that could update the table are controlled scenarios, then I believe my suggestion would still work. I had meant to mention that you can create a separate proc that simply refreshes the hashes of this table if you ever suspect they are out of sync. It just updates the hashes of anything that has changed to be current and updates the timestamp. You can have that proc executed at the end of any other process that updates this table, or can be schedule to run 30 - 60 minutes prior to this ETL starting, or manually, etc.

    – Solomon Rutzky
    yesterday





    @JoeObbish I'm in no rush ;-). Also, if the other scenarios that could update the table are controlled scenarios, then I believe my suggestion would still work. I had meant to mention that you can create a separate proc that simply refreshes the hashes of this table if you ever suspect they are out of sync. It just updates the hashes of anything that has changed to be current and updates the timestamp. You can have that proc executed at the end of any other process that updates this table, or can be schedule to run 30 - 60 minutes prior to this ETL starting, or manually, etc.

    – Solomon Rutzky
    yesterday













    Also, I assume that there are at least 2 reporting tables you need to do this with, given that the schema noted in the question does not add up to 20,000 characters in a single row ;-)

    – Solomon Rutzky
    yesterday





    Also, I assume that there are at least 2 reporting tables you need to do this with, given that the schema noted in the question does not add up to 20,000 characters in a single row ;-)

    – Solomon Rutzky
    yesterday













    3














    Since you're just looking for changes, you don't need a cryptographic hash function.



    You could choose from one of the faster non-cryptographic hashes in the Data.HashFunction library by Brandon Dahler. SpookyHash is a popular choice.



    Example implementation



    Source Code



    using Microsoft.SqlServer.Server;
    using System.Data.HashFunction.SpookyHash;
    using System.Data.SqlTypes;

    public partial class UserDefinedFunctions
    {
    [SqlFunction
    (
    DataAccess = DataAccessKind.None,
    SystemDataAccess = SystemDataAccessKind.None,
    IsDeterministic = true,
    IsPrecise = true
    )
    ]
    public static byte SpookyHash
    (
    [SqlFacet (MaxSize = 8000)]
    SqlBinary Input
    )
    {
    ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
    return sh.ComputeHash(Input.Value).Hash;
    }

    [SqlFunction
    (
    DataAccess = DataAccessKind.None,
    IsDeterministic = true,
    IsPrecise = true,
    SystemDataAccess = SystemDataAccessKind.None
    )
    ]
    public static byte SpookyHashLOB
    (
    [SqlFacet (MaxSize = -1)]
    SqlBinary Input
    )
    {
    ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
    return sh.ComputeHash(Input.Value).Hash;
    }
    }


    The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version.



    Pre-built code



    You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:



    https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300



    T-SQL functions



    CREATE FUNCTION dbo.SpookyHash
    (
    @Input varbinary(8000)
    )
    RETURNS binary(16)
    WITH
    RETURNS NULL ON NULL INPUT,
    EXECUTE AS OWNER
    AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
    GO
    CREATE FUNCTION dbo.SpookyHashLOB
    (
    @Input varbinary(max)
    )
    RETURNS binary(16)
    WITH
    RETURNS NULL ON NULL INPUT,
    EXECUTE AS OWNER
    AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
    GO


    Usage



    An example use given the sample data in the question:



    SELECT
    HT1.ID
    FROM dbo.HB_TBL AS HT1
    JOIN dbo.HB_TBL_2 AS HT2
    ON HT2.ID = HT1.ID
    AND dbo.SpookyHash
    (
    CONVERT(binary(8), HT2.FK1) + 0x7C +
    CONVERT(binary(8), HT2.FK2) + 0x7C +
    CONVERT(binary(8), HT2.FK3) + 0x7C +
    CONVERT(binary(8), HT2.FK4) + 0x7C +
    CONVERT(binary(8), HT2.FK5) + 0x7C +
    CONVERT(binary(8), HT2.FK6) + 0x7C +
    CONVERT(binary(8), HT2.FK7) + 0x7C +
    CONVERT(binary(8), HT2.FK8) + 0x7C +
    CONVERT(binary(8), HT2.FK9) + 0x7C +
    CONVERT(binary(8), HT2.FK10) + 0x7C +
    CONVERT(binary(8), HT2.FK11) + 0x7C +
    CONVERT(binary(8), HT2.FK12) + 0x7C +
    CONVERT(binary(8), HT2.FK13) + 0x7C +
    CONVERT(binary(8), HT2.FK14) + 0x7C +
    CONVERT(binary(8), HT2.FK15) + 0x7C +
    CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
    CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
    CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
    CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
    CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
    CONVERT(binary(1), HT2.COMP1) + 0x7C +
    CONVERT(binary(1), HT2.COMP2) + 0x7C +
    CONVERT(binary(1), HT2.COMP3) + 0x7C +
    CONVERT(binary(1), HT2.COMP4) + 0x7C +
    CONVERT(binary(1), HT2.COMP5)
    )
    <> dbo.SpookyHash
    (
    CONVERT(binary(8), HT1.FK1) + 0x7C +
    CONVERT(binary(8), HT1.FK2) + 0x7C +
    CONVERT(binary(8), HT1.FK3) + 0x7C +
    CONVERT(binary(8), HT1.FK4) + 0x7C +
    CONVERT(binary(8), HT1.FK5) + 0x7C +
    CONVERT(binary(8), HT1.FK6) + 0x7C +
    CONVERT(binary(8), HT1.FK7) + 0x7C +
    CONVERT(binary(8), HT1.FK8) + 0x7C +
    CONVERT(binary(8), HT1.FK9) + 0x7C +
    CONVERT(binary(8), HT1.FK10) + 0x7C +
    CONVERT(binary(8), HT1.FK11) + 0x7C +
    CONVERT(binary(8), HT1.FK12) + 0x7C +
    CONVERT(binary(8), HT1.FK13) + 0x7C +
    CONVERT(binary(8), HT1.FK14) + 0x7C +
    CONVERT(binary(8), HT1.FK15) + 0x7C +
    CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
    CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
    CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
    CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
    CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
    CONVERT(binary(1), HT1.COMP1) + 0x7C +
    CONVERT(binary(1), HT1.COMP2) + 0x7C +
    CONVERT(binary(1), HT1.COMP3) + 0x7C +
    CONVERT(binary(1), HT1.COMP4) + 0x7C +
    CONVERT(binary(1), HT1.COMP5)
    );


    When using the LOB version, the first parameter should be cast or converted to varbinary(max).



    Execution plan



    plan






    share|improve this answer




























      3














      Since you're just looking for changes, you don't need a cryptographic hash function.



      You could choose from one of the faster non-cryptographic hashes in the Data.HashFunction library by Brandon Dahler. SpookyHash is a popular choice.



      Example implementation



      Source Code



      using Microsoft.SqlServer.Server;
      using System.Data.HashFunction.SpookyHash;
      using System.Data.SqlTypes;

      public partial class UserDefinedFunctions
      {
      [SqlFunction
      (
      DataAccess = DataAccessKind.None,
      SystemDataAccess = SystemDataAccessKind.None,
      IsDeterministic = true,
      IsPrecise = true
      )
      ]
      public static byte SpookyHash
      (
      [SqlFacet (MaxSize = 8000)]
      SqlBinary Input
      )
      {
      ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
      return sh.ComputeHash(Input.Value).Hash;
      }

      [SqlFunction
      (
      DataAccess = DataAccessKind.None,
      IsDeterministic = true,
      IsPrecise = true,
      SystemDataAccess = SystemDataAccessKind.None
      )
      ]
      public static byte SpookyHashLOB
      (
      [SqlFacet (MaxSize = -1)]
      SqlBinary Input
      )
      {
      ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
      return sh.ComputeHash(Input.Value).Hash;
      }
      }


      The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version.



      Pre-built code



      You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:



      https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300



      T-SQL functions



      CREATE FUNCTION dbo.SpookyHash
      (
      @Input varbinary(8000)
      )
      RETURNS binary(16)
      WITH
      RETURNS NULL ON NULL INPUT,
      EXECUTE AS OWNER
      AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
      GO
      CREATE FUNCTION dbo.SpookyHashLOB
      (
      @Input varbinary(max)
      )
      RETURNS binary(16)
      WITH
      RETURNS NULL ON NULL INPUT,
      EXECUTE AS OWNER
      AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
      GO


      Usage



      An example use given the sample data in the question:



      SELECT
      HT1.ID
      FROM dbo.HB_TBL AS HT1
      JOIN dbo.HB_TBL_2 AS HT2
      ON HT2.ID = HT1.ID
      AND dbo.SpookyHash
      (
      CONVERT(binary(8), HT2.FK1) + 0x7C +
      CONVERT(binary(8), HT2.FK2) + 0x7C +
      CONVERT(binary(8), HT2.FK3) + 0x7C +
      CONVERT(binary(8), HT2.FK4) + 0x7C +
      CONVERT(binary(8), HT2.FK5) + 0x7C +
      CONVERT(binary(8), HT2.FK6) + 0x7C +
      CONVERT(binary(8), HT2.FK7) + 0x7C +
      CONVERT(binary(8), HT2.FK8) + 0x7C +
      CONVERT(binary(8), HT2.FK9) + 0x7C +
      CONVERT(binary(8), HT2.FK10) + 0x7C +
      CONVERT(binary(8), HT2.FK11) + 0x7C +
      CONVERT(binary(8), HT2.FK12) + 0x7C +
      CONVERT(binary(8), HT2.FK13) + 0x7C +
      CONVERT(binary(8), HT2.FK14) + 0x7C +
      CONVERT(binary(8), HT2.FK15) + 0x7C +
      CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
      CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
      CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
      CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
      CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
      CONVERT(binary(1), HT2.COMP1) + 0x7C +
      CONVERT(binary(1), HT2.COMP2) + 0x7C +
      CONVERT(binary(1), HT2.COMP3) + 0x7C +
      CONVERT(binary(1), HT2.COMP4) + 0x7C +
      CONVERT(binary(1), HT2.COMP5)
      )
      <> dbo.SpookyHash
      (
      CONVERT(binary(8), HT1.FK1) + 0x7C +
      CONVERT(binary(8), HT1.FK2) + 0x7C +
      CONVERT(binary(8), HT1.FK3) + 0x7C +
      CONVERT(binary(8), HT1.FK4) + 0x7C +
      CONVERT(binary(8), HT1.FK5) + 0x7C +
      CONVERT(binary(8), HT1.FK6) + 0x7C +
      CONVERT(binary(8), HT1.FK7) + 0x7C +
      CONVERT(binary(8), HT1.FK8) + 0x7C +
      CONVERT(binary(8), HT1.FK9) + 0x7C +
      CONVERT(binary(8), HT1.FK10) + 0x7C +
      CONVERT(binary(8), HT1.FK11) + 0x7C +
      CONVERT(binary(8), HT1.FK12) + 0x7C +
      CONVERT(binary(8), HT1.FK13) + 0x7C +
      CONVERT(binary(8), HT1.FK14) + 0x7C +
      CONVERT(binary(8), HT1.FK15) + 0x7C +
      CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
      CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
      CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
      CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
      CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
      CONVERT(binary(1), HT1.COMP1) + 0x7C +
      CONVERT(binary(1), HT1.COMP2) + 0x7C +
      CONVERT(binary(1), HT1.COMP3) + 0x7C +
      CONVERT(binary(1), HT1.COMP4) + 0x7C +
      CONVERT(binary(1), HT1.COMP5)
      );


      When using the LOB version, the first parameter should be cast or converted to varbinary(max).



      Execution plan



      plan






      share|improve this answer


























        3












        3








        3







        Since you're just looking for changes, you don't need a cryptographic hash function.



        You could choose from one of the faster non-cryptographic hashes in the Data.HashFunction library by Brandon Dahler. SpookyHash is a popular choice.



        Example implementation



        Source Code



        using Microsoft.SqlServer.Server;
        using System.Data.HashFunction.SpookyHash;
        using System.Data.SqlTypes;

        public partial class UserDefinedFunctions
        {
        [SqlFunction
        (
        DataAccess = DataAccessKind.None,
        SystemDataAccess = SystemDataAccessKind.None,
        IsDeterministic = true,
        IsPrecise = true
        )
        ]
        public static byte SpookyHash
        (
        [SqlFacet (MaxSize = 8000)]
        SqlBinary Input
        )
        {
        ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
        return sh.ComputeHash(Input.Value).Hash;
        }

        [SqlFunction
        (
        DataAccess = DataAccessKind.None,
        IsDeterministic = true,
        IsPrecise = true,
        SystemDataAccess = SystemDataAccessKind.None
        )
        ]
        public static byte SpookyHashLOB
        (
        [SqlFacet (MaxSize = -1)]
        SqlBinary Input
        )
        {
        ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
        return sh.ComputeHash(Input.Value).Hash;
        }
        }


        The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version.



        Pre-built code



        You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:



        https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300



        T-SQL functions



        CREATE FUNCTION dbo.SpookyHash
        (
        @Input varbinary(8000)
        )
        RETURNS binary(16)
        WITH
        RETURNS NULL ON NULL INPUT,
        EXECUTE AS OWNER
        AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
        GO
        CREATE FUNCTION dbo.SpookyHashLOB
        (
        @Input varbinary(max)
        )
        RETURNS binary(16)
        WITH
        RETURNS NULL ON NULL INPUT,
        EXECUTE AS OWNER
        AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
        GO


        Usage



        An example use given the sample data in the question:



        SELECT
        HT1.ID
        FROM dbo.HB_TBL AS HT1
        JOIN dbo.HB_TBL_2 AS HT2
        ON HT2.ID = HT1.ID
        AND dbo.SpookyHash
        (
        CONVERT(binary(8), HT2.FK1) + 0x7C +
        CONVERT(binary(8), HT2.FK2) + 0x7C +
        CONVERT(binary(8), HT2.FK3) + 0x7C +
        CONVERT(binary(8), HT2.FK4) + 0x7C +
        CONVERT(binary(8), HT2.FK5) + 0x7C +
        CONVERT(binary(8), HT2.FK6) + 0x7C +
        CONVERT(binary(8), HT2.FK7) + 0x7C +
        CONVERT(binary(8), HT2.FK8) + 0x7C +
        CONVERT(binary(8), HT2.FK9) + 0x7C +
        CONVERT(binary(8), HT2.FK10) + 0x7C +
        CONVERT(binary(8), HT2.FK11) + 0x7C +
        CONVERT(binary(8), HT2.FK12) + 0x7C +
        CONVERT(binary(8), HT2.FK13) + 0x7C +
        CONVERT(binary(8), HT2.FK14) + 0x7C +
        CONVERT(binary(8), HT2.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
        CONVERT(binary(1), HT2.COMP1) + 0x7C +
        CONVERT(binary(1), HT2.COMP2) + 0x7C +
        CONVERT(binary(1), HT2.COMP3) + 0x7C +
        CONVERT(binary(1), HT2.COMP4) + 0x7C +
        CONVERT(binary(1), HT2.COMP5)
        )
        <> dbo.SpookyHash
        (
        CONVERT(binary(8), HT1.FK1) + 0x7C +
        CONVERT(binary(8), HT1.FK2) + 0x7C +
        CONVERT(binary(8), HT1.FK3) + 0x7C +
        CONVERT(binary(8), HT1.FK4) + 0x7C +
        CONVERT(binary(8), HT1.FK5) + 0x7C +
        CONVERT(binary(8), HT1.FK6) + 0x7C +
        CONVERT(binary(8), HT1.FK7) + 0x7C +
        CONVERT(binary(8), HT1.FK8) + 0x7C +
        CONVERT(binary(8), HT1.FK9) + 0x7C +
        CONVERT(binary(8), HT1.FK10) + 0x7C +
        CONVERT(binary(8), HT1.FK11) + 0x7C +
        CONVERT(binary(8), HT1.FK12) + 0x7C +
        CONVERT(binary(8), HT1.FK13) + 0x7C +
        CONVERT(binary(8), HT1.FK14) + 0x7C +
        CONVERT(binary(8), HT1.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
        CONVERT(binary(1), HT1.COMP1) + 0x7C +
        CONVERT(binary(1), HT1.COMP2) + 0x7C +
        CONVERT(binary(1), HT1.COMP3) + 0x7C +
        CONVERT(binary(1), HT1.COMP4) + 0x7C +
        CONVERT(binary(1), HT1.COMP5)
        );


        When using the LOB version, the first parameter should be cast or converted to varbinary(max).



        Execution plan



        plan






        share|improve this answer













        Since you're just looking for changes, you don't need a cryptographic hash function.



        You could choose from one of the faster non-cryptographic hashes in the Data.HashFunction library by Brandon Dahler. SpookyHash is a popular choice.



        Example implementation



        Source Code



        using Microsoft.SqlServer.Server;
        using System.Data.HashFunction.SpookyHash;
        using System.Data.SqlTypes;

        public partial class UserDefinedFunctions
        {
        [SqlFunction
        (
        DataAccess = DataAccessKind.None,
        SystemDataAccess = SystemDataAccessKind.None,
        IsDeterministic = true,
        IsPrecise = true
        )
        ]
        public static byte SpookyHash
        (
        [SqlFacet (MaxSize = 8000)]
        SqlBinary Input
        )
        {
        ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
        return sh.ComputeHash(Input.Value).Hash;
        }

        [SqlFunction
        (
        DataAccess = DataAccessKind.None,
        IsDeterministic = true,
        IsPrecise = true,
        SystemDataAccess = SystemDataAccessKind.None
        )
        ]
        public static byte SpookyHashLOB
        (
        [SqlFacet (MaxSize = -1)]
        SqlBinary Input
        )
        {
        ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
        return sh.ComputeHash(Input.Value).Hash;
        }
        }


        The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version.



        Pre-built code



        You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:



        https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300



        T-SQL functions



        CREATE FUNCTION dbo.SpookyHash
        (
        @Input varbinary(8000)
        )
        RETURNS binary(16)
        WITH
        RETURNS NULL ON NULL INPUT,
        EXECUTE AS OWNER
        AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
        GO
        CREATE FUNCTION dbo.SpookyHashLOB
        (
        @Input varbinary(max)
        )
        RETURNS binary(16)
        WITH
        RETURNS NULL ON NULL INPUT,
        EXECUTE AS OWNER
        AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
        GO


        Usage



        An example use given the sample data in the question:



        SELECT
        HT1.ID
        FROM dbo.HB_TBL AS HT1
        JOIN dbo.HB_TBL_2 AS HT2
        ON HT2.ID = HT1.ID
        AND dbo.SpookyHash
        (
        CONVERT(binary(8), HT2.FK1) + 0x7C +
        CONVERT(binary(8), HT2.FK2) + 0x7C +
        CONVERT(binary(8), HT2.FK3) + 0x7C +
        CONVERT(binary(8), HT2.FK4) + 0x7C +
        CONVERT(binary(8), HT2.FK5) + 0x7C +
        CONVERT(binary(8), HT2.FK6) + 0x7C +
        CONVERT(binary(8), HT2.FK7) + 0x7C +
        CONVERT(binary(8), HT2.FK8) + 0x7C +
        CONVERT(binary(8), HT2.FK9) + 0x7C +
        CONVERT(binary(8), HT2.FK10) + 0x7C +
        CONVERT(binary(8), HT2.FK11) + 0x7C +
        CONVERT(binary(8), HT2.FK12) + 0x7C +
        CONVERT(binary(8), HT2.FK13) + 0x7C +
        CONVERT(binary(8), HT2.FK14) + 0x7C +
        CONVERT(binary(8), HT2.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
        CONVERT(binary(1), HT2.COMP1) + 0x7C +
        CONVERT(binary(1), HT2.COMP2) + 0x7C +
        CONVERT(binary(1), HT2.COMP3) + 0x7C +
        CONVERT(binary(1), HT2.COMP4) + 0x7C +
        CONVERT(binary(1), HT2.COMP5)
        )
        <> dbo.SpookyHash
        (
        CONVERT(binary(8), HT1.FK1) + 0x7C +
        CONVERT(binary(8), HT1.FK2) + 0x7C +
        CONVERT(binary(8), HT1.FK3) + 0x7C +
        CONVERT(binary(8), HT1.FK4) + 0x7C +
        CONVERT(binary(8), HT1.FK5) + 0x7C +
        CONVERT(binary(8), HT1.FK6) + 0x7C +
        CONVERT(binary(8), HT1.FK7) + 0x7C +
        CONVERT(binary(8), HT1.FK8) + 0x7C +
        CONVERT(binary(8), HT1.FK9) + 0x7C +
        CONVERT(binary(8), HT1.FK10) + 0x7C +
        CONVERT(binary(8), HT1.FK11) + 0x7C +
        CONVERT(binary(8), HT1.FK12) + 0x7C +
        CONVERT(binary(8), HT1.FK13) + 0x7C +
        CONVERT(binary(8), HT1.FK14) + 0x7C +
        CONVERT(binary(8), HT1.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
        CONVERT(binary(1), HT1.COMP1) + 0x7C +
        CONVERT(binary(1), HT1.COMP2) + 0x7C +
        CONVERT(binary(1), HT1.COMP3) + 0x7C +
        CONVERT(binary(1), HT1.COMP4) + 0x7C +
        CONVERT(binary(1), HT1.COMP5)
        );


        When using the LOB version, the first parameter should be cast or converted to varbinary(max).



        Execution plan



        plan







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered 10 hours ago









        Paul WhitePaul White

        50.8k14277447




        50.8k14277447






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Database Administrators Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f228789%2fwhat-is-a-scalable-way-to-simulate-hashbytes-using-a-sql-clr-scalar-function%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown