T-SQL (SS2K8) :: Compare Two Varchar For Similarities?
Oct 28, 2014
I have been asked to compare the address fields (three nvarchar(100) columns) of a customer database (around 10,000 records) and find any duplicates. If it were a character-by-character match, I could simply have used GROUP BY to get the result.
But I am expected to produce a list of similar addresses, where the people who entered them may have used slightly different spelling, more or fewer characters, or a "." here and there.
I am trying to write a function that compares the characters of two strings, eliminates the ones they share, and returns the number of differences between them.
The function should return the larger of the two difference counts. Also, if a character is repeated in one string but appears only once in the other, it should be eliminated only once.
The example below should make this clearer.
--Start
declare @string1 as varchar(50)='imos'
declare @string2 as varchar(50)='nasos';
WITH n (n) AS (
SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)
[Code] ....
The first string has 2 characters that are not in the second one (i, m), while the second string has 3 characters that are not in the first one (n, a, s). So the function should return 3 for the example above.
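The counting rule described above can be sketched outside SQL to check the expected answer. This is a minimal Python illustration, not the poster's T-SQL function: it treats each string as a multiset of characters, and the name `char_difference` is made up for the example.

```python
from collections import Counter


def char_difference(a: str, b: str) -> int:
    """Return the larger count of unmatched characters between two strings."""
    count_a, count_b = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts, so a repeated
    # character is cancelled once per occurrence in the other string.
    only_in_a = sum((count_a - count_b).values())
    only_in_b = sum((count_b - count_a).values())
    return max(only_in_a, only_in_b)


print(char_difference("imos", "nasos"))  # 3
```

For 'imos' and 'nasos' the unmatched characters are i, m on one side and n, a, s on the other, so the larger count is 3.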
create table #myfirst (id int, city varchar(20))
insert into #myfirst values (500,'Newyork')
insert into #myfirst values (100,'Ediosn')
insert into #myfirst values (200,'Atlanta')
insert into #myfirst values (300,'Greenwoods')
insert into #myfirst values (400,'Hitchcok')
insert into #myfirst values (700,'Walmart')
insert into #myfirst values (800,'Madida')
-- My Second Data
create table #mySecond (id int, city varchar(20),Sector varchar(2))
insert into #mySecond values (1500,'Newyork','MK')
insert into #mySecond values (5500,'Ediosn','HH')
insert into #mySecond values (5060,'The Atlanta','JK')
insert into #mySecond values (7500,'The Greenwoods','DF')
insert into #mySecond values (9500,'Metro','KK')
insert into #mySecond values (3300,'Kilapr','MK')
insert into #mySecond values (9500,'Metro','NH')
-- My Third Data
create table #myThird (id int, city varchar(20),Sector varchar(2))
insert into #myThird values (33,'Walmart','PP')
insert into #myThird values (20,'Ediosn','DD')
select f.*,s.Sector from #myfirst f join #mySecond s on f.city = s.city
/*
id   city     Sector
500  Newyork  MK
100  Ediosn   HH
*/
I have two questions:
1) How can I compare the city names between the first and second tables while ignoring 'The ' at the beginning (if present in the second table's city)?
2) After comparing the first and second tables, for rows with no match in the second table I want to compare against the third table's values.
-- I tried the query below to solve the first question. It works, but I would like to know whether there are other ways to do it.
select f.*,s.Sector from #myfirst f join #mySecond s on replace (f.city, 'THE ','')= replace (s.city, 'THE ','')
--Expected results will be
create table #ExpectResults (id int, city varchar(20),Sector varchar(2))
insert into #ExpectResults values (200,'Atlanta','JK')
insert into #ExpectResults values (100,'Ediosn','HH')
insert into #ExpectResults values (300,'Greenwoods','DF')
insert into #ExpectResults values (500,'Newyork','MK')
insert into #ExpectResults values (700, 'Walmart','PP')
insert into #ExpectResults values (800, 'Madidar','')
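The two-step lookup asked about above (match on the city name without a leading 'The ', then fall back to the third table) can be sketched in Python to make the intended logic explicit. The helper names `normalize` and `match_sectors` are made up for the example. The sketch assumes that unmatched rows are kept with an empty sector, which may not be exactly what the poster wants, since the expected results above omit 'Hitchcok'. Unlike `replace(city, 'THE ', '')`, it strips the prefix only at the start of the name.

```python
def normalize(city: str) -> str:
    """Lower-case the name and drop a leading 'The ' if present."""
    name = city.strip()
    if name.lower().startswith("the "):
        name = name[4:]
    return name.lower()


def match_sectors(first, second, third):
    """Look each city up in `second`, then fall back to `third`."""
    second_map, third_map = {}, {}
    for _id, city, sector in second:
        second_map.setdefault(normalize(city), sector)  # keep the first sector seen
    for _id, city, sector in third:
        third_map.setdefault(normalize(city), sector)
    results = []
    for row_id, city in first:
        key = normalize(city)
        # Assumption: unmatched cities are kept with an empty sector.
        sector = second_map.get(key, third_map.get(key, ""))
        results.append((row_id, city, sector))
    return results


first = [(500, "Newyork"), (100, "Ediosn"), (200, "Atlanta"), (300, "Greenwoods"),
         (400, "Hitchcok"), (700, "Walmart"), (800, "Madida")]
second = [(1500, "Newyork", "MK"), (5500, "Ediosn", "HH"), (5060, "The Atlanta", "JK"),
          (7500, "The Greenwoods", "DF"), (9500, "Metro", "KK"), (3300, "Kilapr", "MK"),
          (9500, "Metro", "NH")]
third = [(33, "Walmart", "PP"), (20, "Ediosn", "DD")]

for row in match_sectors(first, second, third):
    print(row)
```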
I have a preexisting database that has dates in the format mm/dd/yyyy, but the column is set up as varchar. How can I apply numeric operators to it to find how old someone is as of today's date? I then need to take that value and populate another column on the same row.
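The age calculation itself is easy to get wrong by one year, so here is a small Python sketch of the arithmetic, assuming every stored value really is a valid mm/dd/yyyy string. The function name `age_in_years` is hypothetical, and a string that does not parse raises `ValueError` rather than being silently accepted.

```python
from datetime import date, datetime


def age_in_years(dob_text: str, today: date) -> int:
    """Parse an mm/dd/yyyy string and return the completed years as of `today`."""
    dob = datetime.strptime(dob_text, "%m/%d/%Y").date()
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


print(age_in_years("10/29/1980", date(2014, 10, 28)))  # 33
```

A plain difference in years would report 34 here, even though the birthday is one day away.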
RID, RType, GID
001, m, g01
002, m, g01
002, m, g02
002, m, g03
003, m, g01
003, m, g03
a, T, g01
a, T, g02
a, T, g03
b, T, g02
b, T, g03
b, T, g04
4. Group
GID
g01
g02
g03
g04
I'd like to find the record in table #1 "Matter" that has exactly the same set of "GID" records in table #3 "Security Assignment" as a record in table #2 "Category".
In this case, it is record "002", because "002" in table #1 "Matter" and record "a" in table #2 "Category" both have exactly the same GID records (g01, g02, g03) in table #3 "Security Assignment".
How can I write a query to find all such matching records in table #2?
We have a table setup to track changes that are made to another table, for auditing purposes. How do we compare the most recent record in the change table with the previous record in the change table? Particularly, we have a column named DUE_DATE in the change table and want to identify when the most recent change has a different DUE_DATE than the previous change made.
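The comparison being asked for (latest change versus the one before it) can be sketched in Python as follows. Only DUE_DATE comes from the post. The `changed_at` timestamp used for ordering is an assumed column, and the function name is made up for the example.

```python
def due_date_changed(changes):
    """changes: list of (changed_at, due_date) tuples for one audited row.

    Returns True when the most recent change has a different due date
    from the change immediately before it.
    """
    ordered = sorted(changes, key=lambda change: change[0])
    if len(ordered) < 2:
        return False  # nothing earlier to compare against
    return ordered[-1][1] != ordered[-2][1]


history = [
    ("2014-10-01 09:00", "2014-11-15"),
    ("2014-10-20 14:30", "2014-11-15"),
    ("2014-10-27 08:15", "2014-12-01"),
]
print(due_date_changed(history))  # True
```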
1) TableA, which contains 5 columns (Column1, ..., Column5)
2) TableB, which contains 10 columns (Column1, ..., Column10)
TableB contains millions of rows. I want to select all 5 columns from TableA, but exclude any record whose combination of Column1, Column2, Column3 is present in TableB. I am doing it as below:
select * from TableA a join TableB b on a.column1!=b.column1 and a.column2!=b.column2 and a.column3!=b.column3
But the query takes almost 5 minutes. Is there another approach?
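One common way to express "rows in TableA whose three-column combination is not in TableB" is an anti-join with NOT EXISTS. The sketch below runs that query against an in-memory SQLite database from Python just to show its shape and result. It is not tuned T-SQL, TableB is cut down to three columns for brevity, and the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (column1 INT, column2 INT, column3 INT, column4 INT, column5 INT);
    CREATE TABLE TableB (column1 INT, column2 INT, column3 INT);
    INSERT INTO TableA VALUES (1, 1, 1, 10, 11), (2, 2, 2, 20, 21), (3, 3, 3, 30, 31);
    INSERT INTO TableB VALUES (2, 2, 2), (3, 3, 9);
""")

# Keep a TableA row only if no TableB row matches on all three columns.
rows = conn.execute("""
    SELECT a.*
    FROM TableA a
    WHERE NOT EXISTS (
        SELECT 1
        FROM TableB b
        WHERE b.column1 = a.column1
          AND b.column2 = a.column2
          AND b.column3 = a.column3
    )
    ORDER BY a.column1
""").fetchall()
print(rows)
```

A join on `!=` pairs each TableA row with nearly every TableB row instead of filtering, which would explain both the run time and unexpected results.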
Now I want to compare the results and display the previous term where the student failed:
My current output is as follows. Now I want to compare the latest term, i.e. Term5, with the previous terms and, if a mismatch in the result is found, display it as below:
I am fairly new to SQL and writing queries, so bear with my faults. I am learning on the job, which is good and bad. Below is a query that I have written to obtain some information. The problem arises when we have a patient who goes from Patient Type '1' to Patient Type '2'. This needs to be considered a single visit, and the only way I can think of to make this work is to check whether, for any specific medical record, a dsch_ts is equal to the Admit TS on the next row.
I am not sure how to accomplish something like this, and my Google searches have been fruitless. I attached a spreadsheet with an example of what I am getting.
SELECT DISTINCT TPM300_PAT_VISIT.med_rec_no, TSM040_PERSON_HDR.lst_nm AS 'Last Name', TSM040_PERSON_HDR.fst_nm AS 'First Name',
I want to compare ONLY 1 Column values from 2 tables having more than 4.9 million records. There is a difference of 4000 rows between the 2 tables.
SELECT ID From TABLE1 where ID not in (SELECT DISTINCT ID From TABLE2)
The query above ran for nearly 4.5 hours before I had to cancel it. Is there a better way to write it? I just want to find the ID column values that are missing from TABLE2.
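A set-difference query is one alternative shape for this kind of comparison. The sketch below uses EXCEPT against an in-memory SQLite database from Python, with invented sample IDs, only to show the result. Whether it is faster on 4.9 million rows depends on indexing and is not something the example demonstrates. One behavioural difference worth knowing: `NOT IN` returns no rows at all if the subquery produces a NULL, while EXCEPT does not have that problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (ID INT);
    CREATE TABLE TABLE2 (ID INT);
    INSERT INTO TABLE1 VALUES (1), (2), (3), (4);
    INSERT INTO TABLE2 VALUES (2), (4), (NULL);
""")

# IDs present in TABLE1 but absent from TABLE2.
missing = sorted(
    row[0] for row in conn.execute("SELECT ID FROM TABLE1 EXCEPT SELECT ID FROM TABLE2")
)

# The NOT IN form from the post, for comparison: the NULL in TABLE2 empties the result.
not_in = conn.execute(
    "SELECT ID FROM TABLE1 WHERE ID NOT IN (SELECT DISTINCT ID FROM TABLE2)"
).fetchall()

print(missing, not_in)
```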
I have two tables I am trying to compare as I have created a new procedure to replace an old one and want to check if the new procedure produces similar results.
The problem is that when I run my compare I get false matches. Example:
CREATE TABLE #ABC (Acct VARCHAR(10), Que INT); INSERT INTO #ABC VALUES ('2310947',110), ('2310947',245);
[Code] ....
Which gives me two records when I really do not want any as the tables are identical.
--drop table #temp
create table #temp (id int, idvalue int)
insert into #temp(id,idvalue)
select 1095,75
[code]...
I need to take the idvalue belonging to the maximum id and compare it with the remaining idvalues in the table. I need to check the difference: if any difference is more than 18, I need to raise the flag as a failure; otherwise the whole test is a success. For example, I need to take 63 and compare it with the rest (69, 65, 61, 75), checking whether each difference is less than 18.
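The check described above can be sketched in Python as follows. Only the row (1095, 75) appears in the post; the other ids are invented so that the maximum id carries the value 63, and the function name `check_spread` is hypothetical. The sketch assumes the absolute difference is what matters.

```python
def check_spread(rows, limit=18):
    """rows: list of (id, idvalue). Compare the value at the max id with the rest."""
    ref_id, ref_value = max(rows, key=lambda row: row[0])
    for row_id, value in rows:
        if row_id != ref_id and abs(value - ref_value) > limit:
            return "failure"
    return "success"


# Hypothetical ids; the values 75, 69, 65, 61 and 63 come from the post.
rows = [(1095, 75), (1096, 69), (1097, 65), (1098, 61), (1099, 63)]
print(check_spread(rows))  # success
```

The largest gap here is 75 - 63 = 12, which is within the limit of 18.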
I want to display records from @table1 only when the combination of col2, col3 and col4 is present in @table2. In the case below, I want only these two records as output:
'test1', 'need this record', 25, {d '1901-01-01'}
'test3', 'some longer value', 23, {d '1900-01-01'}
declare @table1 table (
col1 varchar(10) not null,
col2 varchar(200) null,
col3 int not null,
Table 1 has a "Gender" field with "Male" and "Female" in it; table 2 has a "Gender" field with "M" and "F" in it. I need a query to compare the data and list the differences.
I would like to see the 2014-06 results (3rd query), but if the same ssn and acctno exist in 2012-06, 2013-06 and 2014-06, then eliminate them from the results; otherwise show them.
select ssn, acctno From jnj.drgSamples where Channel ='KM' and TrailMonth ='2012-06'
select ssn, acctno From jnj.drgSamples where Channel ='KM' and TrailMonth ='2013-06'
select ssn, acctno From jnj.drgSamples where Channel ='KM' and TrailMonth ='2014-06'
I have written the query below, but it shows only the rows matched across all three queries, whereas I want to display (or delete) the 2014-06 records depending on whether the ssn and acctno exist in 2012-06 and 2013-06.
select c.* from (
(select * From jnj.drgSamples where Channel ='KM' and TrailMonth ='2012-06' ) a
join (select * From jnj.drgSamples where Channel ='KM' and TrailMonth ='2013-06' ) b on a.SSN = b.SSN and a.acctno = b.acctno
join (select * From jnj.drgSamples where Channel ='KM' and TrailMonth ='2014-06' ) c on a.SSN = c.SSN and a.acctno = c.acctno
)
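The filtering rule can be written out with sets to make the intent unambiguous. This Python sketch assumes the reading "drop a 2014-06 row only when the same (ssn, acctno) pair appears in both 2012-06 and 2013-06". If either earlier month alone should be enough, the intersection `&` would become a union `|`. The sample values and the function name are invented.

```python
def rows_to_keep(samples):
    """samples: list of (ssn, acctno, trail_month) rows for Channel 'KM'."""
    by_month = {}
    for ssn, acctno, month in samples:
        by_month.setdefault(month, set()).add((ssn, acctno))
    # Pairs present in both earlier months (assumed reading of the requirement).
    seen_before = by_month.get("2012-06", set()) & by_month.get("2013-06", set())
    return sorted(by_month.get("2014-06", set()) - seen_before)


samples = [
    ("111", "A1", "2012-06"), ("111", "A1", "2013-06"), ("111", "A1", "2014-06"),
    ("222", "B2", "2013-06"), ("222", "B2", "2014-06"),
    ("333", "C3", "2014-06"),
]
print(rows_to_keep(samples))  # [('222', 'B2'), ('333', 'C3')]
```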
I was wondering how Fuzzy Grouping deals with and handles first name similarities. Is there a way to configure it so that Anthony = Tony, Bill = William, etc.? I created a simple package with several rows containing similar first names and ran the fuzzy grouping on the first name column. I received only one possible duplicate of Will = William which was at 56%. I lowered the threshold down to 1% and still only one match.
Now I understand and appreciate the reasons for this but was wondering if this type of situation was considered and a way of dealing with it is available.
1. Copy old data from each table in LiveDB to the same table in ArchiveDB.
2. Delete the data from each table in LiveDB which is in ArchiveDB.
Both DBs use the SIMPLE recovery model.
Each table has a clustered PK on a single int column, in both DBs.
The tables with varchar(max) columns are taking a very long time to copy over.
Is there anything I can change in the ArchiveDB to make it run faster?
It is the insert that is taking the time. I've tried dropping the clustered PKs in the ArchiveDB tables and then rebuilding them afterwards, but it has not made any difference. After all, I am adding data to the ArchiveDB in clustered index order, so I wouldn't have expected it to.
I can change the ArchiveDB but cannot touch the schema/settings of the LiveDB.
I've got a fairly standard query that does a group by a type column, and then sums the lengths of a VARCHAR column. I'd like to add into that a concatenated version of the string always concatenating in primary key order. Is that possible?
There are a few databases I work with that have been designed where varchar columns are used to store what actually displays on the front end as Ints, Decimals, Varchars, Datetimes, checkboxes.
I often have to write integrations with these databases bringing data in and prefer to validate the data whilst loading from the staging tables.
I have seen all sorts of values being passed into the staging tables that will load into the target database because the columns are all varchars, but the values don't display on the front end because the app actively filters bad values out.
What I would like is for my validation scripts to warn up front of potentially invalid datatypes. My problem is that, for example, the ISNUMERIC() function returns 1 for the value ',1234', but CONVERT(NUMERIC, ',1234') or CAST(',1234' AS NUMERIC) will fail with an "Error converting data type varchar to numeric" message.
I've been trying to locate a set of reliable datatype testing functions that will reliably determine if a varchar can be converted to a given data type or not.
I am putting a SELECT statement together where I need to evaluate a results field, to determine how the color indicator will show on a SSRS report. I am running into a problem when I try to filter out any non-numeric values from a varchar field, using a nested CASE statement.
For example, this results field may contain values of '<1', '>=1', '1', '100', '500', '5000', etc. For one type of test, I need a value of 500 or less to be shown as a green indicator in a report, and any value over that would be flagged as a red. Another test might only allow a value of 10 or less before being flagged with a red.
This is why I setup a CASE statement for an IndicatorValue that will pass over to the report to determine the indicator color. Using CASE statements for this is easier to work with, and less taxing on the report server, if done in SQL Server instead of nested SSRS expressions, especially since a variety of tests have different result values that would be flagged as green or red.
I have a separate nested CASE statement that will handle any of the values that contain ">" or "<", so I am using the following to filter those out and then convert the rest to an int value, to determine what the indicator value should be. Here is the line of the script that is erroring out:
case when (RESULT not like '%<%') or (RESULT not like '%>%') then CASE WHEN (CONVERT(int, RESULT) between 0 and 500) THEN '2' ELSE '0'
The message I am getting is: Conversion failed when converting the varchar value '<1' to data type int.
I thought a "not like" statement would not include those values for converting to an int, but that does not seem to be working correctly. I did also try moving the not to show as "not RESULT like", and that did not change the message.
How can I filter out non-numeric values before converting the rest of the varchar field (RESULT) to int, so that it is only converting actual numbers?
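One thing the quoted condition shows: `(RESULT not like '%<%') or (RESULT not like '%>%')` is true for '<1', because '<1' contains no '>', so the conversion is still attempted. The Python sketch below illustrates the "test first, convert second" order on the sample values from the post. The function name `indicator` and the choice to return `None` for non-numeric input are assumptions made for the example.

```python
from typing import Optional


def indicator(result: str, limit: int = 500) -> Optional[str]:
    """Return '2' for 0..limit, '0' above it, and None for non-numeric text."""
    text = result.strip()
    # Only plain ASCII digits are treated as convertible; '<1' and '>=1' are not.
    if not (text.isascii() and text.isdigit()):
        return None
    return "2" if 0 <= int(text) <= limit else "0"


for value in ["<1", ">=1", "1", "100", "500", "5000"]:
    print(value, indicator(value))
```

Values such as '<1' and '>=1' come back as `None` here, leaving them for the separate branch that handles comparison operators.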
I need to extract specific text elements from a varchar column. There are three keywords in any given string: "wfTask," "wfStatus" and "displayReportFromWorkflow." "wfTask" and "wfStatus" can appear multiple times, but always as a pair, and each will be followed by "==" (with or without surrounding spaces). "displayReportFromWorkflow" is always followed by "(" and there can be spaces on either side. The text elements will be between a pair of double quotes, following one of the keywords. For each row, I need to return the task, status and report name.
Output:
rowID, Task, Status, ReportName
----- --------- ------- ------------------------
1, Issuance, Issued, General Permit
2, Issuance, Issued, Capacity Letter Type III
2, Review, Denied, Capacity Letter Type III
I started with a string splitter using the double quote character, referencing elements "i" and "i+1" where the text like '%wfTask%' or '%wfStatus%' or '%displayReportFromWorkflow%', but the case of multiple task/status in a row has confounded me so far.
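The multiple task/status case is easier to reason about with pattern matching than with positional splitting. The Python sketch below shows one way to pair the values, assuming tasks and statuses appear in matching order within a row. The input string is invented, since the post does not show the source data, and the function name `extract` is hypothetical.

```python
import re

# Keyword, optional spaces, '==' or '(', optional spaces, then a quoted value.
TASK = re.compile(r'wfTask\s*==\s*"([^"]*)"')
STATUS = re.compile(r'wfStatus\s*==\s*"([^"]*)"')
REPORT = re.compile(r'displayReportFromWorkflow\s*\(\s*"([^"]*)"')


def extract(row_id, text):
    """Return one (rowID, Task, Status, ReportName) tuple per task/status pair."""
    report = REPORT.search(text)
    report_name = report.group(1) if report else None
    tasks = TASK.findall(text)
    statuses = STATUS.findall(text)
    # Assumption: the nth wfTask belongs with the nth wfStatus.
    return [(row_id, task, status, report_name) for task, status in zip(tasks, statuses)]


# Hypothetical input text, shaped after the description in the post.
sample = ('(wfTask == "Issuance" && wfStatus=="Issued") || '
          '(wfTask=="Review" && wfStatus == "Denied") '
          'displayReportFromWorkflow ( "Capacity Letter Type III" )')
for row in extract(2, sample):
    print(row)
```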
Because of a limitation in a piece of software I'm using, I need to take a large varchar field and force a carriage return/line break in the returned SQL. Allowing for a line size of approximately 50 characters, I thought the approach would be to first find the spaces in the data, so as not to split a line in the middle of a word.
--===== Simulate a passed parameter
DECLARE @Parameter VARCHAR(8000)
SET @Parameter = (select a_notes from dbo.notestuff as notes where a_id = '1')
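The wrapping rule (break at a space so that no line exceeds about 50 characters) can be sketched in Python with the standard `textwrap` module. This only illustrates the intended output, not a T-SQL solution, and the sample note text and the function name `wrap_notes` are invented.

```python
import textwrap


def wrap_notes(text: str, width: int = 50) -> str:
    """Break text at spaces so no line exceeds `width`, joined with CR/LF."""
    lines = textwrap.wrap(text, width=width)
    return "\r\n".join(lines)


# Hypothetical note text standing in for the a_notes column.
note = ("Customer called to report that the replacement part arrived damaged "
        "and asked for a second shipment to be sent to the warehouse address.")
print(wrap_notes(note))
```

A single word longer than the width is split by `textwrap` by default, which may or may not suit the receiving software.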