Finding Duplicate Entries In A Smart Way - By Comparing First Two Words
Jul 20, 2005
What is the best way to compare two entries in a single table where
the two fields are "almost" the same?
For example, I would like to write a query that would compare the
first two words in a "company" field. If they are the same, I would
like to output them.
For example, "20th Century" and "20th Century Fox" in the company
field would be the same.
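A cursor should not be needed for this kind of comparison. What follows is only a minimal sketch, assuming a hypothetical `companies` table and using Python's built-in sqlite3 module in place of SQL Server so it can be run as-is. The `first_two_words` helper is an illustrative function registered from Python; on SQL Server the same key would have to be built with string functions such as CHARINDEX and LEFT.

```python
import sqlite3

def first_two_words(text):
    """Return the first two whitespace-separated words, lower-cased."""
    return " ".join(text.lower().split()[:2])

conn = sqlite3.connect(":memory:")
# Expose the Python helper to SQL under the same (hypothetical) name.
conn.create_function("first_two_words", 1, first_two_words)
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, company TEXT)")
conn.executemany(
    "INSERT INTO companies (company) VALUES (?)",
    [("20th Century",), ("20th Century Fox",), ("Acme Corp",), ("Acme Tools",)],
)

# Self-join: pair rows whose first two words match. The a.id < b.id condition
# avoids pairing a row with itself and listing the same pair twice.
pairs = conn.execute(
    """
    SELECT a.company, b.company
    FROM companies AS a
    JOIN companies AS b
      ON first_two_words(a.company) = first_two_words(b.company)
     AND a.id < b.id
    ORDER BY a.id, b.id
    """
).fetchall()
print(pairs)
```

A plain LIKE on the whole field would also match on one shared word or on a partial word, which is why the sketch compares an extracted two-word key instead.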
How do I do this? Do I need to use a cursor? Is it as simple as using
"Like?"
Yet another simple query that is eluding me. I need to find records in a table that have the same first name and last name. Because the table has a primary key, these people were either entered twice or they share the same first and last name.
How could you query this:
ID fname lname
10001 Bill Jones
10002 Joe Smith
10003 Sue Jenkins
10004 John Sanders
10005 Joe Smith
10006 Harrold Simpson
10007 Sue Jenkins
10008 Sam Worden
and get a result set of this:
ID fname lname
10002 Joe Smith
10005 Joe Smith
10003 Sue Jenkins
10007 Sue Jenkins
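One common way to get that result set is to group on the name columns, keep the groups with more than one row, and join back to the table to recover the IDs. The sketch below assumes the table is called `people` (the post does not name it) and runs the query through Python's sqlite3 module; the SELECT itself is plain SQL and should carry over to SQL Server largely unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, fname TEXT, lname TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", [
    (10001, "Bill", "Jones"), (10002, "Joe", "Smith"), (10003, "Sue", "Jenkins"),
    (10004, "John", "Sanders"), (10005, "Joe", "Smith"), (10006, "Harrold", "Simpson"),
    (10007, "Sue", "Jenkins"), (10008, "Sam", "Worden"),
])

# The derived table d lists each name pair that occurs more than once;
# joining back to people returns every row that carries such a pair.
duplicates = conn.execute("""
    SELECT p.id, p.fname, p.lname
    FROM people AS p
    JOIN (SELECT fname, lname
          FROM people
          GROUP BY fname, lname
          HAVING COUNT(*) > 1) AS d
      ON d.fname = p.fname AND d.lname = p.lname
    ORDER BY p.fname, p.lname, p.id
""").fetchall()
for row in duplicates:
    print(row)
```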
Hi, I am searching for the most easy SQL solution:
Let's say I have 500000 rows of cars in one table with 30000 different car IDs. This table contains no keys and no date values, but I need the last row for each of these 30000 different cars. How do I get them without changing the table structure and without using cursors?
I'm extracting data from a log (log_history) of patients where nurses perform various actions on a call, such as assessing and reassessing, despatching etc. This is the script:
Select L.URN, LH.THE_TIMESTAMP, LH.ACTION_TYPE, LH.ACTION_BY, LH.ACTION_REQD, LH.NOTE, em.position_type_ref
From LOG L
Join Log_history LH on (L.URN = LH.LOG_URN)
left outer join employee em on (em.code = LH.action_by)
Where (L.Taken_at >= :DateFrom and L.Taken_at <= :DateTo)
and (LH.ACTION_TYPE = 'D')
and (em.position_type_ref = 'NU')
Order By L.URN ASC, LH.THE_TIMESTAMP DESC
The result I get shows duplicate 'timestamp' entries and I only want to return unique timestamp entries. Does anyone have any ideas? I'm self-taught and have hit a wall.
I have an application that allows the user to enter data into a table. There are multiple users so I put in some code that, I thought, would keep 2 users from creating a new record at the same time. The IDs for the records are identical and this is causing a problem.
The IDs are in the format of ####-mmyy. at the start of each month the #### part goes back to 1.
We tried a test today where we had 2 users click on the New button at exactly the same time. The IDs that were created were identical. Is there any way on the database side that I can prevent this from happening?
Here is how I create the new record id:
I get the MAX(ID) from the table, add 1 to the ID, and then insert a new record with the new ID into the table.
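Reading MAX(ID) and inserting are two separate steps, so two users can read the same maximum before either insert lands. One database-side safeguard is a unique constraint on the ID column, so the second insert fails and can be retried with a fresh ID. The sketch below is an illustration only: the `records` table and the helper names are made up, and it uses Python's sqlite3 module rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint is the actual safeguard: the database refuses a second
# row with the same ID no matter how the application computed it.
conn.execute("CREATE TABLE records (id TEXT NOT NULL UNIQUE, note TEXT)")

def next_id(conn, mmyy):
    """Compute the next ####-mmyy ID from the current maximum for that month."""
    row = conn.execute(
        "SELECT MAX(CAST(substr(id, 1, 4) AS INTEGER)) FROM records WHERE id LIKE ?",
        ("%-" + mmyy,),
    ).fetchone()
    return "%04d-%s" % ((row[0] or 0) + 1, mmyy)

def insert_record(conn, mmyy, note, max_attempts=5):
    """Insert with a fresh ID, retrying if another writer took that ID first."""
    for _ in range(max_attempts):
        new_id = next_id(conn, mmyy)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT INTO records (id, note) VALUES (?, ?)", (new_id, note))
            return new_id
        except sqlite3.IntegrityError:
            continue  # ID was taken in the meantime; recompute and try again
    raise RuntimeError("could not allocate a unique ID")

first = insert_record(conn, "0705", "user A")

# Simulate the race: user B computed the same ID before user A's insert landed.
try:
    with conn:
        conn.execute("INSERT INTO records (id, note) VALUES (?, ?)", (first, "user B"))
    collided = False
except sqlite3.IntegrityError:
    collided = True

second = insert_record(conn, "0705", "user B")
print(first, collided, second)
```

The retry loop keeps the existing MAX + 1 scheme; the constraint simply guarantees that a collision surfaces as an error instead of a duplicate row.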
I have an issue where certain parts of the data are repeated several times in the results of my query. Without providing my SQL code for now, could anyone suggest the most likely reason(s) for data being duplicated?
I am a newb at MS SQL and was hoping someone could help me eliminate duplicate PRODUCT.PRODUCT values from this statement. I have tried using DISTINCT with the same results. The ProductImage table is causing this, because the duplicates come from the PRODUCT.PRODUCT rows that have more than 1 image.
If anyone could rewrite this statement so I can learn from this, it would be most appreciated!
Hi. I have a table with the Login and Logoff times of users, but there can be duplicate Logtimes in the dataset for different products. Because of this I can't do a DISTINCT on the dataset. I need the Product and some other details in my report.
I tried to make two datasets: one for the SELECT DISTINCT and one for the other details.
But the problem is: in my report, I need a table where I sum the Logintime per day, and in another column I calculate with data from the other dataset (Logtime + data from dataset2). But this doesn't work, so I think that it is not possible to join 2 datasets in one table.
Hi! I am joining 3 tables in SQL. I am getting the results I want, except that they are duplicated: the resulting table from my stored procedure has 3 rows that have the same bulletin. How do I filter the stored procedure to output only the rows that don't have duplicate entries for the column 'Bulletin'? Thanks. Here is my stored procedure:
PROCEDURE [dbo].[spGetCompBulletins]
@Userid uniqueidentifier OUTPUT,
@DisplayName varchar(200)
AS
SELECT * FROM dbo.UserProfile
INNER JOIN dbo.bulletins ON dbo.UserProfile.UserId = dbo.bulletins.Userid
INNER JOIN dbo.Associations ON dbo.Associations.BusinessID = dbo.bulletins.Userid
WHERE UserProfile.DisplayName = @DisplayName and Userprofile.Userid = @Userid
ORDER BY Bulletins.Bulletin_Date
Return
Hi all.. I've been scouring the forums for about 6 hours to no avail. This is a really simple question. I'm trying to have a registration page that lets the user input name, email, desired username, and password. I want to check the username and email fields to make sure people cannot sign up twice. So from what I've gathered I have a couple of options:
1) I can set up a unique constraint on the database columns, 2) I can run a select statement before inserting, 3) I can store the whole database column in a variable and then search through it.
My question is how to do option 2? All of my transactions are through a SqlDataSource object in C#.
Below is a snippet of MS SQL inside ASP that retrieves commodity info such as product names and related information and returns the results in an ASP page. My problem is that with certain searches, elements returned in the synonym field repeat. For instance, on a correct search I get back green, red, blue, and yellow, which is correct. On another similar search with a different commodity, say for material, I get Plastic, Glass, Sand - Plastic, Glass, Sand - Plastic, Glass, Sand. I want to remove the repeating elements returned in this field. I hope this makes sense.
PS I tried to use distinct but with no luck I want just one of each in the example below.
Thanks in Advance!
Scott
==============================
SQL = ""
SQL = "SELECT B.CIMS_MSDS_NUM," & _
      "A.COMMODITY_NUMBER, " & _
      "B.CIMS_TRADE_NME," & _
      "B.CIMS_MFR_NME," & _
      "B.CIMS_MSDS_PREP_DTE," & _
      "B.APVL_CDE," & _
      "COALESCE(C.REGDMATLCD,'?') AS DOTREGD," & _
      "COALESCE( D.CIMS_TRADE_SYNM,'NO SYNONYMS') AS SYNONYM, " & _
      "A.MSDS_CMDTY_VERIF, " & _
      "A.CATALOG_ID " & _
      "FROM ( MATEQUIP.VMSDS_CMDTY A " & _
      " RIGHT OUTER JOIN MATEQUIP.VCIMS_TRD_PROD_INF B " & _
      " ON A.CIMS_MSDS_NUM = B.CIMS_MSDS_NUM " & _
      " LEFT OUTER JOIN MATEQUIP.VDOT_TRADE_PROD C " & _
      " ON A.CIMS_MSDS_NUM = C.MSDSNUM " & _
      " LEFT OUTER JOIN MATEQUIP.VCIMS_TRD_PROD_SYN D " & _
      " ON B.CIMS_MSDS_NUM = D.CIMS_MSDS_NUM) "
Hi, I am trying to insert entries into a table which has a composite primary key, and I am inserting on a UID basis.
INSERT INTO TABLE_B (TABLE_B_UID, NUM_MIN, NUM_MAX, BIN, REGN_CD, PROD_CD, CARD)
(SELECT UID, LEFT(NUM_MIN,16), LEFT(NUM_MAX,16), BIN, REGN_CD, PROD_CD, CARD
FROM TABLE_A
WHERE UID NOT IN (SELECT TABLE_B_UID FROM TABLE_B))
When I run the insert, it tries to insert duplicate entries and gives me an error. Since I am new to SQL Server 2000 I need some help. I tried IF NOT EXISTS and EXCEPT, but I guess I have the syntax wrong.
I have a table with no primary key and I just want to see all the duplicate entries on the basis of two columns. Can anyone suggest how I should go about it?
Can anyone provide me with the syntax for the same? I have only 1 table, say ISSR_TBL, and two columns, MIN and MAX, which I want to use to delete the duplicate rows.
I have a database being populated by hits to a program on a server. The problem is each client connection may require a few hits in a 1-2 second time frame. This is resulting in multiple database entries - all exactly the same, except the event_id field, which is auto-numbered.
I need a way to query the records without duplicates. That is, any records exactly the same except event_id should only return one record.
Is this possible??
Thank you,
Barry
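Since the rows differ only in event_id, grouping on every other column collapses each set of repeats to one row, and MIN(event_id) picks one representative ID per set. A small sketch of that idea follows, run through Python's sqlite3 module; the `events` table and its `client`, `action` and `hit_time` columns are invented for the example, as the post does not list the real ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id INTEGER PRIMARY KEY AUTOINCREMENT,
        client   TEXT,
        action   TEXT,
        hit_time TEXT
    )
""")
conn.executemany(
    "INSERT INTO events (client, action, hit_time) VALUES (?, ?, ?)",
    [
        ("10.0.0.1", "login", "2005-07-20 10:00:00"),
        ("10.0.0.1", "login", "2005-07-20 10:00:00"),
        ("10.0.0.1", "login", "2005-07-20 10:00:00"),
        ("10.0.0.2", "login", "2005-07-20 10:05:00"),
    ],
)

# Group on every column except the auto-numbered one; each group yields one row.
rows = conn.execute("""
    SELECT MIN(event_id) AS first_event_id, client, action, hit_time
    FROM events
    GROUP BY client, action, hit_time
    ORDER BY first_event_id
""").fetchall()
for row in rows:
    print(row)
```

If the event_id is not needed in the output at all, SELECT DISTINCT over the remaining columns would give the same set of rows.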
Here is my situation. I have a table in my application that pairs users with cars they like. We'll call this table Favorites. A user can browse the site and designate as many cars as they want as favorites. For example, a user can go to the Honda Accord page and add that as a favorite car and then go to the Toyota Camry page and add that as a favorite car. However, if he/she goes to that Honda Accord page and clicks the "Add to Favorites" button again, at the present state of my application, it will just add another entry into the Favorites table with a duplicate pairing. So, if I were to datalist the table to generate a listing of all favorites belonging to a certain user, he/she may be shown superfluous duplicate entries. Not to mention that it takes up valuable database space and does not look very professional.
In my Favorites table, the 3 fields are:
favoriteId (set as primary key)
userId
carId
I've been thinking about this for a while and I've come up with 2 solutions. I'm a newbie to ASP.NET/programming so I don't have enough insight to make a decision or to even think up other alternatives.
1) Check proactively by doing a SELECT favoriteID FROM Favorites WHERE userId = x and carId = y (where x and y are variables). If I get a null return, it means I can go ahead and let the user add the car as a favorite in the database. If I get a valid value, then it means the same pairing already exists, so I exit out without updating the table.
2) Check reactively by forcing an exception whenever a user tries to enter a duplicate pairing. I'm not sure how to do this, but perhaps, instead of making "favoriteId" a primary key, I can make a primary key pairing of "userId" and "carId". Then trying to do an insert with a primary key that already exists won't work, since primary keys by definition are unique.
Now, I expect some concurrent users on my site, so I must take into consideration the pros and cons of each and determine which is more efficient. Checking proactively will force a check even if the table does not contain a duplicate pairing of user and car. However, having a composite primary key may be more expensive from a database point of view and may slow down lookups, etc. Or maybe neither has significant benefits, in which case I'd rather go with proactive, since I've already coded it and it works fine. Or maybe there is a third alternative, which I did not think of. Which method do programmers usually take and which is better practice? TIA for your help.
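A possible third alternative is to keep favoriteId as the primary key and add a separate unique constraint on the (userId, carId) pairing, so the database rejects a duplicate even when two requests arrive at the same moment. The following is a rough sketch of that, using Python's sqlite3 module rather than ASP.NET and SQL Server; the `add_favorite` helper is a made-up name for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Favorites (
        favoriteId INTEGER PRIMARY KEY,
        userId     INTEGER NOT NULL,
        carId      INTEGER NOT NULL,
        UNIQUE (userId, carId)  -- one row per user/car pairing
    )
""")

def add_favorite(conn, user_id, car_id):
    """Return True if the pairing was added, False if it already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Favorites (userId, carId) VALUES (?, ?)",
                (user_id, car_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the unique constraint rejected a duplicate pairing

results = [
    add_favorite(conn, 1, 100),  # Honda Accord page
    add_favorite(conn, 1, 200),  # Toyota Camry page
    add_favorite(conn, 1, 100),  # Honda Accord page again
]
count = conn.execute("SELECT COUNT(*) FROM Favorites").fetchone()[0]
print(results, count)
```

The proactive SELECT can still be kept for a friendlier message to the user, but on its own it leaves a gap between the check and the insert; the constraint closes that gap.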
Someone ran an update statement multiple times so there are multiple entries in the table. What is the quickest way to track down the multiple entries? I would only want to see where timein and timeoff exist in the table multiple times for the same id. So this would be a duplicate
Hi all, I have two tables - Planning and Appointments:
Planning - contains a list of planned items. Used to define boundaries for a work day and defines, based on type, what can be done for each item.
Id,
TypeId - the type of the planned item
BeginTime DateTime - begin date and time of the planned item
EndTime DateTime - end date and time of the planned item
In the Planning table we can have as many records per day as we need:
1, First Meeting, 1 Jan 2008 09:00, 1 Jan 2008 11:00
2, First Meeting, 1 Jan 2008 11:00, 1 Jan 2008 12:00
3, First Meeting, 1 Jan 2008 13:00, 1 Jan 2008 15:00
4, First Meeting, 1 Jan 2008 15:00, 1 Jan 2008 18:00
Appointments - contains a list of appointments
Id,
BeginTime DateTime
EndTime DateTime
1, 1 Jan 2008 09:00, 1 Jan 2008 09:30
2, 1 Jan 2008 10:00, 1 Jan 2008 11:00
3, 1 Jan 2008 11:00, 1 Jan 2008 11:30
4, 1 Jan 2008 14:00, 1 Jan 2008 15:30
What is needed? What I need is to find a way to compare the planned items with the appointments and to return all the periods for which a planned time exists:
Free planned time:
1, 1 Jan 2008 09:30, 1 Jan 2008 10:00
2, 1 Jan 2008 11:30, 1 Jan 2008 12:00
3, 1 Jan 2008 13:00, 1 Jan 2008 14:00
4, 1 Jan 2008 15:30, 1 Jan 2008 18:00
So, having two sets of periods, where one specifies the planning templates and the other the time actually used, I need to find all the periods which can be used for other appointments. I've tried several approaches, but I always faced performance problems.
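The underlying operation is interval subtraction: walk through each planned slot and cut out the parts covered by appointments. To make the logic concrete, here is a small sketch in plain Python rather than SQL; the `free_periods` function and the `at` helper are illustrative names, and it makes no claim about how this would perform inside the database.

```python
from datetime import datetime

def free_periods(planned, appointments):
    """Subtract appointment intervals from planned intervals.

    Both arguments are lists of (begin, end) datetime pairs.
    """
    booked = sorted(appointments)
    free = []
    for slot_begin, slot_end in sorted(planned):
        cursor = slot_begin  # start of the part of the slot not yet examined
        for appt_begin, appt_end in booked:
            if appt_end <= cursor or appt_begin >= slot_end:
                continue  # no overlap with the remaining part of this slot
            if appt_begin > cursor:
                free.append((cursor, appt_begin))  # gap before the appointment
            cursor = max(cursor, appt_end)
            if cursor >= slot_end:
                break
        if cursor < slot_end:
            free.append((cursor, slot_end))  # tail of the slot is still free
    return free

def at(hour, minute=0):
    """Build a datetime on 1 Jan 2008, the day used in the post."""
    return datetime(2008, 1, 1, hour, minute)

planned = [(at(9), at(11)), (at(11), at(12)), (at(13), at(15)), (at(15), at(18))]
appointments = [(at(9), at(9, 30)), (at(10), at(11)), (at(11), at(11, 30)), (at(14), at(15, 30))]

result = [(b.strftime("%H:%M"), e.strftime("%H:%M"))
          for b, e in free_periods(planned, appointments)]
print(result)
```

With the sample rows from the post this yields the four free periods listed above. The nested loop is kept for clarity; because both lists are sorted, a single merged pass would be the natural next step if the row counts are large.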
I am using SQL Server 2000. I have a customer table with the fields CustId, Name, Address, City, StdCode, Phone. I insert entries into this table from an Excel file, and one Excel file will contain thousands of customers. In this table the combination of StdCode and Phone should not be repeated. If I do the check in my VB.Net code, the application gets drastically slow. So I want to write a procedure or trigger for this. What I will do is send all records into the database, and then this trigger or procedure will check for any existing entry with the same combination of StdCode and Phone. If an entry exists, it will delete the new entry or not allow it. Is this possible to do using a trigger or stored procedure?
How can I delete duplicate entries from tables in my database using Query Analyzer, as there are many duplicate entries in my tables, I want to delete them.
I have this 40,000,000 rows table... I am trying to clean this 'Contacts' table since I know there are a lot of duplicates.
At first, I wanted to get a count of how many there are.
I need to compare records where these fields are matched:
MATCHED: (email, firstname) but not MATCH: (lastname, phone, mobile).
MATCHED: (email, firstname, mobile) But not MATCH: (lastname, phone)
MATCHED: (email, firstname, lastname) But not MATCH: (phone, mobile)
I am trying to compare two flat files and extract the new entries into a new file. But in my case there is no key column in either flat file. Is there any way to find the new entries by checksum without key matching?
I searched for all the posts which covered my question - but none were close enough to answer what i'm trying to do. Basically, the scenario is thus;
Table1 contains values for UserID, Account code, and Date.
My query (below) is trying to find all the accounts assigned to a particular user ID, but also those duplicate account codes which belong to a second user ID. The date column would be appended to the result set.
The query I'm using is as follows;
select accountcode, userid, date
from dbo.table1
where exists (select accountcode from dbo.table1 where accountcode = table1.accountcode group by accountcode having count(*) > 1)
and userid = 'x-x-x'
order by accountcode
What I think this produces is a list of all files where a duplicate exists, but of course it leaves out the 2nd UserID...which is crucial.
Hopefully this makes sense. Any insight my fellow DBA's can share would be greatly appreciated!
It seems that there should be a solution for my situation, but for the life of me I can't seem to figure it out.
I need to compare two "like" tables, containing similar data. Tbl 1 is "BOOKED" (which is a snapshot of inventory) and tbl 2 is "CURRENT" (the live - working inventory table). If I write my query as follows, the subsequent result is "duplicate" data.
Code Block
SELECT booked.item, booked.bin, booked.quantity, current.bin, current.quantity
FROM BOOKED
LEFT JOIN CURRENT ON booked.item = current.item
No matter what type of join I use, there is duplicate data displayed for each table. For example, if there are more bins in the BOOKED table that contain a certain product, then the CURRENT table will repeat data, and vice versa.
As follows:
Item Bin Quantity Bin Quantity
12345 A01 500 A01 7680
12345 B01 6 A01 7680
12345 C01 20 A01 7680
54321 G10 1032 E15 1163
54321 G10 1032 F20 523
54321 G10 1032 H30 750
98765 Z20 7000 Z20 8500
98765 Y15 2500 Y15 3000
98765 X10 1200 Y15 3000
What I would like to do is display Bin and Quantity only once and the repeating values as NULL or [BLANK]. Or, to display all of the bins from both tables and only the quantities from each table in relation to the bin found in that table, returning a "0" if no quantity exists.
This is what I'm after:
Item Bin Quantity Bin Quantity
12345 A01 500 A01 7680
12345 B01 6 B01 0
12345 C01 20 C01 0
54321 G10 1032 E15 1163
54321 F20 0 F20 523
54321 H30 0 H30 750
98765 Z20 7000 Z20 8500
98765 Y15 2500 Y15 3000
98765 X10 1200 X10 0
Is this possible? If so, how?
I also might add that it is ok for each table to contain multiple entries for any given item. This is basically being requested as an inventory variance report - inventory before physical count and immediately after physical count - and will only be run once a year.
----------------------------------------------- Just thinking out loud here: What if I created three subqueries, the first containing only BOOKED information, the second containing only CURRENT information and the third being a UNION of both tables? Something like this:
Code Block
SELECT q3.bin, q1.item, ISNULL(q1.quantity, 0) as QTY_BEFORE, ISNULL(q2.quantity, 0) as QTY_AFTER
FROM
(select item, bin, quantity from BOOKED)q1 Left Join
(select item, bin, quantity from CURRENT)q2 on q1.item = q2.item Left Join
(select bin, item from BOOKED UNION CURRENT)q3 on q1.item = q3.item
Order By q1.item
I don't know if I wrote the UNION statement correctly, but I will have to try this when I get back to work...
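For the second layout described earlier (every bin from both tables, with 0 where a table has no quantity for that bin), one approach that avoids the repeated rows is to stack both tables with UNION ALL, putting each quantity in its own column, and then sum per item and bin. The sketch below runs through Python's sqlite3 module with a few rows taken from the sample data; the live table is named `current_inv` here purely as an assumption, since CURRENT is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE booked      (item TEXT, bin TEXT, quantity INTEGER);
    CREATE TABLE current_inv (item TEXT, bin TEXT, quantity INTEGER);
    INSERT INTO booked VALUES
        ('54321', 'G10', 1032),
        ('98765', 'Z20', 7000), ('98765', 'Y15', 2500), ('98765', 'X10', 1200);
    INSERT INTO current_inv VALUES
        ('54321', 'E15', 1163), ('54321', 'F20', 523), ('54321', 'H30', 750),
        ('98765', 'Z20', 8500), ('98765', 'Y15', 3000);
""")

# Each source contributes its quantity to its own column and 0 to the other,
# so summing per (item, bin) lines the two tables up without a join.
variance = conn.execute("""
    SELECT item, bin, SUM(qty_before) AS qty_before, SUM(qty_after) AS qty_after
    FROM (
        SELECT item, bin, quantity AS qty_before, 0 AS qty_after FROM booked
        UNION ALL
        SELECT item, bin, 0, quantity FROM current_inv
    ) AS combined
    GROUP BY item, bin
    ORDER BY item, bin
""").fetchall()
for row in variance:
    print(row)
```

Because the quantities are summed, multiple entries for the same item and bin in either table are added together rather than multiplied across a join.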
I am putting my problem in an example as I Feel it would be clear.
Assume my table PEOPLE has 4 columns and 6 rows, with SlNo being the primary key.
SlNo Name LastName birthdate
1 A B x
2 C B x
3 D E y
4 A E y
5 A B x
6 D E y
Rows 1 and 5 form the 1st pair (A, B, x); rows 3 and 6 form the 2nd pair (D, E, y).
In this scenario, I need to find SlNo values having similar values in other columns. The o/p for above must be: 1 5 0 3 6 0 (0 needs to be included in the output to separate the sets)
(a) IS THIS POSSIBLE TO DO IN ONE SELECT STATEMENT? and HOW? (b) If I create another temp table tempPEOPLE, select the distinct row information of the 2nd, 3rd and 4th columns from the PEOPLE table, and then select the SlNo values where the information matches, I am able to get the o/p 1 5 3 6 without the 0, and I cannot make out the distinct sets in this. HOW DO I FIND THE DISTINCTION IN SETS?
I have a problem with a 3rd party piece of software. Doesn't matter which, really. The problem lies in a table called payments, with a column called txnumber...the newest version of this software fails a check during installation with the message "duplicate txnumber in payment table." Not sure how this could have happened, since there is no way to manually assign the txnumber, but the point is not important. What I'd like to do is figure out a sql script that will return only the duplicate number(s) so that I can either remove or change them manually. Unfortunately, I'm not terribly familiar with sql.
The duplicates that this thread relates to are the kind with duplicate "keyword" entries AND dissimilar field entries; i.e. :
Code:
keyword negative exact broad Phrase
Polo 0 122 4
Polo 0 122 5
I've come up with an SQL query that seems to return all of these duplicates (save one of each type- the 'real', unique entry). However I think I made the query very inefficient. My SQL is very bad; this query will be running over tens of thousands of rows, so if it can be at all optimized I would greatly appreciate your help!
What I have so far is:
Code:
string query1 = "SELECT * FROM TableName"
    + " WHERE EXISTS (SELECT NULL FROM TableName b"
    + " WHERE b.[keyword] = TableName.[keyword] AND b.[negative] <> TableName.[negative]"
    + " OR b.[keyword] = TableName.[keyword] AND b.[exact] <> TableName.[exact]"
    + " OR b.[keyword] = TableName.[keyword] AND b.[broad] <> TableName.[broad]"
    + " OR b.[keyword] = TableName.[keyword] AND b.[phrase] <> TableName.[phrase]"
    + " GROUP BY b.[keyword], b.[broad], b.[exact]"
    + " HAVING Count(b.[keyword]) BETWEEN 2 AND 50000)";
The algorithm seems to check every column of every row in order to determine a duplicate. Seems straightforward to me, but alas slow...
Is there a better/faster way I can do this? Thanks for your help!