We have a data warehouse staging database in which we capture change history for hundreds of tables from a source system. In the source system, records are updated in place, but in our data warehouse we capture these changes by "terminating" the existing record and adding a new record reflecting the changes. In the data warehouse we add two columns to every table -- effective_date and expiration_date -- which indicate the dates the record was in effect in the source system. By convention, an expiration_date of 6/6/2079 means the record is currently still active in the source system. Each day we simply compare yesterday's version of the record (in the data warehouse) against today's version (in the source system). If differences are found in any of the columns, we terminate the record and add a new one, setting those dates appropriately.
In this example, the employee_id column is the natural key in the source system. We add the effective_date and expiration_date in the data warehouse, so those three columns together make up the key in the data warehouse. The employee_name, employee_dept, and last_login_date columns all come from the source system as well.
In the select output, you can follow the trail of changes for each of these three employees. Bob moved from dept 7 to 8 at some point; Frank didn't change departments at all; Cheryl moved from dept 6 to 9 and later back to 6. However, the last_login_date was updated frequently for all these employees.
We've tracked hundreds of tables this way for years, some with hundreds of columns. For optimization purposes, I'm now interested in trimming the fat a bit. That is, we track changes in many columns that we don't really need in our data warehouse. Some of these columns are rapidly-changing, causing all sorts of unnecessary terminate/inserts in the data warehouse. My goal is to remove these columns, reclaim the disk space and increase the ETL speed. So in this example, let's get rid of the last_login_date column.
alter table mytbl drop column last_login_date;

select * from mytbl order by employee_id, effective_date;
Now in the select output, you can see we have many "effective duplicate" records. For example, nothing changed for Bob between 1/1/2014 and 1/31/2014 -- those really should be one record, not three. Here's the challenge: I'm looking for an efficient way to merge these "effective duplicates" together, through set-based sql updates/deletes/inserts (hoping to avoid any RBAR operations). Here's what the table ultimately should look like (cheating to get there):
Note that Bob only has two records (he changed department), Frank only has one record (no changes), and Cheryl has three records (two department changes).
My inclination would be to drop the unwanted columns, then GROUP BY all the remaining columns from the source system and take the MIN effective_date and MAX expiration_date. However, this doesn't work for cases like Cheryl's -- she moved to another department, then back again, so that change history needs to be retained.
As I mentioned, we have hundreds of tables, and I'd like to strip out dozens (maybe hundreds) of unused columns, so ultimately there will be millions of these pseudo-duplicates that need to be merged together. These are huge tables, so I really need to find an efficient set-based approach to this.
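One possible set-based shape for this is a gaps-and-islands pass: group consecutive rows whose remaining columns are identical, then collapse each group to its MIN effective_date and MAX expiration_date. This is only a sketch against the simplified example table above; for the real tables the PARTITION BY and GROUP BY lists would include every remaining source column, and mytbl_merged is just an illustrative target name.

;WITH islands AS (
    SELECT employee_id, employee_name, employee_dept, effective_date, expiration_date,
           -- rows keep the same grp value only while the tracked columns stay unchanged
           ROW_NUMBER() OVER (PARTITION BY employee_id
                              ORDER BY effective_date)
         - ROW_NUMBER() OVER (PARTITION BY employee_id, employee_name, employee_dept
                              ORDER BY effective_date) AS grp
    FROM mytbl
)
SELECT employee_id, employee_name, employee_dept,
       MIN(effective_date)  AS effective_date,
       MAX(expiration_date) AS expiration_date
INTO   mytbl_merged                 -- build the collapsed copy, then swap or rename
FROM   islands
GROUP BY employee_id, employee_name, employee_dept, grp;

The same grouping can instead drive a DELETE of the later rows in each island plus an UPDATE of the surviving row's expiration_date, if rebuilding the table is not an option.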
I have a query that I'm working on, but instead of giving the whole query, I wanted to ask a basic syntax question. If more info is needed, let me know. If you have 2 rows that have a common relationship but differing information in some fields, can you merge them onto one row? I've done this with SUM(CASE) expressions, but I don't want to 'add' anything. In the following example, ActivityID identifies the type of activity. ActivityID can be:
0 = Pick up
1 = Drop Off
2 = Lunch
3 = Break
So if I wanted to see 2 breaks on 1 row in the following example, would this be possible:
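This is generally possible with conditional aggregation, using MAX(CASE ...) instead of SUM so nothing is added. A sketch, assuming a hypothetical Stops table with DriverID, StopTime and ActivityID columns (those names are assumptions, not from the original data):

SELECT DriverID,
       MAX(CASE WHEN ActivityID = 3 AND BreakSeq = 1 THEN StopTime END) AS FirstBreak,
       MAX(CASE WHEN ActivityID = 3 AND BreakSeq = 2 THEN StopTime END) AS SecondBreak
FROM  (SELECT DriverID, StopTime, ActivityID,
              ROW_NUMBER() OVER (PARTITION BY DriverID, ActivityID
                                 ORDER BY StopTime) AS BreakSeq   -- 1st break, 2nd break, ...
       FROM   Stops) AS s
GROUP BY DriverID;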
I'm using a shipping program called Endicia Professional that allows for database manipulation to make my processing easier. I've managed to fix the database here and there, but I've had an issue combining orders from a single customer when they buy more than one item. Ideally I would like it to combine rows when a customer purchases items going to the same address. To avoid combining orders just because the address line is the same (e.g. two people living in the same apartment complex), I thought we could use qualifiers, since the purchase will have a name and order ID that should be unique enough.
Order-id  name  address      sku
1234      John  46 easy ln.  A27
1234      John  46 easy ln.  B32

Results:

Order-id  name  address      sku
1234      John  46 easy ln.  A27,B32
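One way to do this on SQL Server 2005+ is correlated FOR XML PATH concatenation, grouping on the qualifiers (order id, name, address). A sketch, assuming the rows live in a table called Orders (the table name is an assumption):

SELECT o.[Order-id], o.name, o.address,
       STUFF((SELECT ',' + o2.sku
              FROM   Orders o2
              WHERE  o2.[Order-id] = o.[Order-id]
                AND  o2.name       = o.name
                AND  o2.address    = o.address
              FOR XML PATH('')), 1, 1, '') AS sku   -- strip the leading comma
FROM   Orders o
GROUP BY o.[Order-id], o.name, o.address;

On SQL Server 2017 or later, STRING_AGG(sku, ',') does the same thing more directly.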
Hi, I'm having trouble with this. It seems simple enough, but it's not!
I have a source table called Access_table, for example:
Name  Role1  Role2  Role3  Role4  Role5
a     1      0      0      0      0
a     0      0      1      0      0
b     1      0      0      0      0
c     0      1      0      0      0
d     0      0      0      0      1
e     0      0      1      0      0
e     0      1      0      0      0
f     1      0      0      0      0
g     0      0      1      0      0
I need to create a view that finds all the names with more than one role and merges the results into one row, for example:

Name  Role1  Role2  Role3  Role4  Role5
a     1      0      1      0      0
e     0      1      1      0      0
I cannot change the information in the source table, and the results need to be in a view because the roles will change. Every time I try this, I end up duplicating the row again. Can anybody suggest a solution?
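Since the role flags are only ever 0 or 1, MAX acts as a logical OR across a name's rows, which avoids the duplicate-row problem. A sketch of such a view (the view name is an assumption):

CREATE VIEW dbo.MultiRoleNames
AS
SELECT Name,
       MAX(Role1) AS Role1,
       MAX(Role2) AS Role2,
       MAX(Role3) AS Role3,
       MAX(Role4) AS Role4,
       MAX(Role5) AS Role5
FROM   Access_table
GROUP BY Name
HAVING COUNT(*) > 1;          -- only names that appear on more than one row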
I need to populate a table from several sources of raw data. For a given security (stock) it is possible to only receive PARTS of information from each of the different sources. It is also possible to have conflicting data. I am looking to make a composite picture of a given security using the following rules:

1) The goal is to replace all NULL and blank values with data.
2) Order of precedence (from highest to lowest) is non-NULL non-blank --> blank --> NULL.
3) In the case of non-NULL non-blank values that conflict (are different), leave the existing value (even if NULL or blank).

For example, given the following rows:

Symbol  Identity   IdSource  Exchange  Type   SubType  Name
------  ---------  --------  --------  -----  -------  -----------------
TZA     901145102  CUSIP     XNYS      Stock  NULL     TV AZTECA
TZA     901145102  NULL      NULL      NULL   NULL
WSM     969904101  CUSIP     XNYS      Stock  NULL     WILLIAMSSONOMA
WSM     969904101  NULL      XNYS      Stock  NULL     WILLIAMS-SONOMA
WSM                CUSIP     XNYS      Stock  Common   NULL
WSM     NULL       CUSIP     XASE      Stock  NULL     WILLIAMSSONOMA
TYC     902124106  CUSIP     XNYS      Stock  NULL     TYCO
TYC     902124106  CUSIP     XNYS      Stock  NULL     TYCOINTERNATIONAL

I am looking for the following results ('*' indicates a changed value):

Symbol  Identity    IdSource  Exchange  Type    SubType  Name
------  ----------  --------  --------  ------  -------  -----------------
TZA     901145102   CUSIP     XNYS      Stock   NULL     TV AZTECA
TZA     901145102   *CUSIP    *XNYS     *Stock  NULL     *TV AZTECA
WSM     969904101   CUSIP     XNYS      Stock   *Common  WILLIAMSSONOMA
WSM     969904101   *CUSIP    XNYS      Stock   *Common  WILLIAMS-SONOMA
WSM     *969904101  CUSIP     NULL      Stock   Common   NULL
WSM     *969904101  CUSIP     XASE      Stock   *Common  WILLIAMSSONOMA
TYC     902124106   CUSIP     XNYS      Stock   NULL     TYCO
TYC     902124106   CUSIP     XNYS      Stock   NULL     TYCOINTERNATIONAL
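A sketch of one way to apply rules 1-3 one column at a time, assuming a table named Securities holding the rows above (the table name is an assumption): for each Symbol, find the single non-NULL, non-blank value of a column, skip Symbols where those values conflict, and fill only the rows where that column is NULL or blank. The same pattern would be repeated per column.

;WITH singles AS (
    SELECT Symbol,
           MAX(Exchange) AS Exchange_fill
    FROM   Securities
    WHERE  Exchange IS NOT NULL AND Exchange <> ''
    GROUP BY Symbol
    HAVING COUNT(DISTINCT Exchange) = 1      -- rule 3: skip symbols with conflicting values
)
UPDATE s
SET    Exchange = f.Exchange_fill
FROM   Securities s
JOIN   singles f ON f.Symbol = s.Symbol
WHERE  s.Exchange IS NULL OR s.Exchange = '';   -- rules 1 and 2: fill only missing values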
I am still fairly new to SQL, having been tasked with creating a csv file from data now that someone else has left.
I can do the csv export using sqlcmd and I have the query sorted and am pulling out the right data, but it generates two rows, as one of the tables has multiple records per cardholder. See the query below. I know there is a way of doing it with XML PATH, I think, but it has got me slightly confused.
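The usual XML PATH pattern is a correlated subquery that concatenates the many child rows into one delimited column per cardholder. A sketch with hypothetical Cardholders/Cards table and column names (the real query and names are not shown here):

SELECT ch.CardholderID,
       ch.CardholderName,
       STUFF((SELECT ', ' + c.CardNumber
              FROM   Cards c
              WHERE  c.CardholderID = ch.CardholderID
              FOR XML PATH('')), 1, 2, '') AS CardNumbers   -- strip the leading ', '
FROM   Cardholders ch;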
I have 2 columns (ID, Msg_text) in a table where I need to combine every 3 rows into a single row. What would be the best option I have? I know that by using STUFF and XML PATH I can convert all the rows into a single row, but here I'm looking to combine every 3 rows into a single row.
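One option is to number the rows, assign every block of three the same group number with integer division, and then concatenate Msg_text within each group using the same STUFF/XML PATH technique. A sketch, assuming the table is called MyTable and ID defines the row order (both assumptions):

;WITH numbered AS (
    SELECT ID, Msg_text,
           (ROW_NUMBER() OVER (ORDER BY ID) - 1) / 3 AS grp   -- 0,0,0,1,1,1,...
    FROM   MyTable
)
SELECT MIN(ID) AS FirstID,
       STUFF((SELECT ' ' + n2.Msg_text
              FROM   numbered n2
              WHERE  n2.grp = n.grp
              ORDER BY n2.ID
              FOR XML PATH('')), 1, 1, '') AS Msg_text
FROM   numbered n
GROUP BY n.grp;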
I have this query and it works, except I am getting duplicate primary keys with unique column values. I want to combine them so that I have one primary key but keep all the columns. Example:
Key  column 1  column 2  column 3  column 4
A    1         1
A                        2         2
B    2         3
B                        5         5
it should look like:
Key  column 1  column 2  column 3  column 4
A    1         1         2         2
B    2         3         5         5
Here is my query:
SELECT *
FROM [TLC Inventory].dbo.['2014 new$']
WHERE [TLC Inventory].dbo.['2014 new$'].mis_key LIKE '2%'
  AND dbo_Product_Info#description NOT LIKE 'NR%'
  AND dbo_Line_Info#description NOT LIKE 'OBSOLETE%'
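Assuming the unpopulated cells come back as NULL, one way to collapse the duplicate keys is to aggregate each column with MAX, which ignores NULLs, grouped by the key. A sketch using the column names from the example; wrap the original SELECT in a derived table or view in place of the illustrative dbo.MyResults name:

SELECT [Key],
       MAX([column 1]) AS [column 1],
       MAX([column 2]) AS [column 2],
       MAX([column 3]) AS [column 3],
       MAX([column 4]) AS [column 4]
FROM   dbo.MyResults          -- stand-in for the original query's result set
GROUP BY [Key];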
I have a source table in the staging database stg.fact and it needs to be merged into the warehouse table whs.Fact.
stg.fact is not a delta feed; it is basically an intra-day refresh.
Both tables have a last-updated date, so it's easy to see which rows have changed.
It is the new (insert) or changed (update) data that I am interested in; there are no deletions.
As there could be millions of rows to insert or update, this needs to be efficient.
I expect whs.Fact to go to >150 million rows.
When I have done this before, I started with the T-SQL MERGE statement, and that was not performant once I got to this size.
My original option was to do this in SSIS with a Lookup task that marks the inserts and updates and deals with them separately. However, the way I set up the Lookup transformation, the reference data set would need a package variable in the SQL command, and that does not seem possible with the Lookup in 2012! I'm currently looking at the Merge Join transformation, and at any clever basic T-SQL that could work, as this will need to be fast; that's where I think T-SQL may be the better route.
Both tables will have >100,000,000 rows.
Both tables have the last-updated date.
The tables are in different databases but on the same SQL instance.
Each table holds 5 integer columns, one varchar, and one datetime.
Last time I used MERGE it was a wider table with lots of columns, so I don't know if this would be an option.
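For an insert-or-update load at this size, plain T-SQL split into a filtered UPDATE followed by an INSERT of the missing keys is often easier to tune than MERGE. This is only a sketch with hypothetical column names (Id as the business key, Col1/Col2 standing in for the real columns); qualify the table names with their databases as needed.

-- Update rows that already exist in the warehouse and have changed...
UPDATE w
SET    w.Col1 = s.Col1,
       w.Col2 = s.Col2,
       w.LastUpdated = s.LastUpdated
FROM   whs.Fact w
JOIN   stg.fact s ON s.Id = w.Id
WHERE  s.LastUpdated > w.LastUpdated;

-- ...then insert the rows that do not exist yet.
INSERT INTO whs.Fact (Id, Col1, Col2, LastUpdated)
SELECT s.Id, s.Col1, s.Col2, s.LastUpdated
FROM   stg.fact s
WHERE  NOT EXISTS (SELECT 1 FROM whs.Fact w WHERE w.Id = s.Id);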
Dear Gurus,

I have a table with the following entries. Table name = Customer

Name     Weight
-------  ------
Sanjeev  85
Sanjeev  75
Rajeev   80
Rajeev   45
Sandy    35
Sandy    30
Harry    15
Harry    45

I need output as follows:

Name     Weight
-------  ------
Sanjeev  85
Rajeev   80
Sandy    30
Harry    45

OR

Name     Weight
-------  ------
Sanjeev  75
Rajeev   45
Sandy    35
Harry    15

i.e. only distinct Names should display, with only one value of Weight each. I tried GROUP BY on the Name column but it shows me all rows. Could anyone help me with the above? Thanking in advance.

Regards,
Sanjeev
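Since any single Weight per Name is acceptable, grouping by Name alone and aggregating Weight returns one row per name; grouping by both columns (the likely earlier attempt) keeps every row. A sketch:

-- Any single Weight per Name will do, so an aggregate picks one deterministically.
SELECT Name, MAX(Weight) AS Weight
FROM   Customer
GROUP BY Name;
-- Use MIN(Weight) instead if the smaller value is preferred.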
I used the following select statement to get duplicate records on Case_number column
select cases.distinct case_link, cases.case_number from cases group by case_link having case_number > 1
I got the error message that
"'cases.warrant_number' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. AND cases.case_number' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.
Any idea on a better statement to use? Thanks for your help!
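If the goal is to list Case_number values that appear more than once, every non-aggregated column in the SELECT list has to appear in the GROUP BY, and the HAVING clause should test an aggregate. A sketch of that shape:

-- Group by the column being checked for duplicates and keep only groups with more than one row.
SELECT case_number, COUNT(*) AS occurrences
FROM   cases
GROUP BY case_number
HAVING COUNT(*) > 1;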
Hi, I have a table, and this is what I did to get the desired result:
SELECT A.col1, COUNT(A.col1) FROM Tab1 A GROUP BY A.col1 HAVING COUNT(A.col1) > 1
I tried this, but it did not work; it returned col1 as blanks: SELECT A.col1, B.col2, COUNT(A.col1) FROM Tab1 A, Tab2 B WHERE A.col1 = B.col1 GROUP BY A.col1, B.col2 HAVING COUNT(A.col1) > 1
I was looking for all the rows that appear more than once.
Now, the problem:
I have to join this table to another table, Tab2, to get the other details. Tab2 is the table from which I have to pull the customer details like name, address, etc. How should I write this query? Any thoughts? TIA
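One way is to keep the duplicate search in a derived table and join that to Tab2 for the customer details. A sketch, assuming Tab2 carries name and address columns (those column names are assumptions):

-- Find the duplicated col1 values first, then join to Tab2 for the customer details.
SELECT d.col1, d.cnt, B.name, B.address
FROM  (SELECT col1, COUNT(*) AS cnt
       FROM   Tab1
       GROUP BY col1
       HAVING COUNT(*) > 1) AS d
JOIN   Tab2 B ON B.col1 = d.col1;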
Hi. I'm a SQL Server newbie, very experienced with Access, developing an ASP.NET database editor web app. I query the database with a statement more or less in the following form:
SELECT organisations.OrgID, organisations.Name, organisations.whatever
FROM services
INNER JOIN servicegrouping ON services.serviceID = servicegrouping.serviceID
INNER JOIN organisations ON servicegrouping.OrgID = organisations.OrgID
WHERE services.service = x OR services.service = y
In other words, I have a database of organisations. The services offered by the organisations are in a separate table, and I only want to return organisations that offer services X or Y.
Okay, now if I did this in Access, this query would return just one record for each organisation that meets the condition, unless I was to include a field from the services table in the SELECT clause, in which case of course I would get one record for each organisation and unique service offered.
But in MS SQL, the query returns duplicate rows if there is more than one service offered by the organisation that meets the WHERE condition (=x or =y). Why is this, and what do I need to do to my SQL statement to ensure I only get unique rows?
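The join produces one row per matching service per organisation; SELECT DISTINCT (or an EXISTS test against the services table) collapses them back to one row per organisation. A sketch with DISTINCT, keeping x and y as placeholders from the question:

SELECT DISTINCT organisations.OrgID, organisations.Name, organisations.whatever
FROM services
INNER JOIN servicegrouping ON services.serviceID = servicegrouping.serviceID
INNER JOIN organisations ON servicegrouping.OrgID = organisations.OrgID
WHERE services.service = x OR services.service = y;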
I've a query which gets a set of data from multiple tables -
SELECT *
FROM A
INNER JOIN q ON RIGHT(q.name, CHARINDEX('-', REVERSE(q.name)) - 1) = a.id
INNER JOIN t ON t.id = q.id
INNER JOIN s ON q.name = s.name
INNER JOIN l ON s.name = l.name AND t.name = l.name
WHERE A.id = 764
  AND s.name = '764'
I get a repeated number of rows for each id. I have some 136 rows for each q.id (there are 6 q.ids, and hence I get 816 rows instead of 136). These 136 rows are actually divided among these q.ids as
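One way to see which join is multiplying the rows is to count matches per key for each join on its own, for example (a diagnostic sketch for the q-to-s join; repeat the same idea for the other joins):

SELECT q.id, COUNT(*) AS match_count
FROM   q
INNER JOIN s ON q.name = s.name
GROUP BY q.id
ORDER BY match_count DESC;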
Hello, I have a question: what does a statement look like that finds the duplicate rows and combines them? I have a table named PRODUCTS with 3 columns: Cost, Stock, Part_number. I need to find all Part_numbers that are duplicated, combine the rows into one, combine (sum) their stock together in the new row, and take an average of their cost to use as the cost in the new, combined row. Please help me, I am stalled. I've looked all over the internet and could not find anything, and I really need this for a project I cannot finish. I have the following SQL statement:

SELECT part_number, COUNT(part_number) AS NumOccurrences
FROM Products
GROUP BY Part_number
HAVING COUNT(part_number) > 1
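Building on that GROUP BY, aggregating Stock with SUM and Cost with AVG produces the combined row per Part_number; to physically replace the duplicates, the result can be written to a staging table and swapped in. A sketch:

-- One row per Part_number: total stock and average cost across the duplicates.
SELECT Part_number,
       SUM(Stock)      AS Stock,
       AVG(Cost * 1.0) AS Cost     -- * 1.0 avoids integer division if Cost is an int
FROM   Products
GROUP BY Part_number;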
I have a csv file that I need to import daily into a SQL Server 2005 table. Much of the table contents could just be overwritten with the new csv file; however, there is a set of rows within the table that need to be appended to rather than overwritten. There is no primary key in the csv file that can be used. I'm not sure this is the best approach, but what I have been trying to do is append the entire csv file to the existing table and then go back and delete the duplicates. When I run the delete, it does delete the majority of the records but leaves a couple hundred behind. The number left behind varies with each run; I can't seem to identify a pattern here. Running the delete a second time does clean up the rows left behind by the first execution and gives the result I want. Any thoughts as to why this needs to be run twice? Or is a better approach available? Here is my code:

SELECT [Pkg ID], [Elm (s)], [Type Name (s)], [End Exec Date], [End Exec Time], dupcount = COUNT(*)
INTO temppkgactions
FROM pkgactions
GROUP BY [Pkg ID], [Elm (s)], [Type Name (s)], [End Exec Date], [End Exec Time]
HAVING COUNT(*) > 1
DELETE TOP (SELECT COUNT(*) - 1 FROM dbo.temppkgactions WHERE dupcount > 1)
FROM dbo.pkgactions

DROP TABLE temppkgactions
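A single-pass alternative on SQL Server 2005 is a ROW_NUMBER() delete that removes exactly the extra copies within each duplicate group, with no temp table. A sketch using the grouping columns from the query above:

;WITH ranked AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY [Pkg ID], [Elm (s)], [Type Name (s)],
                                           [End Exec Date], [End Exec Time]
                              ORDER BY (SELECT 0)) AS rn
    FROM   dbo.pkgactions
)
DELETE FROM ranked
WHERE  rn > 1;   -- keep one row per group, delete the rest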
I have a large table that consists of the columns zip, state, city, county. The primary key "zip" has duplicates, but the rows are unique. How do I filter out only the duplicate zips? Randy Garland
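Grouping on zip and keeping only groups with more than one row lists the duplicated zips; joining that result back to the table on zip shows the full rows. A sketch (the table name is an assumption):

-- List the zip values that appear on more than one row, together with how many times.
SELECT zip, COUNT(*) AS occurrences
FROM   dbo.ZipTable
GROUP BY zip
HAVING COUNT(*) > 1;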
Hi, I am encountering a problem. There are lots of duplicate rows in the COBOL flat files (due to improper data entry and missing column values) from which I am transforming data to SQL 7.0 tables using DTS. After transformation, can I somehow mark the duplicate rows? It is not for the purpose of eliminating them, but to enter the missing values and make all the rows complete and unique. I have the transformed table as a temporary table. Can I add a column like 'status' and have the column values marked '1' for the repeating rows? Can anyone suggest any possible way of implementing it? Thanks, Nisha
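One possible approach: add the status column, then mark every row whose identifying columns appear more than once, using a GROUP BY derived table (this works on SQL 7.0; the #transformed temp table and col1..col3 column names below are assumptions):

ALTER TABLE #transformed ADD status int NULL
GO

UPDATE t
SET    status = 1
FROM   #transformed t
INNER JOIN (SELECT col1, col2, col3           -- the columns that define a "duplicate"
            FROM   #transformed
            GROUP BY col1, col2, col3
            HAVING COUNT(*) > 1) d
  ON d.col1 = t.col1 AND d.col2 = t.col2 AND d.col3 = t.col3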
I have a problem deleting duplicate rows. I have an identity column in my table; if I try to use a correlated subquery with the DELETE command, it gives an error.
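With an identity column available, one common pattern keeps the lowest identity value per duplicate group and deletes the rest. A sketch with assumed names (mytable, id as the identity, col1/col2 as the columns that define a duplicate):

DELETE FROM mytable
WHERE id NOT IN (SELECT MIN(id)          -- the one row to keep in each group
                 FROM   mytable
                 GROUP BY col1, col2)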
The other problem I have is that I have a date column in my table, and I update that column with the current date and time. If I use a query to fetch records for a particular day, it does not return any rows:
select * from rates where ch_date >='02/11/99' and ch_date<='02/11/99'
If I use CONVERT there are also some other problems. Is there any way to force the date checks to be done excluding the time?
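Because ch_date stores a time as well, a single-day range built from the same date at both ends misses every row stamped after midnight. A half-open range, comparing against the start of the wanted day and the start of the next day, avoids CONVERT entirely. For the example above:

-- Use a half-open range so the time portion no longer matters.
SELECT *
FROM   rates
WHERE  ch_date >= '02/11/99'
  AND  ch_date <  '02/12/99'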
Can anybody reply to the following questions? I want to delete duplicate rows in my table without using a transaction table. And one more question: how do I get yesterday's date using the ISQL window?
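For the second question, yesterday's date can be computed with DATEADD against GETDATE(); a sketch:

-- Yesterday's date: subtract one day from the current date/time.
SELECT DATEADD(day, -1, GETDATE())

-- Date only, as a string (style 101 = mm/dd/yyyy):
SELECT CONVERT(varchar(10), DATEADD(day, -1, GETDATE()), 101)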
This is an imaginary problem that came up while discussing ROWID in Oracle.
Consider a table without a primary key, unique key, or unique index. A row has been inserted into the table many times. I want to delete all but one of the duplicated rows. With any WHERE clause, all of the (duplicated) rows will be deleted. In Oracle I can achieve this using ROWID as follows:
DELETE FROM Table_name
WHERE < all column values >
  AND ROWID <> (SELECT MAX(ROWID)
                FROM   Table_name
                WHERE  < all column values >)
How can this be achieved in MS SQL Server 6.5 ?
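One classic workaround on SQL Server 6.5, which has no ROWID, is to cap how many rows a DELETE may touch with SET ROWCOUNT, removing all but one of the identical rows; another is to SELECT DISTINCT the rows into a work table, delete the originals, and re-insert. A sketch of the ROWCOUNT route, keeping the placeholders from the Oracle example:

SET ROWCOUNT 4            -- number of copies minus one, e.g. 4 if the row exists 5 times
DELETE FROM Table_name WHERE < all column values >
SET ROWCOUNT 0            -- restore normal behaviour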
According to Dr. Codd's golden rules for an RDBMS, one requirement is that one should be able to reach each data value in the database by using the table name, a row identification value, and the column name.
Does MS SQL Server 6.5 satisfy this requirement ?
Also, how many of Dr. Codd's 13 golden rules for an RDBMS does MS SQL Server 6.5 satisfy, and which ones doesn't it?