SQL Server 2012 :: Merging Two Large Tables (More Than 100m Rows)
Aug 18, 2014
SQL 2012
I have a source table in the staging database stg.fact and it needs to be merged into the warehouse table whs.Fact.
stg.fact is not a delta feed; it is basically an intra-day refresh.
Both tables have a last-updated date, so it's easy to see which rows have changed.
It is new (insert) or changed (update) data that I am interested in; there are no deletions.
As there could be millions of rows to insert or update, this needs to be efficient.
I expect whs.Fact to go to >150 million rows.
When I have done this before, I started with the T-SQL MERGE statement, and it was not performant once I got to this size.
My original option was to do this in SSIS with a Lookup task that marks the inserts and updates and deals with them separately. However, when I set up the Lookup transformation, the reference data set needed a package variable in the SQL command, and that does not seem possible with the Lookup in 2012! I am currently looking at the Merge Join transformation, and at any clever basic T-SQL that could work, as this will need to be fast; that's where I think T-SQL may be the better route.
Both tables will have >100,000,000 rows
Both tables have the last updated date
The Tables are in different databases but on the same SQL Instance
Each table holds five integer columns, one varchar, and one datetime.
The last time I used MERGE it was on a wider table with lots of columns, so I don't know if this would be an option.
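For what it's worth, a minimal T-SQL sketch of the insert/update split, assuming a key column FactKey plus columns matching the description above (all names hypothetical):

-- Update rows that exist in the warehouse but are newer in staging
UPDATE w
SET w.Col1 = s.Col1, w.Col2 = s.Col2, w.Col3 = s.Col3,
    w.Col4 = s.Col4, w.Col5 = s.Col5, w.ColV = s.ColV,
    w.LastUpdated = s.LastUpdated
FROM whs.Fact AS w
JOIN stg.fact AS s ON s.FactKey = w.FactKey
WHERE s.LastUpdated > w.LastUpdated;

-- Insert rows that are not in the warehouse at all
INSERT INTO whs.Fact (FactKey, Col1, Col2, Col3, Col4, Col5, ColV, LastUpdated)
SELECT s.FactKey, s.Col1, s.Col2, s.Col3, s.Col4, s.Col5, s.ColV, s.LastUpdated
FROM stg.fact AS s
WHERE NOT EXISTS (SELECT 1 FROM whs.Fact AS w WHERE w.FactKey = s.FactKey);

At this row count, separate set-based UPDATE and INSERT statements like these often outperform a single MERGE, especially with a supporting index on the key in both tables.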
create table dbo.#Status( ID varchar(50), Status varchar(50), EffectiveStartDate datetime, EffectiveEndDate datetime, Is_Current bit )
I want the result as shown in the attached image.
The CREATE TABLE query for the result is: CREATE TABLE dbo.#Result( ID varchar(50), Fee varchar(100), Bill varchar(50), A_Date date, B_Date date, Status VARCHAR(50), EffectiveStartDate datetime, EffectiveEndDate datetime )
I have two tables: Status and FourHistory. The Status table contains a status column with effective start and end dates as history; this column holds history at month level.
The FourHistory table maintains four columns as part of history, using effective start and end dates. Here, history is captured at day level.
Desired result: I want to merge the Status column into the FourHistory table. I have given some sample scenarios that I face, and the third table contains the expected output. How can I achieve this with a T-SQL query?
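A minimal sketch of the overlap join at the heart of this merge, assuming #FourHistory follows the same ID/EffectiveStartDate/EffectiveEndDate pattern as #Status (the names are guesses, since the sample data was truncated above):

SELECT f.ID, f.EffectiveStartDate, f.EffectiveEndDate, s.Status
FROM #FourHistory AS f
JOIN #Status AS s
  ON  s.ID = f.ID
  AND s.EffectiveStartDate <= f.EffectiveEndDate
  AND s.EffectiveEndDate >= f.EffectiveStartDate;

FourHistory rows that span a status change come back once per overlapping status, which is the starting point for splitting them into sub-ranges using the later of the two start dates and the earlier of the two end dates.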
In a Library Management database we have these tables:
1) Document (DocNo, Doc_type, permalink, inDate)
2) Title (id, DocNo, Main_Title, Other_Title)
3) Author (id, Author_Name, Author_Family, Type -- like: main author, translator, ...)
4) Publisher (id, DocNo, Name, Publisedate, address)
5) Subject (id, DocNo, Subject)
6) Description (id, DocNo, ISBN, description) -- one document may have several ISBNs, etc.
In document table I have 500,000 records.
I want to search for a word in these tables. For example, I want to search for 'Computer'; this word may be in the subject, title, description, etc. How can I do this with the best performance?
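A minimal sketch, assuming full-text indexes have been created on the searchable columns (for 500,000 documents, full-text CONTAINS is typically far faster than LIKE '%Computer%', which cannot use an index):

SELECT d.DocNo
FROM Document AS d
WHERE EXISTS (SELECT 1 FROM Title AS t
              WHERE t.DocNo = d.DocNo
                AND CONTAINS((t.Main_Title, t.Other_Title), 'Computer'))
   OR EXISTS (SELECT 1 FROM Subject AS s
              WHERE s.DocNo = d.DocNo
                AND CONTAINS(s.Subject, 'Computer'))
   OR EXISTS (SELECT 1 FROM Description AS ds
              WHERE ds.DocNo = d.DocNo
                AND CONTAINS(ds.description, 'Computer'));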
I run the script below once a day to keep track of row counts over time. I would like to compare the results from today and yesterday to see if anyone deleted more than 20% of the data from any given table. How would I do this? I really don't need the data for more than a day, just to compare the results.
Mon - Run script to collect row counts.
Tue - Run script to collect current row counts into a temp table, compare all row counts in both tables, purge Monday's records, and insert the current counts.
Wed - Run script to collect current row counts into a temp table, compare all row counts in both tables.
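A minimal sketch of the comparison, assuming the daily script writes into a history table such as dbo.RowCounts(TableName, RowCnt, CollectedOn) -- names hypothetical:

-- Flag tables that lost more than 20% of their rows since yesterday
SELECT t.TableName, y.RowCnt AS YesterdayRows, t.RowCnt AS TodayRows
FROM dbo.RowCounts AS t
JOIN dbo.RowCounts AS y
  ON  y.TableName = t.TableName
  AND y.CollectedOn = DATEADD(day, -1, t.CollectedOn)
WHERE t.CollectedOn = CAST(GETDATE() AS date)
  AND t.RowCnt < y.RowCnt * 0.8;

-- Keep only today's snapshot
DELETE FROM dbo.RowCounts
WHERE CollectedOn < CAST(GETDATE() AS date);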
Table structure: col1 IDENTITY (seed=1, increment=1) + a few other columns (col2...col7) + one text column (col8).

I have around 50,000,000 rows per day inserted into table T1. At the end of the day, 40,000,000 rows are deleted. I have to keep the records for 12 months and then archive them. The database serves the web 24/7 and no downtime is allowed.

The IDENTITY column will go out of range (overflow) after less than two years, unless the identity seed is reset to the start value (seed=1, increment=1). At the end of the 12th month, data is archived to another table and only the last month is kept in T1, so T1 enters the new year with data from the last month of the previous year. A few other tables refer to this table using their own field with values from T1's IDENTITY column (referential integrity is not enforced). The identity column in T1 is needed as a unique id for some search actions. Performance is an issue, therefore the bigint data type is used for this identity column rather than decimal.
Another problem I have is how to update one column of the table (1 million rows to be updated out of 2 million) with minimum impact on the users who are querying this table heavily. Needless to say, it is a 24/7 web app with no downtime.
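For the second problem, a minimal batched-update sketch (column names are placeholders): many small transactions keep lock durations short, so concurrent readers are barely affected.

WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.T1
    SET col2 = 'new value'
    WHERE col3 = 'needs update'   -- whatever identifies the 1M target rows
      AND col2 <> 'new value';    -- skip rows already done, so the loop terminates

    IF @@ROWCOUNT = 0 BREAK;
END;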
If F.parent_id (101) = T.team_id (101) and T.team_id (101) = T.parent_folder_id (101), then the output should come out as 'Mobile/c' (this is for F.parent_id = 101).
If F.parent_id = T.team_id and T.team_id != T.parent_folder_id, then the parent_folder_id has to start a search on the team_id column where it gets a match, and pick the Team_name from that corresponding id.
Example: F.parent_id = 202 matches T.team_id (202), but this T.team_id (202) does not match T.parent_folder_id (200), so this T.parent_folder_id (200) has to be searched against T.id (200); if T.id (200) now matches T.parent_folder_id (200), then it has to give the names from the start of the hierarchy.
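The schema is only partly described, so the following is a heavily hedged sketch with guessed table and column names: a recursive CTE that starts at the root rows (team_id = parent_folder_id) and builds the path as it walks down, so each team_id ends up with a full name such as 'Mobile/c'.

;WITH Walk AS (
    SELECT t.id, t.team_id, t.parent_folder_id,
           CAST(t.Team_name AS varchar(4000)) AS TeamPath
    FROM T AS t
    WHERE t.team_id = t.parent_folder_id            -- root: points at itself
    UNION ALL
    SELECT c.id, c.team_id, c.parent_folder_id,
           CAST(w.TeamPath + '/' + c.Team_name AS varchar(4000))
    FROM T AS c
    JOIN Walk AS w ON w.id = c.parent_folder_id     -- child hangs off its parent's id
)
SELECT f.parent_id, w.TeamPath
FROM F AS f
JOIN Walk AS w ON w.team_id = f.parent_id;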
In another forum post, a poster was deleting large numbers of rows from a table in batches of 50,000.
In the bad old days ('80s - '90s), I used to have to delete rows in batches of 500, then 1000, then 5000, due to the size of the transaction rollback segments (yes - Oracle).
I always found that increasing the number of deleted rows in a single statement/transaction improved overall process speed - up to some magic point, at which some overhead in the system began slowing the deletes down, so that deleting a single batch of 10,000 rows took more than twice as much time as deleting two batches of 5,000 rows each.
Are there good rule-of-thumb numbers (or even better, some actual statistics and/or explanations) as to how many records should be deleted in a single transaction/statement for optimum speed? 50,000 - 100,000 - 1,000,000 or unlimited? Are there significant differences between 2008, 2012, and 2014?
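The honest answer is usually "measure on your own system", since the knee of the curve depends on log configuration, indexes, and concurrent load. A minimal timing harness for the experiment (table and column names are placeholders):

DECLARE @BatchSize int = 50000;
DECLARE @Start datetime2 = SYSDATETIME();

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.BigTable
    WHERE CreatedOn < DATEADD(month, -6, GETDATE());

    IF @@ROWCOUNT = 0 BREAK;
END;

SELECT DATEDIFF(second, @Start, SYSDATETIME()) AS ElapsedSeconds;

Rerun with different @BatchSize values to find the magic point the post describes.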
Other than right-clicking on each individual table in SSMS and generating a CREATE script, is there a simple way to generate CREATE TABLE scripts for tables within a given database?
Background: I have a bunch of tables in one database, and I would like to add tables to a second database that have the same names and basic structures of some of the tables from the first database.
I do not need to transfer any data from the tables; this is a separate project that will use a similar data structure. I just want to generate the CREATE TABLE scripts for the 30-ish tables in the first database, and then I'll tweak the scripts as appropriate and run them against the new database.
I inherited a system which has an index on a set of columns that allows more than 900 bytes of data in the key. We know one of the fields can be shortened to shrink the potential key size below 900 bytes.
The problem is that the table is about 120M rows, and the index currently on that column is used for seeks about 2.5 million times a day.
At its simplest, I want to drop the existing index, alter the column to shrink the varchar size, and then rebuild the index on the newly shortened column.
On a smaller, less-used table, I might just do this outside of business hours and call it a day, but I'm concerned that this will take a long time and block a lot of operations.
1) IIRC, shrinking a column, unlike widening it, is much more expensive, even if there are no values that would actually end up truncated. Is this right?
2) I did a few tests on some other smaller (2+ million row) tables and was still able to select data out of the table. I don't think this covered all the read scenarios, but are there known scenarios that simply would not work during an index build?
3) I haven't yet tried DML operations against the table while it's doing either the column update or an index build. Which scenarios would or would not be blocked?
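A minimal sketch of the intended sequence (index and column names hypothetical). With Enterprise Edition, ONLINE = ON allows most reads and writes to continue during the rebuild; the ALTER COLUMN itself takes a schema-modification lock while it validates the existing data against the shorter size:

DROP INDEX IX_BigTable_WideKey ON dbo.BigTable;

ALTER TABLE dbo.BigTable
ALTER COLUMN WideCol varchar(400) NOT NULL;  -- the shortened size

CREATE INDEX IX_BigTable_WideKey
ON dbo.BigTable (WideCol)
WITH (ONLINE = ON);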
I have a query that I'm working on, but instead of giving the query, I wanted to ask a basic syntax question. If more info is needed, let me know. If you have 2 rows that have a common relationship, but differing information in some fields, can you merge them all onto one row? I've done this with Sum(case) expressions, but I don't want to 'add' anything. In the following example, the ActivityID refers to a break. ActivityID can be:
0=Pick up 1=Drop Off 2=Lunch 3=Break
So if I wanted to see 2 breaks on 1 row in the following example, would this be possible:
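As a minimal sketch (table and column names assumed, since the example itself was not included): the same conditional-aggregation pattern works with MAX() instead of SUM(), so nothing gets added -- each CASE just picks a value onto the row.

SELECT DriverID,
       MAX(CASE WHEN ActivityID = 3 AND Seq = 1 THEN ActivityTime END) AS Break1,
       MAX(CASE WHEN ActivityID = 3 AND Seq = 2 THEN ActivityTime END) AS Break2
FROM (
    SELECT DriverID, ActivityID, ActivityTime,
           ROW_NUMBER() OVER (PARTITION BY DriverID, ActivityID
                              ORDER BY ActivityTime) AS Seq
    FROM dbo.Activities
) AS x
GROUP BY DriverID;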
I'm using a shipping program called Endicia Professional that allows for database manipulation to make my processing easier. I've managed to fix the database here and there, but I have had an issue combining orders from a single customer when they buy more than one item. Ideally I would like to have it combine rows when a customer purchases items going to the same address. To avoid an issue where the address line is the same (e.g., two people living in the same apartment complex) and it combines those, I thought we could use qualifiers, as the purchase will have a name and order id that should be unique enough.
Order-id  name  address      sku
1234      John  46 easy ln.  A27
1234      John  46 easy ln.  B32

Results:

Order-id  name  address      sku
1234      John  46 easy ln.  A27,B32
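A minimal sketch of the roll-up using FOR XML PATH, qualifying on order id, name, and address together (table and column names assumed from the example):

SELECT o.[Order-id], o.name, o.address,
       STUFF((SELECT ',' + o2.sku
              FROM dbo.Orders AS o2
              WHERE o2.[Order-id] = o.[Order-id]
                AND o2.name = o.name
                AND o2.address = o.address
              FOR XML PATH('')), 1, 1, '') AS sku
FROM dbo.Orders AS o
GROUP BY o.[Order-id], o.name, o.address;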
Hi, I'm having trouble with this; it seems simple enough, but it's not!
I have a source table called Access_table. Example:
Name  Role1  Role2  Role3  Role4  Role5
a     1      0      0      0      0
a     0      0      1      0      0
b     1      0      0      0      0
c     0      1      0      0      0
d     0      0      0      0      1
e     0      0      1      0      0
e     0      1      0      0      0
f     1      0      0      0      0
g     0      0      1      0      0
I need to create a view that finds all the names with multiple roles and merges the results into one row. Example:

Name  Role1  Role2  Role3  Role4  Role5
a     1      0      1      0      0
e     0      1      1      0      0
I cannot change the information in the source table, and the results need to be in a view, as the roles will change. Every time I try to do this I duplicate the row again. Can anybody suggest a solution?
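A minimal sketch of such a view (assuming the role columns are integers; bit columns would need a CAST before MAX): MAX() per role collapses each name's rows into one, and the HAVING clause keeps only names that appeared more than once, matching the example result.

CREATE VIEW dbo.vw_Access_Merged
AS
SELECT Name,
       MAX(Role1) AS Role1, MAX(Role2) AS Role2, MAX(Role3) AS Role3,
       MAX(Role4) AS Role4, MAX(Role5) AS Role5
FROM dbo.Access_table
GROUP BY Name
HAVING COUNT(*) > 1;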
Hello All, I have an issue with duplicate Contact data. Here it is: I have a Contacts table:

CREATE TABLE CONTACTS (SSN int, fname varchar(40), lname varchar(40), address varchar(40), city varchar(40), state varchar(2), zip int)

Here is some sample data:

SSN: 1112223333, FNAME: FRANK, LNAME: WHALEY, ADDRESS: NULL, CITY: NULL, STATE: NY, ZIP: 10033
SSN: 1112223333, FNAME: NULL, LNAME: WHALEY, ADDRESS: 100 MADISON AVE, CITY: NEW YORK, STATE: NY, ZIP: NULL

How do I merge the 2 rows to create one row as follows, via SQL or T-SQL?

SSN: 1112223333, FNAME: FRANK, LNAME: WHALEY, ADDRESS: 100 MADISON AVE, CITY: NEW YORK, STATE: NY, ZIP: 10033

Pointers appreciated. Thanks.
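A minimal sketch: MAX() ignores NULLs, so grouping on SSN picks each column's value from whichever duplicate row has one (where non-NULL values conflict, MAX silently picks the larger, which is fine for this sample):

SELECT SSN,
       MAX(fname)   AS fname,
       MAX(lname)   AS lname,
       MAX(address) AS address,
       MAX(city)    AS city,
       MAX(state)   AS state,
       MAX(zip)     AS zip
FROM CONTACTS
GROUP BY SSN;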
I need to populate a table from several sources of raw data. For a given security (stock) it is possible to only receive PARTS of information from each of the different sources. It is also possible to have conflicting data. I am looking to make a composite picture of a given security using the following rules:

1) The goal is to replace all NULL and Blank values with data
2) Order of precedence (from highest to lowest) is Non-NULL Non-Blank --> Blank --> NULL
3) In the case of Non-NULL Non-Blank values that conflict (are different), leave the existing value (even if NULL or Blank)

For example, given the following rows:

Symbol  Identity   IdSource  Exchange  Type   SubType  Name
TZA     901145102  CUSIP     XNYS      Stock  NULL     TV AZTECA
TZA     901145102  NULL      NULL      NULL   NULL
WSM     969904101  CUSIP     XNYS      Stock  NULL     WILLIAMSSONOMA
WSM     969904101  NULL      XNYS      Stock  NULL     WILLIAMS-SONOMA
WSM                CUSIP     XNYS      Stock  Common   NULL
WSM     NULL       CUSIP     XASE      Stock  NULL     WILLIAMSSONOMA
TYC     902124106  CUSIP     XNYS      Stock  NULL     TYCO
TYC     902124106  CUSIP     XNYS      Stock  NULL     TYCOINTERNATIONAL

I am looking for the following results ('*' indicates a changed value):

Symbol  Identity    IdSource  Exchange  Type    SubType  Name
TZA     901145102   CUSIP     XNYS      Stock   NULL     TV AZTECA
TZA     901145102   *CUSIP    *XNYS     *Stock  NULL     *TV AZTECA
WSM     969904101   CUSIP     XNYS      Stock   *Common  WILLIAMSSONOMA
WSM     969904101   *CUSIP    XNYS      Stock   *Common  WILLIAMS-SONOMA
WSM     *969904101  CUSIP     NULL      Stock   Common   NULL
WSM     *969904101  CUSIP     XASE      Stock   *Common  WILLIAMSSONOMA
TYC     902124106   CUSIP     XNYS      Stock   NULL     TYCO
TYC     902124106   CUSIP     XNYS      Stock   NULL     TYCOINTERNATIONAL
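One hedged way to apply rules 1 and 3 for a single column (Name shown; the same pattern repeats per column, with the table name assumed to be dbo.Securities): compare the MIN and MAX non-blank value per Symbol -- if they agree there is no conflict and the value can be filled in; otherwise the existing value is left alone.

;WITH G AS (
    SELECT Symbol,
           MIN(NULLIF(Name, '')) AS MinName,
           MAX(NULLIF(Name, '')) AS MaxName
    FROM dbo.Securities
    GROUP BY Symbol
)
UPDATE s
SET Name = COALESCE(NULLIF(s.Name, ''), g.MinName)
FROM dbo.Securities AS s
JOIN G AS g ON g.Symbol = s.Symbol
WHERE g.MinName = g.MaxName;   -- only fill when the group's values do not conflict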
I have some problems with our database which is growing too large, and was hoping someone might have some tips on what I can do!
I have about 100 clients, each logging about 10 000 rows of status logs a day. So after just a few days the db is growing very large.
At present it's manageable, since I don't need to "dig" into the logs more than a few times a day. The system itself is not affected by the size of the log or traffic on the server. But it will increase to about 500 clients in 2004, and 1000-1500 in 2005, so I really need a smarter solution than what I have today to be able to use the log efficiently.
98-99% of these rows are status messages which are more or less garbage during normal operation. But I still need to keep them in case an error occurs and we need to go back an hour or two (maybe a day) to see what went wrong. After 24-48 hours these 98-99% are of no use. I would, however, like to keep the remaining 1-2%; they are messages like startup, errors, etc. Ideally they should be logged in two separate tables by the clients, but unfortunately I cannot make the clients change their logging.
This presents problems on multiple levels. Mainly in searching, which often times out, but also with backup and storage space. At the moment I check the system for errors, and every other day I just truncate the log file. It works, but it's not exactly elegant...
The server is a 1100 MHz P3 / 512MB / Windows 2000 Server / SQL Server 2000. Faster hardware would help, but the problem is more of a "bad design" than "slow hardware" problem.
My log is pretty simple, as follows:
LogId - int - primary key - clustered index
ClientId - int - index asc
LogTypeId - int - index asc
LogValue - nvarchar(2500), no index
LogTimeStamp - datetime - index asc
I have come up with 3 different solutions:
Method 1: Simply run "Delete from db_log where logtypeid <> stuff_I_want_to_keep".
This is the simplest and the one I prefer, but it takes too long to complete. Any tips to speed this process up?
Method 2: Create a trigger which runs something like "Delete from db_log where logtypeid <> stuff_I_want_to_keep and date < today_minus_two_days" every hour or so. This will ensure that the db doesn't grow too large. But if I'm away from work a few days, we might lose data we wanted to keep.
Method 3:
Copy what I want to keep into another table, and empty the log. Sort of like "Insert into db_log_keep stuff_to_keep; drop db_log; create table db_log; " (or truncate, but that takes a long time too)
But then I would be stuck with two log tables, "48-hour_db_log" and "db_log_keep". I could use a view to "union" them so they would appear as a single table, but that's not ideal either.
However, it seems as though this method is what will work best for my setup, unless there are other suggestions?
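A minimal sketch of Method 3 as described (TRUNCATE TABLE deallocates pages rather than logging row deletes, so it is normally very fast):

-- keep the 1-2% worth saving
INSERT INTO db_log_keep (LogId, ClientId, LogTypeId, LogValue, LogTimeStamp)
SELECT LogId, ClientId, LogTypeId, LogValue, LogTimeStamp
FROM db_log
WHERE LogTypeId IN (1, 2);   -- placeholder ids for the startup/error types

TRUNCATE TABLE db_log;

A UNION ALL view over db_log and db_log_keep would then present the two tables as one, as mentioned.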
Method 4:
...eagerly awaiting ideas!!! :-)
(Also, whatever tips and/or links to info on maintaining VLDBs are greatly appreciated.)
I am still fairly new to SQL, having been tasked with creating a CSV file from data now that someone else has left.
I can do the CSV export using sqlcmd, and I have the query sorted and am pulling out the right data, but it generates two rows, as one of the tables has multiple records per cardholder. See the query below. I know there is a way of doing it with XML PATH, I think, but it has got me slightly confused.
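A minimal sketch of the XML PATH pattern (table and column names are placeholders standing in for the real query's): collapsing the multiple records per cardholder into one delimited column gives one output row per cardholder.

SELECT ch.CardholderID, ch.FullName,
       STUFF((SELECT ',' + c.CardNumber
              FROM dbo.Cards AS c
              WHERE c.CardholderID = ch.CardholderID
              FOR XML PATH('')), 1, 1, '') AS Cards
FROM dbo.Cardholders AS ch;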
I have 2 columns (ID, Msg_text) in a table where I need to combine every 3 rows into a single row. What would be my best option? I know that by using STUFF and XML PATH I can convert all the rows into a single row, but here I'm looking to combine every 3 rows into a single row.
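A minimal sketch (table name assumed): number the rows, integer-divide by 3 to form groups, then concatenate within each group.

;WITH Numbered AS (
    SELECT ID, Msg_text,
           (ROW_NUMBER() OVER (ORDER BY ID) - 1) / 3 AS Grp
    FROM dbo.MyTable
)
SELECT Grp,
       STUFF((SELECT ',' + n2.Msg_text
              FROM Numbered AS n2
              WHERE n2.Grp = n1.Grp
              ORDER BY n2.ID
              FOR XML PATH('')), 1, 1, '') AS Combined
FROM Numbered AS n1
GROUP BY n1.Grp;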
I have a query that works, except that I am getting duplicate primary keys with unique column values. I want to combine them so that I have one primary key but keep all the columns. Example:
Key  column 1  column 2  column 3  column 4
A    1         1         NULL      NULL
A    NULL      NULL      2         2
B    2         3         NULL      NULL
B    NULL      NULL      5         5

it should look like:

Key  column 1  column 2  column 3  column 4
A    1         1         2         2
B    2         3         5         5
Here is my query:
SELECT *
FROM [TLC Inventory].dbo.['2014 new$']
WHERE [TLC Inventory].dbo.['2014 new$'].mis_key LIKE '2%'
  AND dbo_Product_Info#description NOT LIKE 'NR%'
  AND dbo_Line_Info#description NOT LIKE 'OBSOLETE%'
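A minimal sketch of the fix, shown with the example's column names (the real columns from ['2014 new$'] and the remaining WHERE conditions would be substituted in): MAX() ignores NULLs, so grouping on the key folds the split rows together.

SELECT mis_key,
       MAX([column 1]) AS [column 1],
       MAX([column 2]) AS [column 2],
       MAX([column 3]) AS [column 3],
       MAX([column 4]) AS [column 4]
FROM [TLC Inventory].dbo.['2014 new$']
WHERE mis_key LIKE '2%'
GROUP BY mis_key;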
I have two SQL tables with the following structure:

Instructor table: INID, INName, INEmail
Instructor Photo table: IPID, INID, IPPhoto

Now my new Instructor table got a new field: INID, INName, INEmail, INPhoto

Question: how can I merge the Instructor Photo table's IPPhoto field into the Instructor table's INPhoto field? Thank you.
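A minimal sketch, assuming the tables are named dbo.Instructor and dbo.InstructorPhoto (the actual table names were not given):

UPDATE i
SET i.INPhoto = p.IPPhoto
FROM dbo.Instructor AS i
JOIN dbo.InstructorPhoto AS p
  ON p.INID = i.INID;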
We have a database that is enabled for mirroring. We need to delete old records, around 500K records from one table, but it has foreign key relationships. How do you do these kinds of deletes on production servers?
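A minimal sketch (table and column names hypothetical): delete the child rows first to satisfy the foreign key, then the parents, in small batches so mirroring and the transaction log can keep up.

DECLARE @Batch int = 10000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@Batch) c
    FROM dbo.ChildTable AS c
    JOIN dbo.ParentTable AS p ON p.ID = c.ParentID
    WHERE p.CreatedOn < '20140101';

    IF @@ROWCOUNT = 0 BREAK;
END;

WHILE 1 = 1
BEGIN
    DELETE TOP (@Batch)
    FROM dbo.ParentTable
    WHERE CreatedOn < '20140101';

    IF @@ROWCOUNT = 0 BREAK;
END;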
We have a large OLAP database, about 2.5 TB spread out over 3 data files on three different drives, and recently someone ran a query that created a table that continued to grow until the data files filled the available disk space (about 3 TB total - 1 TB per drive).
Tonight I plan on running a full backup (it's in Simple mode) and then running a ShrinkFile on all three files sequentially with TRUNCATEONLY, just so it will remove the space after the last extent. Is there any way to tell ahead of time how much space this will recover?
Granted, running a DB shrink is one of those things you just don't do, but this is a one-time shot and unavoidable to get the file size back under control.
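One hedged way to estimate it ahead of time: compare each file's size with its used space. TRUNCATEONLY can only release free space that sits after the last allocated extent, so the FreeMB figure below is an upper bound on what it will recover.

SELECT name,
       size / 128 AS FileSizeMB,
       FILEPROPERTY(name, 'SpaceUsed') / 128 AS SpaceUsedMB,
       (size - FILEPROPERTY(name, 'SpaceUsed')) / 128 AS FreeMB
FROM sys.database_files
WHERE type_desc = 'ROWS';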
I am attempting to do a rather simple purge task on a very large table. This task will need to run daily and delete records older than 6 months from the database. On the first pass this will delete well over 130 million rows. I thought the best way to handle this was to create a proc and call the proc from a SQL Agent job that runs nightly. Here is an example of the script:
CREATE PROCEDURE usp_Purge_WCFLogger
AS
SET NOCOUNT ON;

-- Rename the live table out of the way so new logging hits a fresh table
-- (GO batch separators are not allowed inside a procedure body)
EXEC sp_rename 'dbo.logs', 'logs_work';

-- Copy the expired rows into a backup table
SELECT *
INTO dbo.Logs_Backup
FROM dbo.Logs_Work
WHERE [TIMESTAMP] < DATEADD(month, -6, GETDATE());
I am fetching a large amount of data from Teradata into SQL Server using a linked server, and I am getting the error below:
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)
I have a table with about 466 million rows. In this table there is an int column called WeeksToRetain, as well as an EventDate column containing the date the row was inserted. I am trying to delete all the rows that should be deleted according to WeeksToRetain. For example, if the EventDate is 5/07/15 with a 1 in the WeeksToRetain column, the row should be removed by 5/14/15. I am not sure what days SQL considers the beginning and end of the week. However, the core issue I am having is the sheer mass of deletions I must do and the log growth.
So I am trying to do the delete in batches. More specifically, I want to load a temporary table with a million rows, then use the temporary table to load a sub temporary table with 100,000 rows, and join this temporary table to the table I want to delete from, looping through 10 times to get 1 million. The Logging.EventLog table, which is the table I'm trying to purge, has a clustered index on EventDate (ASC). I would like to run this as a scheduled job with enough time between executions for log backups to run.
DECLARE @i int;
DECLARE @RowCount int;
DECLARE @NextBatchDate datetime;

CREATE TABLE #BatchProcess
(
    EventDate datetime,
    ApplicationID int
);
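A minimal sketch of the batched purge itself (assuming Logging.EventLog has just the two columns described): deleting in fixed-size chunks keeps each transaction small, so the scheduled log backups can truncate the log between batches.

WHILE 1 = 1
BEGIN
    DELETE TOP (100000)
    FROM Logging.EventLog
    WHERE DATEADD(week, WeeksToRetain, EventDate) < GETDATE();

    IF @@ROWCOUNT = 0 BREAK;
END;

The DATEADD on the column side is not sargable; since the clustered index is on EventDate, a variant that first computes a cutoff EventDate per WeeksToRetain value can drive the delete off the index instead.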
New monthly data is loaded, checked, and finally approved after 6 or 7 iterations. Because of this iteration, the monthly data set is added, then deleted, then added, then deleted a few times. Because the table is big, this process takes time; any thoughts on how to make the delete/insert process faster? Keep in mind I cannot do much, because it is a production table and is being accessed by other users for other analysis.
Deletes are done based on Trx_Date, which is a year/month combo, like 201508.
The table has monthly sales by customer aggregated.
The table structure is:
CREATE TABLE [dbo].[Sales](
    [batch_key] [int] NOT NULL,
    [Company_key] [int] NOT NULL,
    [customer_key] [char](22) NOT NULL,
    [Trx_Date] [int] NOT NULL,
    [account] [nvarchar](35) NOT NULL
)
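A minimal sketch of a gentler delete for one month (the load would then re-insert the corrected data): small batches keep lock durations short for the other users querying the table.

DECLARE @TrxDate int = 201508;   -- the year/month being reloaded

WHILE 1 = 1
BEGIN
    DELETE TOP (50000)
    FROM dbo.Sales
    WHERE Trx_Date = @TrxDate;

    IF @@ROWCOUNT = 0 BREAK;
END;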