Transact SQL :: Fast Data Loading With Partition Switching Strategy
Jul 28, 2015
I’m looking for clarity on partition switching. The idea is to run many BULK INSERT statements into tables dbo.X_n in parallel, and when the BULK INSERT for table dbo.X_n completes, switch dbo.X_n into dbo.bigdaddy. I think this is the fastest way to upload a couple of hundred GB of data.
In learning about partition switching (in part) from The Data Loading Performance Guide under Partition SWITCH, I understand the instructions to say that the target should be an exact copy of the main table. But in that same step (#1), I read that we need to move the target (dbo.X_n) off the default filegroup. Then it says I need to match indexes, and it lists the filegroup as something we need to match with the main table.
As an overview of the partition switching strategy, I think the whole point of BULK INSERT with partitioning is to have separate files (in the same filegroup) to enable concurrent uploading, where each table has its own file. Once the upload to a table (dbo.X_n) is complete, we do the partition switch into the main table (dbo.bigdaddy). The data we just uploaded doesn’t actually move, just the metadata for it.
When I read the instructions linked above, what I hear is “Don’t have the same filegroup on your target as the main table. You must have the same filegroup on your target as the main table.”
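For reference, a minimal sketch of the switch step under those rules. The partitioning column, boundary values, and constraint below are assumptions (the thread does not show dbo.bigdaddy's schema); the point is that the staging table must match the main table's columns and indexes, sit on the same filegroup as the target partition, and carry a CHECK constraint proving its rows fit that partition's range.

-- Assumed: dbo.bigdaddy is partitioned on LoadDate and partition 1 covers July 2015.
ALTER TABLE dbo.X_1
    ADD CONSTRAINT CK_X_1_range
    CHECK (LoadDate IS NOT NULL AND LoadDate >= '20150701' AND LoadDate < '20150801');

-- Run the BULK INSERT into dbo.X_1 here; once it finishes, the switch is metadata-only:
ALTER TABLE dbo.X_1 SWITCH TO dbo.bigdaddy PARTITION 1;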
I have been developing a genealogy application using a SQL Server 2000 database and ASP.NET 2.0. In this application a process, Ged.Parse, converts data from the GEDCOM standard format (a hierarchical file format that looks as if it was designed for 80-column cards) into my SQL Server database. As we started to load reasonable quantities of data into the system we found that the online response became abysmal. This problem was fixed by defining a number of secondary indexes (response times dropped to under a second, from previously exceeding 2 minutes and often timing out). Unfortunately, the processing time of Ged.Parse then tripled, and it may now take up to an hour to process a GEDCOM. I believe this is a by-product of defining several indexes that are not needed by Ged.Parse itself, but which are of course maintained as Ged.Parse inserts new records into the database.
I am wondering what my best strategy is, apart from putting Ged.Parse into a background task and just letting it trickle away (which I will probably do anyway). What I'd like is to have Ged.Parse load records without maintaining the secondary indexes, and then create the indexes for the newly added records as a penultimate step just before it makes them available for general use. Of course there is no way to do this: records in a table are either indexed or they are not.
Proposed change: recode Ged.Parse to load data into temporary tables, say NewPeople, NewFacts, etc., with these tables having only the indexes required by Ged.Parse. Then, as the last step, Ged.Parse would run a SQL procedure with code like:
Insert into People Select * From NewPeople
Delete from NewPeople
etc.
This is a reasonable amount of programming, so before I make this change could somebody tell me: will this be significantly faster overall, or is it likely to make little or no improvement compared to the present process, in which Ged.Parse loads data directly into People, Facts, etc.?
Two facts that may influence the answer. First, all record relationships are through GUIDs, so records in NewPeople, NewFacts, etc. would already have their final key values. Second, although Ged.Parse needs to form relationships between records, these relationships are only within the new records (created from the same GEDCOM); Ged.Parse does not need to relate any of these new records to earlier records.
Thank you, Robert Barnes.
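A hedged sketch of that proposed final step, using only the table names from the post. Note that the secondary indexes on People are still maintained while this statement runs, which is the trade-off the question is really asking about; the difference is that they are maintained once in a single set-based insert rather than for every singleton insert Ged.Parse issues today.

BEGIN TRANSACTION
    INSERT INTO People SELECT * FROM NewPeople
    DELETE FROM NewPeople
    INSERT INTO Facts SELECT * FROM NewFacts
    DELETE FROM NewFacts
COMMIT TRANSACTION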
When you load the data into a new partitioned table, can it be done online without any downtime? I ask because I have a few tables that are around 250 GB and more.
I am writing a query where I am identifying different scenarios where data changes between one week and the next. I've set up my result set in the following manner:
PrimaryID   Field Changed   Previous Value   New Value
10003       SKUName         SKU12345         SKU56789
10003       LocationId      Den123           NYC987
etc...
The key here is that in the initial result set ID 10003 is represented by one row but indicates two changes, and in the final output those two changes are represented by two distinct rows. Obviously, I will bring in the previous and new values from a source.
Is this possible? I have tried using ROW_NUMBER over a partition but I'm not sure how to group it correctly; basically, every time there is a new 1 in new_commsstream within a personid, the row number should go up by one.
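Both pieces can be sketched, though every table and column name below is an assumption since the post does not show the source schema; the first snippet needs SQL Server 2008+ (CROSS APPLY with VALUES) and the second needs 2012+ (ordered windowed SUM).

-- Sketch 1: turn one comparison row into one output row per changed field.
SELECT c.PrimaryID, v.FieldChanged, v.PreviousValue, v.NewValue
FROM dbo.WeeklyCompare AS c
CROSS APPLY (VALUES
    ('SKUName',    c.PrevSKUName,    c.NewSKUName),
    ('LocationId', c.PrevLocationId, c.NewLocationId)
) AS v (FieldChanged, PreviousValue, NewValue)
WHERE v.PreviousValue <> v.NewValue;

-- Sketch 2: a counter that increases by one each time new_commsstream = 1 within a personid.
SELECT personid, commsdate, new_commsstream,
       SUM(CASE WHEN new_commsstream = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY personid ORDER BY commsdate
                 ROWS UNBOUNDED PRECEDING) AS stream_number
FROM dbo.Comms;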
I am new to Partitioning tables. My scenario is as listed below.
I get monthly transaction data on the first Monday of every month, and I want to partition that data.
For example: let's say I will get my next monthly data on August 3rd, 2015, which is the first Monday of August.
I want that transaction data to go into a new partitioned filegroup in my existing partitioned table. How can I partition for this kind of scenario? Can we create one or more stored procedures that will create the new partition and load data into it?
FYI, this monthly data will be loaded into a staging table, and that table has a LoadDate column which will contain 2015-08-03. I am using SQL 2012 Enterprise Edition.
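A hedged sketch of the monthly routine such a stored procedure could wrap; the partition function, scheme, filegroup, and table names are all assumptions. The staging table must live on the filegroup that will hold the new partition and needs a CHECK constraint on LoadDate matching the new boundary before the switch.

-- 1) point the scheme at the filegroup that should receive the new month
ALTER PARTITION SCHEME ps_Transactions NEXT USED [FG_201508];

-- 2) add the new boundary to the partition function
ALTER PARTITION FUNCTION pf_Transactions() SPLIT RANGE ('2015-08-03');

-- 3) switch the loaded staging table into the partition that now owns 2015-08-03
ALTER TABLE dbo.Staging_Transactions
    SWITCH TO dbo.Transactions PARTITION $PARTITION.pf_Transactions('2015-08-03');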
WITH summary AS (SELECT tu.SequenceNumber, tu.trialid, tu.SBOINumber, tu.DisplayFlag,
[Code] ....
I am having trouble with the ROW_NUMBER OVER (PARTITION BY ...) portion of the query. I would like the query to return only the first occurrence of each sboinumber in the table for each trial id, but it is only giving me the first occurrence of each sboinumber overall. I tried including the trialid in the PARTITION BY clause, but that is not working.
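A hedged sketch of the numbering; the source table name and the ordering column are assumptions taken from the aliases in the fragment above. Partitioning by both trialid and SBOINumber restarts the numbering for every combination, and the outer filter keeps only the first row of each group.

WITH summary AS (
    SELECT tu.SequenceNumber, tu.trialid, tu.SBOINumber, tu.DisplayFlag,
           ROW_NUMBER() OVER (PARTITION BY tu.trialid, tu.SBOINumber
                              ORDER BY tu.SequenceNumber) AS rn
    FROM dbo.TrialUnits AS tu
)
SELECT SequenceNumber, trialid, SBOINumber, DisplayFlag
FROM summary
WHERE rn = 1;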
I have a non-partitioned table (TableToPartition) and I want to apply an existing partition scheme (PartSch) to it using a query. I didn't find any option, so I used the Storage > Create Partition wizard to generate the script. Why is this clustered index magic needed if the index is dropped at the end? Isn't there another way, without indexing, to partition a table, say something with ALTER TABLE? (SQL Server 2012)
BEGIN TRANSACTION
CREATE CLUSTERED INDEX [ClusteredIndex_on_PartSch_635694324610495157]
    ON [dbo].[TableToPartition] ([ID])
    WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
    ON [PartSch]([ID])
DROP INDEX [ClusteredIndex_on_PartSch_635694324610495157] ON [dbo].[TableToPartition]
COMMIT TRANSACTION
Don't know if this is the right forum to be asking this, but I'll give it a try...
I'm relatively a beginner in SQL Server and T-SQL in general. The problem I'm trying to solve is the following:
The big picture is that I have data coming from different data sources which I need to store on a database for later reference. Each data source might have a different set of measurements. For example, data source 1 might log Pressure and Humidity while data source 2 logs Pressure and Temperature. Once the data is present on the DB, the users can go ahead and retrieve data for a given [datasource/measurement/time interval] to generate reports or charts.
My implementation so far consists of two tables: series_info and series_data. series_info holds general information for a given series of measurements for a given data source (Pressure for data source 1, Pressure for data source 2, Humidity for data source 1 and Temperature for data source 2, in our example). Each series has a bigint index as primary key.
The table series_data contains all data relative to the series from series_info. Each piece of data has a bigint as a primary key, an associated time (which is always increasing) and a foreign key to the series it represents (in series_info).
Alright, everything is cool so far. However, whenever a user wants to retrieve data for a given [data source/measurement/time interval], it takes very long, since data for all series is interleaved in series_data and every search first has to find where the desired data actually lies.
One obvious solution for this would be to dynamically create a new table to hold the data for each series, but that would just make my database disorganized, since there would be thousands and thousands of tables.
Another thing that comes to mind is to create a table recording where the data for a given [data source/measurement] lies for given dates. So when the user requests data for a given [data source/measurement] between, say, January and February, we would first look at this intermediate table and find out that the data lies between indexes 1000 and 2000 in the series_data table, so the next SELECT against series_data would already contain a restriction like WHERE index >= 1000 AND index <= 2000. This should probably improve retrieval speed.
What do you guys (or girls) think? Maybe there's simply a classical solution for such a case.
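One classical option, sketched with assumed column names since the actual series_data definition is not shown: make the series foreign key plus the timestamp the leading keys of an index (ideally the clustered index), so each [data source/measurement/time interval] request becomes a single range seek instead of a scan, without any intermediate lookup table.

CREATE NONCLUSTERED INDEX IX_series_data_series_time
    ON dbo.series_data (series_id, measure_time)
    INCLUDE (measure_value);

-- a typical retrieval is then a single range seek
SELECT measure_time, measure_value
FROM dbo.series_data
WHERE series_id = @series_id
  AND measure_time >= @from
  AND measure_time <  @to;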
Regarding SQL Server data, I am looking to implement the best data-archive and purge policy. Normally, we do SQL backups and keep the history for some period, for example 8 weeks, so we can go back and restore to any point in time up to 8 weeks in the past, and we also do tape backups.
The question is: where can I find a good article or documentation on how best to design such a policy, so that I am covered for point-in-time recovery of the database (the SQL backups) and for point-in-time recovery in the far past, say 3 years ago, using tape backups, without duplicating the same effort?
I have a stored procedure that attempts to INSERT @BatchSize number of records at a time into a table. Currently, I have @BatchSize set to load 50,000 at a time. The table I am inserting from has a little over 67,000 records.
When I execute the procedure with NOCOUNT left off, the procedure seems to run indefinitely, and the count of records returned surpasses what I have in the source table. However, only 50,000 records are inserted into the table.
Below is my code:
begin try
--error catching variables:
declare @Error_NumberLocal int
       ,@Error_MessageLocal varchar(4000)
       ,@Error_SeverityLocal int
       ,@Error_StateLocal int
       ,@Error_ProcedureLocal varchar(200)
       ,@Error_LineLocal int
       ,@User_NameLocal varchar(200)
[code]....
What is the problem with the looping structure that would cause this issue?
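Without the full loop it is hard to say for certain, but the usual cause of this symptom is an exit condition that never changes between iterations, so the same batch is selected (and counted) over and over. A minimal sketch of a batching pattern that is guaranteed to terminate, with all object names assumed:

declare @BatchSize int
set @BatchSize = 50000

while 1 = 1
begin
    insert into dbo.TargetTable (ID, Col1, Col2)
    select top (@BatchSize) s.ID, s.Col1, s.Col2
    from dbo.SourceTable as s
    where not exists (select 1 from dbo.TargetTable as t where t.ID = s.ID)
    order by s.ID

    -- stop once a pass inserts less than a full batch (the remainder is done)
    if @@ROWCOUNT < @BatchSize
        break
end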
Hello everybody. We need to move table T1 from database A to database B on the same server.
Table T1 is 15 GB and has 40,000,000 rows.
Database B was just created and will act as a warehouse.
Could it be done simply by the following (a sketch of steps 2-4 follows below)?
1. Create table T1 in database B.
2. Set database B to simple recovery.
3. INSERT INTO B.dbo.T1 SELECT * FROM A.dbo.T1
4. Create all the indexes on table T1 in database B.
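A sketch of those steps; the key column in the final index is an assumption. Depending on the SQL Server version, adding the TABLOCK hint on an insert into an empty heap under SIMPLE recovery can make the load minimally logged, which is why creating the indexes only after the load tends to be the fastest order.

ALTER DATABASE B SET RECOVERY SIMPLE;
GO
INSERT INTO B.dbo.T1 WITH (TABLOCK)
SELECT * FROM A.dbo.T1;
GO
-- hypothetical key column; create whatever indexes T1 really needs in database B here
CREATE CLUSTERED INDEX IX_T1_ID ON B.dbo.T1 (ID);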
I'm currently working with a 10 million plus row database with the data residing on a Unix box with Cache 5.0. The problem is that it can take five days to pull one table from Cache to SQL 2000 using the ODBC connection provided by Cache in a SQL 2000 DTS package. I think the real problem is converting the data from the post-relational format (Cache) to a relational format (SQL 2000)?
Does anyone have any ideas / suggestions on how to speed up this transfer of data? I'm very new to Cache and any help would be greatly appreciated.
Thanks, -p
For this scenario, what is the best method of exporting data to SQL 2005?
I want to export data from a desktop app across the internet to SQL Server. I can do this on a row-by-row basis, but it is very slow, and if the connection goes down halfway then I'm pretty much buggered.
What is the best, most reliable and fastest way to copy data across the internet (several thousand rows)? I have read about BULK INSERT etc., but also: how would I get around an upload that crashes halfway? Is there a way of uploading so that the data is only inserted into the database once the whole upload has gone through?
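One hedged approach, assuming the file can first be transferred (for example via FTP or an HTTP upload) to a location the server can read: BULK INSERT into a staging table, then publish to the final table in a single transaction, so a connection drop mid-upload never leaves half the rows visible. The path and table names below are assumptions.

BULK INSERT dbo.Upload_Staging
FROM 'C:\uploads\rows.dat'
WITH (BATCHSIZE = 10000, TABLOCK);

-- only once the staging load has completed successfully:
BEGIN TRANSACTION;
INSERT INTO dbo.FinalTable SELECT * FROM dbo.Upload_Staging;
TRUNCATE TABLE dbo.Upload_Staging;
COMMIT TRANSACTION;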
Does anyone know how to upload (bulk) data from a client (written in Excel VBA) to a remote SQL 2000 database? Of course I tried INSERT INTO and rst.AddNew, but I noticed this is much, much slower than downloading from the same remote database.
I have the DB structure below in MSSQL for a small application which follows a relational approach. Data retrieval (for Hostels) will need several joins; maybe a key-value approach, where data retrieval would be fast, would be better?
Recently I've been working on a new project that would partition a large table into 2 smaller tables. I then created a view to UNION the 2 smaller tables (tables A and B). I've been getting a strange error when I try to update, insert or delete a record through the view: "View needs partitioning column". I find this strange. Both of my tables have a clustered primary key consisting of 3 columns, and one of the 3 columns (a date field) has a check constraint. The constraint is used to determine which record goes into which table. Am I missing anything else? The really strange part is that sometimes it works, and sometimes I get the error message.
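One documented rule for modifying data through a partitioned view that commonly produces exactly this intermittent behaviour: every INSERT through the view must list all of the view's columns and must supply an explicit value for the partitioning column (the DEFAULT keyword is not allowed for it), and that column must be part of the primary key with its CHECK constraint enabled. A minimal sketch with assumed names:

-- EventDate is the CHECK-constrained partitioning column and part of the 3-column key
INSERT INTO dbo.PartitionedView (KeyCol1, KeyCol2, EventDate, OtherCol)
VALUES (1, 2, '20150728', 'x');   -- partitioning column stated explicitly, never DEFAULT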
Please see below a sample from BOL plus a sample of the execution plan.
I would like to ask how to stop the optimizer from scanning tables outside the scope of the query (I would expect the only table touched by this query to be SUPPLY1).
Thanks, Eyal
--This example uses tables named SUPPLY1, SUPPLY2, SUPPLY3, and SUPPLY4, which correspond to the supplier tables from four offices, located in different countries/regions.
USE tempdb
GO
--create the tables and insert the values
CREATE TABLE SUPPLY1 (
    supplyID INT PRIMARY KEY CHECK (supplyID BETWEEN 1 and 150),
    supplier CHAR(50)
)
CREATE TABLE SUPPLY2 (
    supplyID INT PRIMARY KEY CHECK (supplyID BETWEEN 151 and 300),
    supplier CHAR(50)
)
CREATE TABLE SUPPLY3 (
    supplyID INT PRIMARY KEY CHECK (supplyID BETWEEN 301 and 450),
    supplier CHAR(50)
)
CREATE TABLE SUPPLY4 (
    supplyID INT PRIMARY KEY CHECK (supplyID BETWEEN 451 and 600),
    supplier CHAR(50)
)
GO
--create the view that combines all supplier tables
CREATE VIEW all_supplier_view
AS
SELECT * FROM SUPPLY1
UNION ALL
SELECT * FROM SUPPLY2
UNION ALL
SELECT * FROM SUPPLY3
UNION ALL
SELECT * FROM SUPPLY4
GO
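A sample query against the view for comparison. With a literal value on the CHECK-constrained column the optimizer can eliminate the other tables at compile time; with a variable, all four tables usually still appear in the plan but are guarded by startup filters, so only SUPPLY1 is actually touched at run time.

SELECT supplyID, supplier
FROM all_supplier_view
WHERE supplyID = 100   -- falls inside SUPPLY1's range (1-150)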
I have partitions that I have filled with data. I am now trying to figure out exactly how much data the partitions contain, so I can see whether any of them are close to hitting their autogrow conditions. If I were looking at a single unpartitioned table, I could look at the table properties to determine the data and index sizes and compare those with the size of the mdf file, but for partitions I am not sure how to query this information. Any pointers on how this information could be queried out of the system?
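A sketch of such a query; the table name is a placeholder. sys.dm_db_partition_stats reports rows and reserved pages per partition, and sys.destination_data_spaces maps each partition number to the filegroup whose files (and autogrow settings) you then want to check.

SELECT p.partition_number,
       p.row_count,
       p.reserved_page_count * 8.0 / 1024 AS reserved_mb,
       fg.name AS filegroup_name
FROM sys.dm_db_partition_stats AS p
JOIN sys.indexes AS i
  ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.partition_schemes AS ps
  ON ps.data_space_id = i.data_space_id
JOIN sys.destination_data_spaces AS dds
  ON dds.partition_scheme_id = ps.data_space_id
 AND dds.destination_id = p.partition_number
JOIN sys.filegroups AS fg
  ON fg.data_space_id = dds.data_space_id
WHERE p.object_id = OBJECT_ID('dbo.YourPartitionedTable')  -- placeholder name
  AND i.index_id IN (0, 1)   -- heap or clustered index = the table's data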
I have SQL Server 7 running on a machine with two disk partitions (D: and E:).
The data files xxx.mdf and xxx.ldf are stored on D:, which has very little space available. I want to copy these files to E:, but I get an error saying that it is not possible to change the source file of a database. Is it possible to do this, or do I have to create another data file on E: and keep the old one on D:?
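On SQL Server 7.0 the usual way is to detach the database, copy the files with the operating system, and re-attach from the new location; the database name and paths below are assumptions.

EXEC sp_detach_db 'MyDb'
-- copy xxx.mdf and xxx.ldf from D: to E: using the operating system, then:
EXEC sp_attach_db @dbname = 'MyDb',
     @filename1 = 'E:\data\xxx.mdf',
     @filename2 = 'E:\data\xxx.ldf'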
I am trying to understand when I would do a vertical partition in a dimensional data warehouse. What are the things I need to consider before I make the decision?
Hi. A column with the TEXT datatype is not stored in the same data row anyway. I am wondering if there is any performance gain in putting it in a separate table. Thanks.
Does anyone have a helpful link for using the partition processing data flow task in SSIS? I am trying to process a monthly partition from within my package and am getting the following error:
Error: 0xC113000A Errors in the high-level relational engine. Pipeline processing can only reference a single table in the data source view.
If anyone has used this before and could point me in the right direction, I would appreciate it.
I'm using a Script Component to load data into an Oracle DB due to a poor performance issue. Now I have found that it misses some data during the transmission. Please see the screenshot below: