How To Put a Few Hundred Million Strings In Proper Case
Sep 16, 2015
First off, I know this is a presentation issue. Second, no, I can't force a change on my source systems.
Some of the systems that send my BI application data send it in all upper case, like "JOHN DOE". We have this horrible SQL function that goes through and makes sure that the first letter of each word is upper case and the rest of the letters are lower case, so my results come out as "John Doe".
As you can imagine this is dreadfully slow when executed a couple of hundred million times, but what are my options?
I have not used Data Quality Services yet, but the chart in BOL says a DQS SSIS cleansing task can do 1 million records in 2 hours on a given set of hardware. That is still pretty horrible.
I suppose I could cobble together a Script task in SSIS, but I am pretty sure clumsy .NET string handling is not going to be much faster.
CREATE FUNCTION [dbo].[udf_ProperCase](@UnCased varchar(max))
RETURNS varchar(max)
AS
BEGIN
    DECLARE @Reset bit = 1, @Ret varchar(max) = '', @i int = 1, @c char(1);
    -- walk the string: upper-case the first letter after any non-letter, lower-case the rest
    -- (loop body follows the same pattern as the f_ProperCase function further down)
    WHILE @i <= LEN(@UnCased)
        SELECT @c = SUBSTRING(@UnCased, @i, 1),
               @Ret = @Ret + CASE WHEN @Reset = 1 THEN UPPER(@c) ELSE LOWER(@c) END,
               @Reset = CASE WHEN @c LIKE '[a-zA-Z]' THEN 0 ELSE 1 END, @i = @i + 1;
    RETURN @Ret;
END
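One way to keep this cost out of query time is to pay it once during the load and persist the result. A rough sketch, where the staging table and column names are only placeholders, that proper-cases only rows that are still entirely upper case:

UPDATE s
SET    FullName = dbo.udf_ProperCase(FullName)
FROM   dbo.Stage_Person AS s                                    -- hypothetical staging table
WHERE  FullName COLLATE Latin1_General_CS_AS = UPPER(FullName); -- only touch rows that are still all caps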
How would I convert an expression like one of these to all upper-case first letters with the remaining letters lower case? VB has a function for that, but SQL doesn't seem to. I thought about having a loop go through each character to check for spaces. I've written a couple of similar pieces of code in VB a while ago, but is there a better way? Thanks :)
Just a couple of typical examples of how the data should appear ~
I'd like to know how I can adapt this function so it will convert a Scottish/Irish surname (McDonald or O'Shea) when there is only the surname in the column.
This is what I'd been using for multiple words (Ronald McDonald), but it won't work on just "Mcdonald" by itself. I'm sure it's just a simple tweak, but it all looks Punjabi to me!
Thanks in advance!!
CREATE FUNCTION [dbo].[f_ProperCase](@Text as varchar(512))
RETURNS varchar(512)
AS
BEGIN
    DECLARE @Reset bit, @Ret varchar(512), @i int, @c char(1);
    SELECT @Reset = 1, @i = 1, @Ret = '';
    WHILE @i <= LEN(@Text)
        SELECT @c = SUBSTRING(@Text, @i, 1),
               @Ret = @Ret + CASE WHEN @Reset = 1 THEN UPPER(@c) ELSE LOWER(@c) END,
               @Reset = CASE WHEN CASE WHEN SUBSTRING(@Text, @i - 4, 5) like '_[a-z] [DOL]''' THEN 1
                                       WHEN SUBSTRING(@Text, @i - 4, 5) like '_[a-z] [D][I]' THEN 1
                                       WHEN SUBSTRING(@Text, @i - 4, 5) like '_[a-z] [M][C]' THEN 1
                                       ELSE 0 END = 1 THEN 1
                             ELSE CASE WHEN @c like '[a-zA-Z]' or @c in ('''') THEN 0 ELSE 1 END
                        END,
               @i = @i + 1;
    RETURN @Ret;
END

-- Test:
-- SELECT dbo.f_ProperCase('it''s crazy! i couldn''t believe kate mcdonald, leo dicaprio, (terrence) trent d''arby (circa the 80''s), and jada pinkett-smith all showed up to [cHris o''donnell''s] party...donning l''oreal lIpstick! They''re heading to o''neil''s pub later on t''nite. the_underscore_test. the-hyphen-test.')
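For the bare-surname case: the Mc/O'/Di look-behind patterns above need two characters and a space in front of the prefix, so a lone "mcdonald" or "o'shea" never matches. One possible tweak (an untested sketch, not a drop-in fix) is to pad the input with a dummy word before calling the function, then strip the padding back off:

CREATE FUNCTION [dbo].[f_ProperCaseSurname](@Text varchar(509))
RETURNS varchar(512)
AS
BEGIN
    -- 'xx mcdonald' now satisfies the '_[a-z] [M][C]' pattern, so the D gets upper-cased;
    -- the three padding characters are then trimmed off the front of the result
    RETURN SUBSTRING(dbo.f_ProperCase('xx ' + @Text), 4, 512);
END
GO
-- SELECT dbo.f_ProperCaseSurname('mcdonald');  -- McDonald
-- SELECT dbo.f_ProperCaseSurname('o''shea');   -- O'Shea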
I have a requirement to delete 1 million records from a table holding 10 million rows, and the table is queried 24/7 (there is no downtime window). How can I achieve that?
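A common low-impact pattern is to delete in small batches so each transaction is short and its locks stay narrow. A sketch, where the table name and the predicate that identifies the 1 million rows are placeholders:

DECLARE @batch int = 5000;
WHILE 1 = 1
BEGIN
    DELETE TOP (@batch)
    FROM   dbo.BigTable              -- placeholder table name
    WHERE  ArchiveFlag = 1;          -- placeholder predicate for the rows to remove
    IF @@ROWCOUNT < @batch BREAK;    -- last (possibly partial) batch done
END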
Hi all, I have a table with approx 75 million rows of names and addresses in it that I am trying to update... so far the update has been running 5 hours with no end in sight. A little background: this is running on a quad Xeon 500 with 3 GB RAM and one 145 GB drive (boooo). Without improving the hardware, can I improve the performance? I have indexed all the WHERE fields that I read on, and I only update the table once or twice a month, but I do daily selects by zip or county (all indexed); I even have a composite key on phone and zip.
I have heard of horizontal partitioning, but I always thought that was reserved for archiving old transactional data that rarely gets read.
When I performed a trace there are plenty of reads but no writes... is this normal during an update like this?
I have been running this proc for the past 7 HOURS!!!... any help is appreciated, since all I have is time at this point.
THANKS!!!!
--Set rowcount to 100000 to limit the number of updates
--performed in each batch to 100K rows.
Set rowcount 100000

--Declare variable for row count
Declare @rc int
Set @rc=100000

While @rc=100000
Begin
    Begin Transaction

    --Use tablockx and holdlock to obtain and hold
    --an immediate exclusive table lock. This usually
    --speeds the update because only one lock is needed.
    --NOTE: without a WHERE clause that skips already-updated rows
    --(e.g. WHERE [source] IS NULL), each pass re-touches the same
    --100K rows and the loop never finishes.
    Update [2000] With (tablockx, holdlock)
    set [source] = '2000'

    --Get number of rows updated.
    --Process will continue until fewer than 100000 rows are affected.
    Select @rc=@@rowcount

    --Commit the transaction
    Commit
End
I'm new to using a DB and have a few questions about what I'm trying to do. I have some historical options data and want to place it into a SQL Express database. (I understand I might need to use a non-Express version once the DB gets too big.) A month's worth of data is over 5.5 million rows, so six years' worth is ~400 million rows. Is it possible to put this into a SQL DB and be able to search it very fast? I have a month's worth in a DB now and it is pretty slow. Should I use a new table for each month, and then have 6 years * 12 months = 72 tables, to increase the search speed? I search by date and stock_symbol, and the data looks like this: Date, Stock_Symbol, Option_Symbol, Strike, BidPrice, AskPrice, Volume, OpenInterest, (and a few others). The select statement is simple: SELECT * FROM Options WHERE Date = @Date and StockSymbol = @Symbol. Thanks
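A single table whose clustered index matches that predicate usually scales better than 72 monthly tables. A sketch, assuming the Options table has no clustered index or primary key already claiming that role:

CREATE CLUSTERED INDEX IX_Options_Date_Symbol
    ON dbo.Options ([Date], StockSymbol);
-- SELECT * FROM Options WHERE [Date] = @Date AND StockSymbol = @Symbol
-- can then seek straight to the matching range instead of scanning 400 million rows.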
I am currently working on a simple page to insert 1.6 million UK postcode records into a SQL Server table. The table has three columns for the postcode, longitude coordinate and latitude coordinate. The data is sourced from a pipe (|) delimited txt file and inserted into the database using a FOR loop. The problem I have is that the page will hang after inserting only 10,000 records; the page displays either an invalid ViewState error or a page cannot be found error. Now I assume the ViewState error stems from the fact that there is a form on the page which simply contains a button to execute the script and a few labels to show the progress, but without the form and associated ViewState the insert still fails to complete... any ideas?? Would I be better running this on a thread, or should I just do it in stages and be patient? I have now modified the page to read the database on load and pick up from where it crashes.
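Row-by-row inserts from a web page are the slow path here; since the source is already a pipe-delimited text file, a single server-side bulk load is usually far quicker and sidesteps the page timeouts entirely. A sketch, with a placeholder file path and table name:

BULK INSERT dbo.Postcodes            -- placeholder target table
FROM 'C:\data\postcodes.txt'         -- placeholder path, as seen from the SQL Server machine
WITH (FIELDTERMINATOR = '|',
      ROWTERMINATOR   = '\n',
      TABLOCK,
      BATCHSIZE = 100000);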
I have a table that has 4+ million records. I need to update those records, and I am facing a performance issue. Can someone please advise?
update stage
set    batch_status = 1
where  update_status = 0

Update [transaction]
Set    aId = s.aId,
       b   = s.b
from   stage s
Where  s.aId = [transaction].aId
  and  s.batch_status = 1

Update stage
Set    update_status = 1,
       batch_status  = 2
where  batch_status = 1
When I run the above query with "set rowcount 1000", it runs in one minute. When I run it with "set rowcount 10000", it runs in 1 hour 56 minutes. Can someone help me optimize it?
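The jump from one minute at 1,000 rows to nearly two hours at 10,000 usually points at the join and filter columns being unindexed, so every batch re-scans both tables. A sketch of indexes that would let all three statements seek instead (column names taken from the statements above; verify against the real schema):

CREATE INDEX IX_stage_update_status ON stage (update_status);
CREATE INDEX IX_stage_batch_status  ON stage (batch_status) INCLUDE (aId, b);
CREATE INDEX IX_transaction_aId     ON [transaction] (aId);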
Hey folks... So I have a table that looks like this:

CREATE TABLE [tblStation] (
    [CAMPAIGN]  [varchar] (8),
    [LISTNUM]   [varchar] (10),
    [PHONE]     [varchar] (10),
    [EVENTTIME] [datetime],
    [STATION]   [int],
    [OPERATOR]  [varchar] (16),
    [EVENTCODE] [varchar],
    [CALLSPAN]  [decimal](18, 0),
    [FDISP]     [int],
    [RECORDNUM] [varchar],
    [STC]       [varchar],
    [PROMOC]    [varchar],
    [EXP_CAMP]  [varchar],
    [PROMO3]    [varchar],
    [MAXATT]    [char],
    [LISTNAME]  [varchar],
    [SITENAME]  [char],
    [Row_id]    [int] IDENTITY
)

It's taking nine seconds to run the following command:

SELECT count([fdisp])
FROM [TrunkFiles_new].[dbo].[tblStation] WITH (NOLOCK)
WHERE fdisp IS NULL

Anyone familiar with a table of this size having performance like this? The [fdisp] column has a non clustered index on it. Thanks in advance...
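Two things worth checking, as a sketch rather than a definitive diagnosis: COUNT([fdisp]) only counts non-NULL values, so with a WHERE fdisp IS NULL filter it always returns 0 and COUNT(*) is presumably what was meant; and on SQL Server 2008 or later a filtered index keeps counting the NULL rows cheap:

CREATE NONCLUSTERED INDEX IX_tblStation_fdisp_null
    ON dbo.tblStation (fdisp)
    WHERE fdisp IS NULL;        -- filtered index: only the NULL rows are stored

SELECT COUNT(*)
FROM   [TrunkFiles_new].[dbo].[tblStation] WITH (NOLOCK)
WHERE  fdisp IS NULL;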
Hi all - I have posted inquiries on this rather vexing issue before, so I apologize in advance for revisiting this. I am trying to create the code to add the parameters for two CheckBoxLists together. One CheckBoxList allows users to choose a group of Customers by Area Code, the other "CBL" allows users to select Customers by a type of Category that these Customers are grouped into. When a user selects Customers via one or the other CBL, I have no problems. If, however, the user wants to get all the Customers from one or more Area Codes who ALSO may or may not be members of one or more Categories, I have had trouble trying to create the proper SQL. What I have so far:

Protected Sub btn_CustomerSearchCombined_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles btn_CustomerSearchCombined.Click
    Dim CSC_SqlString As String = "SELECT Customers.CustomerID, Customers.CustomerName, Customers.CategoryID, Customers.EstHours, Customers.Locality, Category.Category FROM Customers INNER JOIN Category ON Customers.CategoryID = Category.CategoryID WHERE "
    Dim ACItem As ListItem
    Dim CATItem As ListItem
    For Each ACItem In cbl_CustomersearchAREA.Items
        If ACItem.Selected Then
            CSC_SqlString &= "Customers.AreaCodeID = '" & ACItem.Value & "' OR "
        End If
    Next
    CSC_SqlString &= "' AND "   ' <-- this is the heart of my problem, I believe
    For Each CATItem In cbl_CustomersearchCAT.Items
        If CATItem.Selected Then
            CSC_SqlString &= "Customers.CategoryID = '" & CATItem.Value & "' OR "
        End If
    Next
    CSC_SqlString = Left(CSC_SqlString, Len(CSC_SqlString) - 4)
    CSC_SqlString &= "ORDER By Categories.Category"
    sql_CustomersearchGrid.SelectCommand = CSC_SqlString
End Sub

Any help on this is much appreciated, many thanks --
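For what it's worth, the SQL this code presumably needs to end up with wraps each OR group in parentheses and drops the stray quote, something like the following shape (the values are illustrative). Building it with parameters or IN lists instead of string concatenation would also avoid the SQL-injection risk here:

SELECT Customers.CustomerID, Customers.CustomerName, Customers.CategoryID,
       Customers.EstHours, Customers.Locality, Category.Category
FROM   Customers
       INNER JOIN Category ON Customers.CategoryID = Category.CategoryID
WHERE  (Customers.AreaCodeID = '201' OR Customers.AreaCodeID = '202')
  AND  (Customers.CategoryID = '3'   OR Customers.CategoryID = '7')
ORDER BY Category.Category;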
I have a T-SQL script where I need to do a PATINDEX on a variable and check whether a record exists to meet the WHERE clause for the IF statement below. What am I doing wrong?
declare @l_orderid int
set @l_orderid = 18
declare @l_SIGShort varchar(20)
set @l_SIGShort = '~KOP~'   -- KOP Orders
print patindex('%~KOP~%', @l_SIGShort)
if patindex('%~KOP~%', @l_SIGShort) <> 0 and (if exists (select * from orderoptions where orderid = @l_orderid and ordertype = 'KOP'))
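The problem is most likely the nested "if": an IF takes a single boolean expression, so EXISTS joins the PATINDEX test with AND directly and gets no "if" or extra parentheses of its own. A corrected sketch (the PRINT is just a placeholder for whatever should happen when both conditions hold):

IF PATINDEX('%~KOP~%', @l_SIGShort) <> 0
   AND EXISTS (SELECT * FROM orderoptions
               WHERE orderid = @l_orderid AND ordertype = 'KOP')
BEGIN
    PRINT 'KOP order';   -- placeholder action
END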
How well can SQL Server support 300 million records? Is anybody working on a big database like this? Can anyone give me some input on this? It's going to be about 60 GB in size.
In our database, we have a very large table that gets updated every morning. The start-of-day step copies 4 million rows of the fact table from the previous date to today's date in the same table, and then does some other processing. It takes 1 1/2 to 2 hrs to do this. There is a DTS package created to copy these rows into a temp table and then into this fact table.
This table has more than 200 million rows.
Any ideas on how to accomplish this without doing the copy twice and without running into locking problems?
I have a directory database with approx. 80 million records. I am feeding the database with BULK INSERT. Indexing one of the fields took about 8 hrs. After indexing, when I run queries on the indexed field the response time is under 1 sec. However, if I run SELECT queries with LIKE on non-indexed fields it takes more than 2 mins. So I decided to index 4 other fields in the database, and it looks like the indexing process is going to run for 2 days. I am a novice in SQL database design and I am not sure if this is the best way to index the table; I am just using CREATE INDEX. Any suggestions / advice welcome.
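Two hedged notes: plain CREATE INDEX is the right tool, but options like SORT_IN_TEMPDB (and ONLINE = ON on Enterprise edition) can make large builds less painful; and an index only helps a LIKE whose pattern has no leading wildcard, so '%term%' searches will stay slow regardless and may call for full-text indexing instead. A sketch with placeholder names:

CREATE INDEX IX_Directory_LastName
    ON dbo.Directory (LastName)        -- placeholder table/column
    WITH (SORT_IN_TEMPDB = ON);

-- Can use the index (seek on a prefix):
--   SELECT * FROM dbo.Directory WHERE LastName LIKE 'Smi%';
-- Cannot use the index (leading wildcard forces a scan):
--   SELECT * FROM dbo.Directory WHERE LastName LIKE '%mith%';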
Hello, what is the fastest way to update 20 million records in our database? I have tried to do a simple update statement like this:

update trail_log with (tablockx, holdlock)
set trail_log.entry_by = users.user_identity
from users
where trail_log.entry_by = users.user_id

but it takes 10-plus hours to run, since it cannot commit the transactions until the very end. So I was thinking that I need to commit in batches, like after every 50K, but that is slow as well:

Set rowcount 50000
Declare @rc int
Set @rc=50000
While @rc=50000
Begin
    Begin Transaction
    update trail_log With (tablockx, holdlock)
    set trail_log.entry_by = users.user_identity
    from users
    where trail_log.entry_by = users.user_id
    and trail_log.entry_by not like '%[0-9]%'
    Select @rc=@@rowcount
    --Commit the transaction
    Commit
End
go

I have let the above statement run for 1.5 hours and it only updated 450,000 rows. Any ideas... maybe I'm doing it wrong. Please help!!
Hello, we maintain a 175 million record database table for our customer. This is an extract of some data collected for them by a third party vendor, who sends us regular updates to that data (monthly). The original data for the table came in the form of a single, large text file, which we imported. This table contains name and address information on potential customers.

It is a maintenance nightmare for us, as prior to this the largest table we maintained was about 10 million records, with less complicated updates required. Here is the problem:

* In order to do the searching we need to do on the table, it has 8 of its 20 columns indexed.
* It takes hours and hours to do anything to the table.
* I'd like to cut down as much as possible the time required to update the file.

We receive monthly one file containing 10 million records that are new, and can just be appended to the table (no problem, simple import into SQL Server). We also receive monthly one file containing 10 million records that are updates of information in the table. This is the tricky one. The only way to uniquely pair up a record in the update file with a record in the full database table is by a combination of individual_id, zip, and zip_plus4. There can be multiple records in the database for any given individual, because that individual could have a history that includes multiple addresses.

How would you recommend handling this update? So far I have mostly tried a number of execution plans involving deleting out the records in the table that match those in the text file, so I can then import the text file, but the best of those plans takes well over 6 hours to run.

My latest thought: would it help in any way to partition the table into a number of smaller tables, with a view used to reference them? We have no performance issues querying the table, but I need some thoughts on how to better maintain it.

One more thing, we do have 2 copies of the table on the server at all times so that one can be actively used in production while we run updates on the other one, so I can certainly try out some suggestions over the next week.

Regards,
Warren Wright
Dallas
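One approach worth testing (a sketch with assumed table and column names, not a drop-in answer): bulk-load the monthly update file into a staging table, index it on the same composite key, and run a joined UPDATE against the production copy instead of deleting and re-importing 10 million rows:

CREATE TABLE dbo.Stage_Updates (
    individual_id int,
    zip           char(5),
    zip_plus4     char(4),
    address_line1 varchar(100)      -- plus whatever other columns the vendor refreshes
);
-- BULK INSERT dbo.Stage_Updates FROM '<update file>' WITH (TABLOCK);
CREATE CLUSTERED INDEX IX_Stage_Key
    ON dbo.Stage_Updates (individual_id, zip, zip_plus4);

UPDATE t
SET    t.address_line1 = s.address_line1      -- repeat for each refreshed column
FROM   dbo.Prospects AS t                     -- placeholder name for the 175M-row table
JOIN   dbo.Stage_Updates AS s
       ON  s.individual_id = t.individual_id
       AND s.zip       = t.zip
       AND s.zip_plus4 = t.zip_plus4;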
I have a sql script that updates records in a table with 40 million records.
There is some functionality in the script that could be factored out into functions for code reuse/elegance.
Functions, though, would add execution overhead.
What else could I use besides functions that would give me the code reuse without compromising on execution overhead? Is there anything like includes in T-SQL that would let me do that?
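There is no include/macro facility in T-SQL, but inline table-valued functions used with CROSS APPLY are the usual compromise: they read like a reusable function, yet the optimizer expands them into the calling statement, so they avoid the per-row cost of scalar UDFs. A hypothetical sketch (names are illustrative):

CREATE FUNCTION dbo.itvf_CleanName (@First varchar(50), @Last varchar(50))
RETURNS TABLE
AS
RETURN (SELECT LTRIM(RTRIM(@First)) + ' ' + LTRIM(RTRIM(@Last)) AS FullName);
GO

UPDATE p
SET    p.DisplayName = f.FullName
FROM   dbo.People AS p                                    -- placeholder table
CROSS APPLY dbo.itvf_CleanName(p.FirstName, p.LastName) AS f;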
Greetings, I seem to be having a problem installing SQL 7.0 over the SQL 7.0 beta. It tells me that there are ODBC components that need to be upgraded and that they are read-only... I can find no way of changing this.
Alternatively, if I remove the beta and then install the proper version, will all the old databases created under the beta still be recognised?
Kris Klasen
Act. Manager, Data Warehouse Project Information Management Branch Department of Education
I have been told that simply stopping the SQL Server service and backing up the data directory is all I have to do to back up my data. Is this accurate?
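Copying the MDF/LDF files with the service stopped does give you a restorable copy, but it means downtime and only captures that single moment; the usual approach is a native online backup, along the lines of this sketch (database name and path are placeholders):

BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backups\MyDatabase_full.bak'
WITH INIT, CHECKSUM;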
Hello: a nice simple question (I hope). Is there an MS SQL equivalent to PROPER(string), which would return "Fred Bloggs" from "FRED BLOGGS" and equally from "FrEd bLoggs"? I can't find one...
I am no stranger to databases; I worked a lot with MySQL but never really cared about proper DB design as long as it worked. Now I am playing with SQL in an ASP.NET project and want to get things done the right way. Let's say I have a Movies database. My movies can have multiple genres, so I set my tables up like this:
[Movies] MovieID MovieName MovieRelease
[code]....
Is this the proper way of doing things? The problem with this is that when I want to enter a record manually I have to know the ID of the movie and the IDs of the genres of the movie. And what about naming conventions? By default the identifier is always Id; from my MySQL experience I liked naming it after the table, and the same goes for the other columns. This is my T-SQL code for the above tables in VS 2013.
CREATE TABLE [dbo].[Movies] (
    [MovieID]      INT          IDENTITY (1, 1) NOT NULL,
    [MovieName]    VARCHAR (50) NOT NULL,
    [MovieRelease] NUMERIC (18) NOT NULL,
    CONSTRAINT [PK_Movies] PRIMARY KEY CLUSTERED ([MovieID] ASC)
);
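Yes, a many-to-many relationship is normally modelled with a junction table rather than by repeating genre columns on the movie. Since the [code].... block above is elided, here is a sketch of what the Genres side typically looks like (names are illustrative); the UI then inserts by looking up the IDs, so you never have to know them by heart:

CREATE TABLE [dbo].[Genres] (
    [GenreID]   INT          IDENTITY (1, 1) NOT NULL,
    [GenreName] VARCHAR (50) NOT NULL,
    CONSTRAINT [PK_Genres] PRIMARY KEY CLUSTERED ([GenreID] ASC)
);

CREATE TABLE [dbo].[MovieGenres] (
    [MovieID] INT NOT NULL,
    [GenreID] INT NOT NULL,
    CONSTRAINT [PK_MovieGenres] PRIMARY KEY CLUSTERED ([MovieID], [GenreID]),
    CONSTRAINT [FK_MovieGenres_Movies] FOREIGN KEY ([MovieID]) REFERENCES [dbo].[Movies] ([MovieID]),
    CONSTRAINT [FK_MovieGenres_Genres] FOREIGN KEY ([GenreID]) REFERENCES [dbo].[Genres] ([GenreID])
);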
I read some questions where questioners ask "Sometimes client gives data where dates are expressed as float or integer values. How do I find maximum date?".
For example, March 02, 2006 can be expressed as 02032006.0, 020306, 2032006, 20306, or 020306.0000, assuming the values are expressed in dmy format.
The possible way is to convert that value into a proper date so that all types of date-related calculations can be done:

Create function proper_date (@date_val varchar(25))
returns datetime
as
Begin
    Select @date_val = case when @date_val like '%.0%'
                            then substring(@date_val, 1, charindex('.', @date_val) - 1)
                            else @date_val end
    return cast(
        case when @date_val like '%[a-zA-Z-/]%' then
                 case when ISDATE(@date_val) = 1 then @date_val else NULL end
             when len(@date_val) = 8 then right(@date_val, 4) + '-' + substring(@date_val, 3, 2) + '-' + left(@date_val, 2)
             when len(@date_val) = 7 then right(@date_val, 4) + '-' + substring(@date_val, 2, 2) + '-0' + left(@date_val, 1)
             when len(@date_val) = 6 then case when right(@date_val, 2) < 50 then '20' else '19' end
                                          + right(@date_val, 2) + '-' + substring(@date_val, 3, 2) + '-' + left(@date_val, 2)
             when len(@date_val) = 5 then case when right(@date_val, 2) < 50 then '20' else '19' end
                                          + right(@date_val, 2) + '-' + substring(@date_val, 2, 2) + '-0' + left(@date_val, 1)
             else case when ISDATE(@date_val) = 1 then @date_val else NULL end
        end as datetime)
End
This function will convert them into a proper date:

select dbo.proper_date('02032006.0') as proper_date,
       dbo.proper_date('020306.000') as proper_date,
       dbo.proper_date('02032006')   as proper_date,
       dbo.proper_date('020306')     as proper_date,
       dbo.proper_date('20306')      as proper_date,
       dbo.proper_date('020306')     as proper_date

Apart from converting integer or float values to dates, it will also convert date strings to dates:

Select dbo.proper_date('March 2, 2006') as proper_date,
       dbo.proper_date('2 Mar, 2006')   as proper_date,
       dbo.proper_date('2006 Mar 2')    as proper_date,
       dbo.proper_date('2-Mar-2006')    as proper_date,
       dbo.proper_date('3/02/2006')     as proper_date,
       dbo.proper_date('02-03-2006')    as proper_date,
       dbo.proper_date('2006/03/02')    as proper_date,
       dbo.proper_date('March 2006')    as proper_date,
       dbo.proper_date('2 Mar 2006')    as proper_date
What is the proper way to return the identity of a newly inserted row from a stored procedure? Using a return value or a select statement? (I guess an output parameter should also be considered...) As in
RETURN SCOPE_IDENTITY()
or
SELECT SCOPE_IDENTITY()
What are the pros/cons of using one approach over the other?
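For comparison, a brief sketch of the third option (the procedure and table names are hypothetical). An OUTPUT parameter keeps the return code free for status/error values and avoids shipping a one-row result set back to the client, which is why it is often preferred for this:

CREATE PROCEDURE dbo.usp_InsertCustomer
    @Name          varchar(50),
    @NewCustomerID int OUTPUT
AS
BEGIN
    INSERT INTO dbo.Customers (CustomerName) VALUES (@Name);
    SET @NewCustomerID = SCOPE_IDENTITY();
END
GO

-- Caller:
DECLARE @id int;
EXEC dbo.usp_InsertCustomer @Name = 'Fred Bloggs', @NewCustomerID = @id OUTPUT;
SELECT @id;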
I have two XML queries that take a long time: the 1st query takes about 5 minutes (returns 700 rows) and the 2nd query takes about 10 minutes (returns 4 rows). The total number of rows in the table is about 2 million. There are three secondary XML indexes (Property, Value and Path) in addition to the clustered index on CardId and the primary XML index. Here is the table definition:
CREATE TABLE [dbo].[Cards] (
    [CardId] [int] NOT NULL,
    [Card]   [xml] NOT NULL,
    CONSTRAINT [PK_dbo_Cards_CardId] PRIMARY KEY CLUSTERED ([CardId] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
[code]...
Looking at the execution plan, the query uses the primary XML index even if I add any of the secondary XML indexes. My question is: why doesn't the optimizer use the Property secondary index instead of the primary XML index? Microsoft recommends creating a Property index for the value() method of the xml datatype as a way to get a performance benefit. What would be another alternative to make the query run faster?
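Without seeing the actual queries this is only a guess, but the secondary XML indexes are far more likely to be chosen when the filter is expressed with exist() in the WHERE clause rather than by comparing value() results, and promoting the hot values into ordinary indexed columns (computed columns that wrap the XML access in a UDF) is the other common escape hatch. A sketch of the exist() shape, with placeholder XQuery paths:

SELECT CardId, Card
FROM   dbo.Cards
WHERE  Card.exist('/Card/Property[@Name = "Owner" and @Value = "Smith"]') = 1;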