Is It Possible To Perform Terms Lookup On Unstructured Files ?
Feb 5, 2007
Hi,
I need to categorize a lot of HTML or text files according to a list of terms, and I wonder whether Term Lookup is adequate for this. The problem is that Term Lookup can only take an OLE DB source as input. My files can be up to 80 KB each and aren't column-structured.
Should I import my files into a table? If so, how can I import a column with more than 8,000 characters?
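One way around the 8,000-character limit (a rough sketch only, not a definitive answer) would be a staging table with a VARCHAR(MAX) column, loading one whole file per row with OPENROWSET(BULK ... SINGLE_CLOB); the table, path and file names below are made up for illustration:

CREATE TABLE dbo.DocumentStage
(
    DocumentId   INT IDENTITY(1,1) PRIMARY KEY,
    FileName     NVARCHAR(260) NOT NULL,
    DocumentText VARCHAR(MAX)  NOT NULL   -- no 8,000-character ceiling
)

-- Load one file into one row; repeat (or drive from a Foreach Loop) per file.
INSERT INTO dbo.DocumentStage (FileName, DocumentText)
SELECT 'sample01.html', BulkColumn
FROM OPENROWSET(BULK 'C:\Docs\sample01.html', SINGLE_CLOB) AS f

The staging table could then serve as the OLE DB source that Term Lookup expects (a Data Conversion to DT_NTEXT may still be needed, since Term Lookup works on Unicode text).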
I am designing an SSIS package intended to mine text data (data extracted from websites). Term Lookup and Term Extraction are being used as the mining tools. I have lookup terms defined in a reference table, but the main problem lies in extracting the nearby text/numbers/characters around these lookup terms during mining. For example, I found the noun "Email" 200 times (frequency score) in my text; now I want to extract the nearby email address (this is also true for the PhoneNumber and Address attributes). So how can I achieve this with SSIS? If you have any ideas or suggestions for carrying out this challenge, with or without Term Extraction/Term Lookup, please write here.
Is it possible for the Term Lookup function to manage the differences between US and British English spelling? For example, if I search for the terms "color" and "categorization", I would like Term Lookup to also count the "colour" and "categorisation" occurrences in the text.
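One possible workaround (an assumption on my part, not a built-in Term Lookup feature) is to keep both spellings in the reference table and map each variant to a canonical term, then aggregate the counts afterwards; the TermVariant and TermLookupResult names below are hypothetical:

CREATE TABLE dbo.TermVariant
(
    Variant       NVARCHAR(100) NOT NULL PRIMARY KEY,
    CanonicalTerm NVARCHAR(100) NOT NULL
)

INSERT INTO dbo.TermVariant (Variant, CanonicalTerm)
SELECT 'color',          'color'          UNION ALL
SELECT 'colour',         'color'          UNION ALL
SELECT 'categorization', 'categorization' UNION ALL
SELECT 'categorisation', 'categorization'

-- After Term Lookup writes its per-variant counts to a results table,
-- roll them up by canonical term:
SELECT v.CanonicalTerm, SUM(r.Frequency) AS Frequency
FROM   dbo.TermLookupResult r
JOIN   dbo.TermVariant      v ON v.Variant = r.Term
GROUP BY v.CanonicalTerm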
With PowerPivot I want to show which stage is in progress for each project. I looked at RELATEDTABLE and LOOKUPVALUE, but I can't find a way to combine them to get this working. I did, however, get the one telling me which projects are Completed.
Our ETL process involves some pre-load validation, and I'm wondering how best to implement it in SSIS.
Some details on my situation: I need to import 30 flat files with different data formats into 30 destination tables. In addition, these files share a common header and footer row format, and I need to validate these headers and footers before using the imported data downstream. (For example, the footer contains a record count, and fields in the header and footer should match some user variables.) My first approach was to write a Perl script that splits each file into three (header, data, and footer), but while that makes it easy to import the data section, it's more complicated to validate the header and footer and work them into the control flow. I think I'd also have to copy the same logic for all 30 data flows, which is less than ideal.
It looks like implementing this logic directly in SSIS is a little ugly (though that could be my lack of experience speaking). As I thought about this some more, I came up with a couple other solutions -- any critiques or comments?
1) Write a custom source adapter (which will probably contain the default flat file adapter) that knows how to validate my header and footer. I'd be able to read the file formats from an XML file, which might make my scripts more generic, and I might even be able to handle some custom data conversions more elegantly than I'm doing right now. (These files represent null numerics as whitespace rather than an empty field.)
2) Beef up the Perl splitter to validate the header and footer. If the cleanest approach is to say "assume that SSIS is only loading pre-validated data", this makes the problem entirely external.
Or am I entirely missing the mark here? Any thoughts?
I have a small number of rows in a dataset, Table 1. There is a CLOB on a large dataset, Table 2. They join on a PK. I would like to retrieve this CLOB and add it to the data flow for Table1. In short I want to emulate the following:
Table 1: Small table without CLOB, 10 rows. Table 2: Large table with CLOB, 10,000,000 rows
select CLOB from table2 where pk in (select pk from table1)
I want this to return the CLOBs for the small number of rows in Table 1. The PK is indexed, obviously, so it should be a fast lookup.
Table 1 and Table 2 live on different Oracle databases. How do I perform this operation efficiently in SSIS? It seems the Lookup and Merge Join won't do this.
I have an unstructured SQL function which takes around two hours to return a table with just nine hundred rows. I have deleted some text from the code because it was more than the limit of this website. How can I structure or optimize the function below to improve its performance?
I have a CSV file with roughly 6 million rows. The file is unstructured; that is, some rows have 5 fields, others have 15, and there are as many as 50 fields in one row.
I am using BULK INSERT to read the entire file into a table in the database, with each row becoming a database record. With that, I have one column that contains a whole row of comma-delimited fields. All fields are character strings, and I want to find a quick way of parsing each row and placing each comma-delimited value in its own column. For example:
Column CSVString contains the CSV row. I don't know how many fields (number of commas + 1) are in the row, but if the row contains 10 fields, I need to populate columns C1-C10; if the row has 15 fields, I populate columns C1-C15.
How can I do this in a very efficient way? I tried a CTE, but performance was not very good.
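A rough set-based sketch of one possible approach, assuming the bulk-inserted rows sit in a staging table like the hypothetical #Stage below: a tally (numbers) table finds the commas, ROW_NUMBER() numbers the fields in order, and a CASE pivot spreads them into C1, C2, ... columns.

CREATE TABLE #Stage (RowID INT IDENTITY(1,1) PRIMARY KEY, CSVString VARCHAR(8000))

-- Tally table: one row per character position we may need to inspect.
CREATE TABLE #Tally (N INT PRIMARY KEY)
INSERT INTO #Tally (N)
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a CROSS JOIN sys.all_objects b

;WITH Split AS
(
    SELECT  s.RowID,
            FieldNo = ROW_NUMBER() OVER (PARTITION BY s.RowID ORDER BY t.N),
            Field   = SUBSTRING(s.CSVString, t.N,
                                CHARINDEX(',', s.CSVString + ',', t.N) - t.N)
    FROM    #Stage s
    JOIN    #Tally t
      ON    t.N <= LEN(s.CSVString) + 1
     AND    SUBSTRING(',' + s.CSVString, t.N, 1) = ','
)
SELECT  RowID,
        C1 = MAX(CASE WHEN FieldNo = 1 THEN Field END),
        C2 = MAX(CASE WHEN FieldNo = 2 THEN Field END),
        C3 = MAX(CASE WHEN FieldNo = 3 THEN Field END)
        -- ...repeat the pattern out to C50 (the widest row)
FROM    Split
GROUP BY RowID

Whether this beats a CTE on 6 million rows would have to be tested, but the tally-table join at least keeps the whole split in one set-based statement.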
There are several terms used in MS SQL Server that I don't know and cannot find in my books. Does MS provide such a reference, besides BOL, where it is difficult to find good explanations or even definitions?
thx,
Kat
P.S. It would be a nice feature if they don't have it currently.
Hopefully I'm asking this in the right place; sorry if it's not. Maybe you could point me in the right direction.
I have been informed that the use of MDF files (SQL Server Express databases) on the net is restricted, as this is classed as multiple connections and therefore falls outside the free license agreement.
I am looking at commercially developing and marketing a web-based system with a relatively small database footprint (well under 1 GB) using ASP.NET 2.0, and I like the look of SQL Server Express.
Could anyone clear up whether or not this is allowed under the SQL Server Express terms of use, or point me towards somewhere I can find information?
I have a table that contains 10 million records. Of the following two statements, which one provides better performance? Frankly, I have no idea how to compare the execution plans...
Basically, what I am trying to achieve is two types of search functions:
search for All Terms (easy and complete) and search for Any Terms.
The way I have gone about this so far is to split the search string by spaces in my ASP.NET app, search for each word, and merge the resulting dataset into the main return dataset.
This, however, has a few problems: the result dataset will contain duplicate values, and I am running queries in a loop.
What I am looking for is a one-stop-shop stored procedure that will split the search string, loop through each word, and add the results to a return table ONLY if they do not already exist within the return table.
Can anyone point me in the right direction, basically with the splitting of the string and the looping through the words? The rest I think I can handle.
Any other hints/tips/tricks would also be helpful.
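Something along these lines is what I have in mind (a rough sketch only; the dbo.Articles table and its columns are placeholders for the real searchable table):

CREATE PROCEDURE dbo.SearchAnyTerm
    @SearchString VARCHAR(1000)
AS
BEGIN
    SET NOCOUNT ON

    DECLARE @Results TABLE (ArticleID INT PRIMARY KEY, Title VARCHAR(200))
    DECLARE @Word VARCHAR(100), @Pos INT

    -- Trim the input and append a trailing space so the loop always finds
    -- a delimiter for the last word.
    SET @SearchString = LTRIM(RTRIM(@SearchString)) + ' '
    SET @Pos = CHARINDEX(' ', @SearchString)

    WHILE @Pos > 0
    BEGIN
        SET @Word = LTRIM(RTRIM(LEFT(@SearchString, @Pos - 1)))

        IF LEN(@Word) > 0
            INSERT INTO @Results (ArticleID, Title)
            SELECT a.ArticleID, a.Title
            FROM   dbo.Articles a
            WHERE (a.Title LIKE '%' + @Word + '%' OR a.Body LIKE '%' + @Word + '%')
              AND NOT EXISTS (SELECT 1 FROM @Results r WHERE r.ArticleID = a.ArticleID)

        -- Drop the word just processed and look for the next delimiter.
        SET @SearchString = SUBSTRING(@SearchString, @Pos + 1, LEN(@SearchString))
        SET @Pos = CHARINDEX(' ', @SearchString)
    END

    SELECT ArticleID, Title FROM @Results
END

The NOT EXISTS check (plus the primary key on the table variable) is what keeps duplicates out of the return table.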
First and foremost, thanks for reading and responding!
Does it matter how big a stored procedure is if you do things in the stored procedure such as declaring the parameters and then:
IF @Parm_Select = '<ALL>' do a select
IF @Parm_Select <> '<ALL>' and @Parm_Report = '1' do a select
IF @Parm_Select <> '<ALL>' and @Parm_Report = '2' do a select
This goes on and on, and I have written a couple of stored procedures that are about 1,500 lines of code based upon the parameters passed. I do not create any tables; they are all just select statements based upon the parameters passed. I thought I was doing the right thing because I did not want to have to write a procedure that calls a procedure (I read about this and got confused about the return parameters, because there is a lot of data being returned from the selects; I don't think I said that correctly!). I am just learning this SQL stuff, it is cool and I am excited, but I don't want to develop any bad habits in the beginning. I try to look these things up on the web, but I just don't get explicit answers from reading all of this stuff. Thanks to all in advance!
We did some "at scale" fuzzy lookup tests today and were rather disappointed with the performance. I want to know about your experience so I can set my performance expectations appropriately.
We were doing a fuzzy lookup against a lookup table with 25 million rows. Each row has 11 columns used in the fuzzy lookup, each between 10-100 chars. We set CopyReferenceTable=0 and MatchIndexOptions=GenerateAndPersistNewIndex and WarmCaches=true. It took about 60 minutes to build that index table, during which, dtexec got up to 4.5GB memory usage. (Is there a way to tell what % of the index table got cached in memory? Memory kept rising as each "Finished building X% of fuzzy index" progress event scrolled by all the way up to 100% progress when it peaked at 4.5GB.) The MaxMemoryUsage setting we left blank so it would use as much as possible on this 64-bit box with 16GB of memory (but only about 4GB was available for SSIS).
After it got done building the index table, it started flowing data through the pipeline. We saw the first buffer of ~9,000 rows get passed from the source to the fuzzy lookup transform. Six hours later it had not finished doing the fuzzy lookup on that first buffer!!! Running Profiler showed us it was firing off lots of singleton SQL queries doing lookups, as expected. So it was making progress, just very, very slowly.
We had set MinSimilarity=0.45 and Exhaustive=False. Those seemed to be reasonable settings for smaller datasets.
Does that performance seem in line with expectations? Any thoughts on improving performance?
I'm working with an existing package that uses the fuzzy lookup transform. The package is currently working; however, I need to add some columns to the lookup columns from the reference table that is being used.
It seems that I am hitting a memory threshold of some sort, as when I add 3 or 4 columns, the package works, but when I add 5 columns, the fuzzy lookup transform fails pre-execute:
Pre-Execute
Taking a snapshot of the reference table
Taking a snapshot of the reference table
Building Fuzzy Match Index
component "Fuzzy Lookup Existing Member" (8351) failed the pre-execute phase and returned error code 0x8007007A.
These errors occur regardless of what columns I am attempting to add to the lookup list.
I have tried setting the MaxMemoryUsage custom property of the transform to 0, and to explicit values that should be much more than enough to hold the fuzzy match index (the reference table is only about 3,000 rows, and the entire table is stored in less than 2 MB of disk space).
Say I want to look up a value in another dataset, but there is a grouping that requires you to know what the values at each level are in order to get to the correct detail record. Can you still use the Lookup function with more than one field to compare against? So, for example:
Department
  \___ SalesPerson
        \___ Measure
I want to be able to add a new row at the Measure level, but look up each field from another dataset. In order to do that I will need both the Department AND SalesPerson values to do the lookup, but I don't think the Lookup function will let us do that, will it?
Actually this is in regard to an SCD Type 2 dimension. The scenario is that I am moving a fact table from an old source, and I have a DimensionA description value in the fact which I want to replace with the appropriate ID from the dimension table. That dimension table is SCD Type 2, based on StartDate and EndDate, and the fact table doesn't contain a direct date value; rather, there is a TimeId in the fact. So, to update the value in the fact table, I have to join the time dimension table and the other dimension table to replace the fact description with the proper ID.
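A rough sketch of the join being described (all table and column names below are placeholders): the fact's TimeId goes through the time dimension to get a date, and that date is matched against the dimension's StartDate/EndDate range to pick the right surrogate key.

UPDATE f
SET    f.DimensionAId = d.DimensionAId
FROM   dbo.FactTable  AS f
JOIN   dbo.DimTime    AS t ON t.TimeId = f.TimeId
JOIN   dbo.DimensionA AS d ON d.Description = f.DimensionADescription
                          AND t.CalendarDate >= d.StartDate
                          AND t.CalendarDate <  ISNULL(d.EndDate, '99991231')  -- open-ended current row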
I am doing a lookup that requires mapping 2 columns in the column mapping section. When I do this, I get the error "Row yielded no match during lookup". The SQL that I captured in SQL Profiler does find the record when I run it in Management Studio. I have already tried trimming everything, to no avail.
Why is this happening?
I tried enabling memory restrictions, but then my package hangs and I get a SQLDUMPER_ERRORLOG.log file with the following logged:
I have a Conditional Split with 3 outputs. On the first output I have a lookup. When I execute the package I have 56 rows going through the Conditional Split; all rows then go to the 2nd and 3rd outputs, but the lookup on the first output generates the error "Row yielded no match during lookup".
I don't understand why the lookup is generating an error while there is no row going through it.
I would like to know the best practice for running Analysis Services in terms of port usage. Is it better to run on a specific port or to use dynamic ports? We have clustered servers that run on the default port 2383, but I'm not sure what the best way to get good performance is for non-clustered servers.
I have a table that contains words that will be used to search another table where a full-text index has been created on the searchable columns. I'm basically trying to run something like this:
SELECT t1.col1, t2.col3 FROM tbl1 t1, tbl2 t2 WHERE CONTAINS (t1.col1, t2.col1)
I know this won't work, but is there a way to join these two tables so the words (t2.col1) can be passed as search conditions? There is no common key between the tables, so a normal join won't work. I'm trying to find a way to pass the search words from one table to the other.
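One workaround sketch: CONTAINS does accept a variable as the search condition, so the words can be fed in one at a time with a cursor, recording which word produced each hit (the #Matches table and the column names here are assumptions about the real schema).

CREATE TABLE #Matches (MatchedCol1 VARCHAR(200), SearchWord VARCHAR(200))

DECLARE @Word VARCHAR(200)
DECLARE word_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT col1 FROM tbl2

OPEN word_cur
FETCH NEXT FROM word_cur INTO @Word
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Each pass searches the full-text index for one word from tbl2.
    INSERT INTO #Matches (MatchedCol1, SearchWord)
    SELECT t1.col1, @Word
    FROM   tbl1 t1
    WHERE  CONTAINS(t1.col1, @Word)

    FETCH NEXT FROM word_cur INTO @Word
END
CLOSE word_cur
DEALLOCATE word_cur

SELECT MatchedCol1, SearchWord FROM #Matches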
Sorry to ask a stupid question. I have SQL Server 2000 on SBS 2003. I can't find Perform.exe. Does anybody know where it should be? (I'm sure that you all do!)
Hi, can anybody please help me with the concept called transactions? Where exactly do we do this, in the SQL or through the C# code? I have three stored procedures, which I am calling one by one to perform a single operation on the client side. If any one of them fails due to an exception, it should roll back and return to the initial state. How do I perform this? Any ideas?
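A minimal sketch of the SQL-side option, assuming SQL Server 2005 or later and three placeholder procedure names: wrap the three calls in one transaction inside TRY/CATCH so a failure in any of them rolls everything back (a similar effect can be had from C# with SqlTransaction or TransactionScope).

BEGIN TRY
    BEGIN TRANSACTION

    EXEC dbo.usp_Step1
    EXEC dbo.usp_Step2
    EXEC dbo.usp_Step3

    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION

    -- Re-raise so the caller (e.g. the C# code) still sees the error.
    DECLARE @Msg NVARCHAR(4000)
    SELECT @Msg = ERROR_MESSAGE()
    RAISERROR(@Msg, 16, 1)
END CATCH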
/* Reset identity on tables with an identity column */
exec sp_MSforeachtable 'IF OBJECTPROPERTY(OBJECT_ID(''?''), ''TableHasIdentity'') = 1 BEGIN DBCC CHECKIDENT (''?'', RESEED, 0) END'

-- City
SET IDENTITY_INSERT City ON
INSERT INTO Elbalazo.dbo.City ([CityID], [CityName], [CountyID], [Active])
SELECT [CityID], [CityName], [CountyID], 1
FROM [ElbalazoProduction].dbo.tbl_City
SET IDENTITY_INSERT City OFF

-- State
SET IDENTITY_INSERT [State] ON
INSERT INTO Elbalazo.dbo.State ([StateID], [State], [Active])
SELECT [StateID], [State], 1
FROM [ElbalazoProduction].dbo.tbl_State
SET IDENTITY_INSERT [State] OFF

-- NumberOfPeopleOption
SET IDENTITY_INSERT NumberOfPeopleOption ON
INSERT INTO [Elbalazo].[dbo].[NumberOfPeopleOption] ([NumberOfPeopleOptionID], [NumberOfPeopleNameOption], [Active])
SELECT [NumberOfPeopleID], [NumberOfPeopleName], 1
FROM [ElbalazoProduction].dbo.tbl_NumberOfPeople
SET IDENTITY_INSERT NumberOfPeopleOption OFF

-- DeliveryOption
SET IDENTITY_INSERT DeliveryOption ON
INSERT INTO [Elbalazo].[dbo].[DeliveryOption] ([DeliveryOptionID], [DeliveryOptionName], [Active])
SELECT [DeliveryOptionID], [DeliveryOptionName], 1
FROM [ElbalazoProduction].dbo.tbl_DeliveryOption
SET IDENTITY_INSERT DeliveryOption OFF

-- User
SET IDENTITY_INSERT [User] ON
INSERT INTO [Elbalazo].[dbo].[User]
    ([UserID], [FirstName], [LastName], [Address1], [Address2], [CityID], [StateID], [Zip],
     [PhoneAreaCode], [PhonePrefix], [PhoneSuffix], [Email], [CreateDate], [Active])
SELECT [CustomerID], [FirstName], [LastName], [AddressLine1], NULL, [CityID], [StateID], [Zip],
       [PhoneAreaCode], [PhonePrefix], [PhoneSuffix], [EmailPrefix] + '@' + [EmailSuffix], [CreateDate], 1
FROM [ElbalazoProduction].dbo.tbl_Customer
SET IDENTITY_INSERT [User] OFF

-- EntreeOption
SET IDENTITY_INSERT EntreeOption ON
INSERT INTO [Elbalazo].[dbo].[EntreeOption] ([EntreeOptionID], [EntreeOptionName], [Active])
SELECT [EntreeOptionID], [EntreeOptionName], 1
FROM [ElbalazoProduction].dbo.tbl_EntreeOption
SET IDENTITY_INSERT EntreeOption OFF

-- CateringOrder
SET IDENTITY_INSERT CateringOrder ON
INSERT INTO [Elbalazo].[dbo].[CateringOrder]
    ([CateringOrderID], [UserID], [NumberOfPeopleID], [BeanOptionID], [TortillaOptionID],
     [CreateDate], [Notes], [EventDate], [DeliveryOptionID])
SELECT [CateringOrderID], [CustomerID], [NumberOfPeopleID], [BeanOptionID], [TortillaOptionID],
       [CreateDate], [Notes], [EventDate], [DeliveryOptionID]
FROM [ElbalazoProduction].dbo.tbl_CateringOrder
SET IDENTITY_INSERT CateringOrder OFF

-- CateringOrder_EntreeItem
SET IDENTITY_INSERT CateringOrderEntreeItem ON
INSERT INTO [Elbalazo].[dbo].[CateringOrderEntreeItem] ([CateringOrderEntreeItemID], [CateringOrderID], [EntreeItemID])
SELECT [CateringORder_EntreeItemID], [CateringOrderID], [EntreeItemID]
FROM [ElbalazoProduction].dbo.tbl_CateringOrder_EntreeItem
SET IDENTITY_INSERT CateringOrderEntreeItem OFF

select * from BeanOption
select * from CateringItemIncluded
select * from CateringOrder
select * from CateringOrderEntreeItem
select * from CateringOrderEntrees
select * from City
select * from Country
select * from DeliveryOption
select * from EntreeOption
select * from NumberOfPeopleOption
select * from [State]
select * from [User]
Hi all, I hope you guys can help me with the following bit of T-SQL. I already have a solution, but I really don't like it, and I've been trying to find a simpler, more elegant way of doing the same thing.
Firstly, let me present you with a brief explanation of what I am trying to do together with some sample data for you to play with and hopefully assist me in finding a better solution than the one I’ve come up with.
insert into #VehMake
select 222, 'FORD'        union all
select 210, 'FORD (USA)'  union all
select 223, 'FORD (AUS)'  union all
select 269, 'HONDA'       union all
select 253, 'NISSAN'      union all
select 280, 'VOLKSWAGEN'
This contains various vehicle makes which I'm sure you'll recognise!
insert into #VehicleHistory (PersonId, VehMakeVehModel)
select 1, 'FORD (USA) MUSTANG' union all
select 2, 'HONDA CIVIC'        union all
select 3, 'NISAAN ALMERA'      union all
select 4, 'VOLKSWAGEN PASSAT'
As you can see, in the second table the second column contains the vehicle Make and Model in one string. What I need to do is split the Make and Model into separate columns with an update statement.
This seems easy enough with a simple LIKE comparison:
VehMakeVehModel like VehMake+' %'
....BUT if you notice, there are two records in the #VehMake table that are similar but not the same. These are the 'FORD (USA)' and 'FORD (AUS)'. The update statement would return two records from the #VehMake table when trying to match with the first record in my #VehicleHistory table.
As I said, I did come up with a solution, but it seems overcomplicated, and I have a feeling that there is a way of doing this with an update. Maybe use the LEN() function, but I'm not sure.
Your help would be much appreciated.
BTW, once I've identified the correct Make, I can easily populate my model as all I have to do is use the replace function on VehMakeVehModel column and remove the matched make to get the full model name.
Hope that makes sense and thanks for any help in advance.
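For reference, a sketch of the longest-match idea using LEN() (assuming #VehMake's columns are (VehMakeId, Make) and that #VehicleHistory has Make and Model columns to receive the split values; adjust the names to the real ones):

UPDATE vh
SET    vh.Make  = m.Make,
       vh.Model = LTRIM(STUFF(vh.VehMakeVehModel, 1, LEN(m.Make), ''))
FROM   #VehicleHistory vh
JOIN   #VehMake m
  ON   vh.VehMakeVehModel LIKE m.Make + ' %'
WHERE  LEN(m.Make) = (SELECT MAX(LEN(m2.Make))
                      FROM   #VehMake m2
                      WHERE  vh.VehMakeVehModel LIKE m2.Make + ' %')

Because only the longest matching make survives the WHERE clause, 'FORD (USA) MUSTANG' binds to 'FORD (USA)' rather than plain 'FORD'.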
I am looking to perform a calculation and enter the result into a field within my table. The fields that I need to base the calculation on are all in one table (SALARY). The fields are SALARY and BASIC_HOURS, and the result is to be entered into the field HOURLY_RATE. The actual calculation to be performed is: