Sampling Data Set Via Integration Services Data Flow For Data Mining Models Without Saving Training And Test Data Set?
Nov 24, 2006
Hi, all here,
Thank you very much for your kind attention.
I am wondering whether it is possible to use SSIS to sample a data set into a training set and a test set and feed them directly to my data mining models, without saving them somewhere first, since the saved copies take up too much space. I really need guidance on this.
Is there any way to select only a portion of a data set to train the mining model? I mean that we would not need to split the data set in advance; what I want is to be able to select any random portion of a chosen data set to train a mining model. Any advice?
I am looking forward to hearing from you, and thanks a lot in advance for your advice and help.
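The idea of splitting rows on the fly can be sketched outside SSIS. This is a minimal Python illustration, not the SSIS implementation: the function name `route`, the 70/30 ratio and the seed are assumptions made for the example. Inside a data flow, the built-in Percentage Sampling transformation plays a similar role, sending sampled and unselected rows to two outputs without an intermediate table.

```python
import random
from typing import Callable, Iterable


def route(rows: Iterable, train_sink: Callable, test_sink: Callable,
          train_fraction: float = 0.7, seed: int = 42) -> None:
    """Send each row to exactly one sink as it streams past.

    Nothing is written to disk; the sinks decide what to do with a row
    (for example, hand it to model training or to model testing).
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    for row in rows:
        if rng.random() < train_fraction:
            train_sink(row)
        else:
            test_sink(row)


# Hypothetical usage: the lists only stand in for the two consumers.
train_rows, test_rows = [], []
route(range(1000), train_rows.append, test_rows.append)
```

Because the random generator is seeded, the same rows land in the same set on every run, which keeps later accuracy comparisons reproducible.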
I am wondering where I can store my mining results in the data mining engine. For example, I have mining results such as accuracy charts, decision trees, and other result formats that depend on the mining algorithms I used, so where can I actually store these results for later use by Reporting Services? Is it possible to do that in SQL Server 2005?
Thanks a lot for any help and guidance in advance.
Hi, I have just run a simple data set through a model to predict a simple true or false value (i.e. a binary output). The Lift Chart/Mining Legend in Analysis Services shows three results: Score, Population Correct (%), and Predict Probability (%).
Population Correct, I believe, is the percentage of predictions the model got right out of the total number of predictions it tried to make. Is this correct?
However, I can't work out how the other two are derived, in particular the Score. To give a live example, the scores were as follows:
Model            Score   Pop Correct   Pred Probability
Decision Trees   0.83    76.59%        54.28%
Neural Network   0.75    67.63%        50.05%
Ideal Model              100.00%
Can anyone help with this and give a detailed explanation?
Hi, I'm trying to learn about Analysis, Integration and Reporting Services. I have installed SQL Server 2005 Management Studio Express, but I can't find these in the Start menu as mentioned in the tutorial: Click Start > SQL Server 2005 > Business Intelligence Development Studio (for Reporting Services). What do I need to do? Please help me.
I have been trying to use SQL 2005 data mining for about 8 weeks. I am becoming frustrated because I am not able to make progress nor am I able to exploit the power of the system.
I need a training course! I have asked Microsoft in the UK for recommendations, but they have been unable to help. I have searched for courses in the UK and US without success.
I am coming to the Microsoft BI event in Seattle - will there be any opportunities there to get help or find help? (In Seattle I intend to concentrate on the Excel add ins)
When executing my data flow package, which contains only one source and one destination:
OLE DB Source -> SQL Server Destination
the following errors occur in my output:
Error: 0xC0202009 at Data Flow Task(infraction action), SQL Server Destination [3600]: An OLE DB error has occurred. Error code: 0x80040E14.
Error: 0xC0202071 at Data Flow Task(infraction action), SQL Server Destination [3600]: Unable to prepare the SSIS bulk insert for data insertion.
Error: 0xC004701A at Data Flow Task(infraction action), DTS.Pipeline: component "SQL Server Destination" (3600) failed the pre-execute phase and returned error code 0xC0202071.
I've checked the structure of my source and destination tables, but nothing seems to be wrong.
If anyone has ever faced these errors, please help me.
I have a question about automating the management of data mining models. In many businesses, patterns change very frequently, so mining models need to be re-created later to reflect new rules appearing in the data. Can we automate this whole process, for example by automatically assessing the mining model's accuracy and automatically re-creating the mining models based on predefined specifications? Could anyone here give me any ideas about that?
We are using an OLE DB Source for the Data Flow Source and an OLE DB Destination for the Data Flow Destination. About 30 million rows are being moved, and they are gathered using a SQL command. There are no other transformations in between; the data goes straight from one to the other. The flow starts amazingly fast, but after 5 million rows it slows considerably. I wondered whether anyone has experienced anything similar with large loads.
Could any expert here give me some guidance about which data mining tasks can be automated and scheduled via Integration Services packages? Also, if we automate the tasks, can we automatically save their results somewhere? For example, if we automate assessing the accuracy of a mining model, we will want to know that accuracy later, so we need to save the results of the automated actions. Is this possible?
Thanks a lot in advance for any guidance and help for this.
Hi, I can't figure out how to put nested tables into the Data Mining Model Training Transform (SSIS). Can anybody help me? An example would be appreciated. Diego B.
Data Flow A takes data from Excel File A, Data Flow B from Excel File B, and Data Flow C from Excel File C. If something goes wrong in Data Flow A, I would like to be alerted, but the package should continue running. The same applies to Data Flow B: if A is OK, go on; if B fails, send me the mail but continue until the end (so that Data Flow C still runs).
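The behaviour asked for here (alert on failure, then carry on) can be sketched in a few lines of Python. The names `run_all`, `flow_a`, `flow_b`, `flow_c` and the alert callback are hypothetical and only illustrate the control logic. In a package, the usual building blocks for the same effect are precedence constraints evaluated on Completion rather than Success, plus an OnError event handler containing a Send Mail task.

```python
from typing import Callable, Dict, List

ran: List[str] = []      # records which flows completed
alerts: List[tuple] = []  # stands in for the e-mails that would be sent


def run_all(flows: Dict[str, Callable[[], None]],
            alert: Callable[[str, Exception], None]) -> List[str]:
    """Run every flow in order; report failures but never stop early."""
    failed = []
    for name, flow in flows.items():
        try:
            flow()
        except Exception as exc:  # one failing flow must not block the rest
            alert(name, exc)
            failed.append(name)
    return failed


def flow_a() -> None:
    ran.append("A")


def flow_b() -> None:
    raise RuntimeError("Excel File B could not be read")  # simulated failure


def flow_c() -> None:
    ran.append("C")


failed = run_all({"A": flow_a, "B": flow_b, "C": flow_c},
                 lambda name, exc: alerts.append((name, str(exc))))
```

Here Data Flow C still runs after B fails, and the alert list holds one entry for B.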
I have a requirement to read an encrypted file as a data source. I am not allowed to save an unencrypted text file version on disk at any time for any length of time, so I created a custom source component that reads an encrypted CSV file, decrypts it, and then passes each row of data to the pipeline and ultimately to an OLE DB destination. Basically, it is just a text file reader with an added class that decrypts the file before the component sets columns or reads rows.
The custom component, “Encrypted File Source”, has a custom property “encryptionkey” with the encryption required flag set to true (code below) and is declared as eligible to be set in the expressions.
IDTSCustomProperty100 EncryptionKey = ComponentMetaData.CustomPropertyCollection.New();
EncryptionKey.Name = "EncryptionKey";
EncryptionKey.Description = "Secure String key value to decrypt the file";
EncryptionKey.Value = string.Empty;
EncryptionKey.ExpressionType = DTSCustomPropertyExpressionType.CPET_NOTIFY;
EncryptionKey.EncryptionRequired = true;
I want to be able to set the password for the encrypted file in the SQL Agent job that executes the SSIS project. This means I have an environment with a variable, “DataPassword”, that is set to sensitive. It maps to a Project parameter in the SQL Agent job that is also set to sensitive. Now I want to access that sensitive Project parameter inside my data flow, specifically in the Encrypted File Source component that I created, and set my EncryptionKey to that Project parameter.
The problem is that SSIS says:
"expression cannot be evaluated. ... The Expression will not be evaluated because it contains sensitive parameter value "$Project::DataFilePassord". Verify that the expression is used properly and that it protects sensitive information" (Microsoft.DataTransformationsServices.Controls)
[Code] ....
I am using SQL Server 2012, on a windows 7 box with VS2010 premium.
In my SSIS Data Flow Task, I have a query that retrieves data based on a couple of date parameters. Is there a way we can pass/use the Variables defined in the SSIS package in the query ?
(I am assigning values to those variables from C# code)
The query should look like this:
select ordernumber, customerid from salesorder
where statecode=3 and datefulfilled between @variable1 and @variable2
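A parameterized version of this query can be illustrated with Python's built-in sqlite3 module. This is only a sketch: the in-memory table and its three rows are made up for the example. The relevant parallel is the `?` placeholder, which is also the marker the OLE DB Source accepts in a SQL command before each marker is mapped to a package variable in its parameters dialog.

```python
import sqlite3

# Hypothetical sample data standing in for the real salesorder table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table salesorder ("
    "ordernumber text, customerid integer, statecode integer, datefulfilled text)"
)
conn.executemany(
    "insert into salesorder values (?, ?, ?, ?)",
    [
        ("SO-1", 10, 3, "2013-01-15"),
        ("SO-2", 11, 3, "2013-03-01"),  # outside the date range
        ("SO-3", 12, 1, "2013-01-20"),  # wrong statecode
    ],
)


def fulfilled_between(connection, start, end):
    """Run the query from the post with the two dates bound as parameters."""
    sql = (
        "select ordernumber, customerid from salesorder "
        "where statecode = 3 and datefulfilled between ? and ?"
    )
    return connection.execute(sql, (start, end)).fetchall()


rows = fulfilled_between(conn, "2013-01-01", "2013-01-31")
```

Binding the values instead of concatenating them into the SQL text keeps the statement fixed while the variables change between runs.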
I have a Data Flow Task within a ForEach loop container. The source of the flow is an ADO.NET connection and the destination is a Flat File Connection. I loop through a collection of strings in the ForEach loop. Based on the string content, I write some data to the same destination file in each iteration, overwriting the previous version. I am running into the following errors:
[Flat File Destination [38]] Warning: The process cannot access the file because it is being used by another process.
[Flat File Destination [38]] Error: Cannot open the datafile "Example.csv".
[SSIS.Pipeline] Error: Flat File Destination failed the pre-execute phase and returned error code 0xC020200E.
I know what's happening, but I don't know how to fix it. The first time through the ForEach loop, the destination file is updated. The second time is when this error pops up. I think it's because the first iteration is not closing the destination file. How do I force the file to close within the Data Flow task or through a subsequent Script Task? This works within a SQL 2008 package on one server but not within a SQL 2012 package on a different server.
I can't figure out how to put nested tables into the Data Mining Model Training Transform (SSIS). I can do a simple case table, but how do you get those nested tables with DM Training Transformation? Any ideas? Samples?
I am using the SharePoint adapters from Codeplex that allow me to use SharePoint source and destination tasks in SSIS for SQL Server 2008 and SharePoint 2010. I am able to pull the data from the SQL Server and insert it into the SharePoint List.
However, I prefer to just have fresh data every time, so I'd like to add a step to delete all the items in the list before inserting the new ones. Is there a way I can configure the SharePoint SSIS destination task to clear all the items before I insert new ones?
Using SSIS 2012 (within Visual Studio) on Windows 7.
Before allowing my Data Flow task to fire, I'd like to check the target table (OLE DB Destination) for a specific date value in a specific field. I've seen how the Lookup Task is commonly used to check for dupes before inserting, but I'm not able to use that method because the data value I want to search the table for is contained in a Global Variable (let's say "MyVariableDate").
Is there any way to check for any records in a target table where Date1 = MyVariableDate (i.e. scanning the entire table for any occurrence of MyVariableDate in the Date1 field)?
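The check itself is a single existence query with the variable bound as a parameter. The sketch below uses Python's sqlite3 module with a made-up `target` table purely to show the logic; the function name `date_exists` is hypothetical. In a package, one common arrangement is an Execute SQL Task that stores the result in a variable, followed by a precedence constraint expression that decides whether the Data Flow task runs.

```python
import sqlite3

# Hypothetical target table with two already-loaded dates.
conn = sqlite3.connect(":memory:")
conn.execute("create table target (Date1 text, Amount integer)")
conn.executemany(
    "insert into target values (?, ?)",
    [("2014-05-01", 100), ("2014-05-02", 250)],
)


def date_exists(connection, my_variable_date: str) -> bool:
    """Return True if any row in target has Date1 equal to the given value."""
    row = connection.execute(
        "select exists(select 1 from target where Date1 = ?)",
        (my_variable_date,),
    ).fetchone()
    return bool(row[0])


# The data flow would only be allowed to run when the date is not present yet.
should_run_data_flow = not date_exists(conn, "2014-05-03")
```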
I have created an event handler that contains a Data Flow task with an OLE DB Source and an Excel Destination.
This event handler is triggered by the failure of an Execute SQL task in the control flow Sequence container.
However, when I execute the Data Flow task of the event handler on its own, it runs successfully, but it fails when I execute the whole package.
I get the below error message:
[OLE DB Source [21]] Error: SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER. The AcquireConnection method call to the connection manager "TK463DW" failed with error code 0xC0202009. There may be error messages posted before this with more information on why the AcquireConnection method call failed.
I have tried setting the 'DelayValidation' property to 'True' on all the Control Flow and Data Flow tasks in the package and on the event handler, but I still could not fix this. I am not sure what I am missing.
Actor train nested table:
ID  MovieID  Gender
1   1        F
2   1        M
3   1        F
4   1        F
5   2        M
6   2        M
7   2        F
8   3        F
9   3        F
10  4        M
11  4        M
12  4        F
13  4        F
14  5        F
15  5        M
We want to build a classifier model in order to predict the Class of a Movie based on the Gender of the movie's actors. To deal with the nested table, Analysis Services maps each record of the nested table to an attribute of the case table. These attributes are named Actor(n).Gender with n = 1..15, so they depend on the nested table record numbers. Both the Microsoft Decision Trees and Microsoft Naive Bayes algorithms use these attributes without any modification.
We are implementing a Relational Naive Bayes algorithm and we are planning to aggregate such attributes in order to make them independent of the nested table record numbers.
As a next step, we tried to predict some unseen cases, and here we face a very big problem.
Let's take two more tables of unseen cases:
Movie test table:
ID  Class
6   +
7   NULL
8   NULL
Actor test nested table:
ID  MovieID  Gender
1   6        F
2   6        M
3   6        F
4   6        F
16  7        F
17  7        M
18  7        F
19  7        F
20  7        F
21  8        M
22  8        M
23  8        F
Predicting the Class of movie 6 is not a problem, since its actors were included in the training data set, so when the records are mapped to attributes, those attributes already exist in the model. But when you try to predict movies (7 and 8) with unseen actors, all the new attributes are simply ignored in the ALGORITHM::Predict call (in_ulCaseValues is zero!) because they do not exist in the model!
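The aggregation mentioned above (making attributes independent of nested record numbers) can be illustrated with a small Python sketch. The choice of aggregate, a count of actors per Gender per movie, is an assumption made for the example; the post does not say which aggregation is planned. The rows are copied from the Actor train and Actor test tables.

```python
from collections import Counter, defaultdict

# (ID, MovieID, Gender) rows from the Actor train nested table.
train_actors = [
    (1, 1, "F"), (2, 1, "M"), (3, 1, "F"), (4, 1, "F"),
    (5, 2, "M"), (6, 2, "M"), (7, 2, "F"),
    (8, 3, "F"), (9, 3, "F"),
    (10, 4, "M"), (11, 4, "M"), (12, 4, "F"), (13, 4, "F"),
    (14, 5, "F"), (15, 5, "M"),
]

# Rows for the unseen movies 7 and 8 from the Actor test nested table.
test_actors = [
    (16, 7, "F"), (17, 7, "M"), (18, 7, "F"), (19, 7, "F"), (20, 7, "F"),
    (21, 8, "M"), (22, 8, "M"), (23, 8, "F"),
]


def gender_counts(actor_rows):
    """Collapse nested rows into per-movie counts keyed by Gender only."""
    counts = defaultdict(Counter)
    for _actor_id, movie_id, gender in actor_rows:
        counts[movie_id][gender] += 1  # the actor ID plays no part in the key
    return {movie_id: dict(c) for movie_id, c in counts.items()}


train_features = gender_counts(train_actors)
test_features = gender_counts(test_actors)
```

Because the resulting attributes are keyed by Gender rather than by actor record number, movies 7 and 8 produce the same attribute names ("F" and "M") as the training movies even though none of their actors were seen during training.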
I'm using a Script Component to load data into an Oracle DB because of a performance problem. Now I have found that some data goes missing during the transfer. Please see the screenshot below:
I set up this package to import data from a SharePoint list into a SQL Server table. The primary key of my SQL table is mapped to the Title column of my SharePoint list. Duplicate values may be entered in the Title field of the SharePoint list, so when importing data into my table via SSIS, my package always errors out when it comes across duplicate values. How have others managed data integrity when importing from a SharePoint list with the Title column mapped to the primary key of a table?
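One way to think about the duplicate Title problem is to separate first occurrences from repeats before the insert. The Python sketch below shows that logic only; `split_duplicates` and the sample rows are hypothetical. Within a data flow, comparable effects are usually obtained with the Sort transformation's option to remove rows with duplicate sort values, or by redirecting failing rows through the destination's error output.

```python
from typing import Dict, Iterable, List, Tuple


def split_duplicates(rows: Iterable[Dict], key: str = "Title") -> Tuple[List[Dict], List[Dict]]:
    """Keep the first row for each key value; collect later repeats separately."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.append(row)  # would violate the primary key
        else:
            seen.add(value)
            unique.append(row)
    return unique, duplicates


# Hypothetical list items; the second "Alpha" is the problem row.
items = [
    {"Title": "Alpha", "Owner": "Ann"},
    {"Title": "Beta", "Owner": "Bob"},
    {"Title": "Alpha", "Owner": "Cid"},
]
to_insert, to_review = split_duplicates(items)
```

Keeping the rejected rows in their own collection makes it possible to log or report them instead of silently dropping them.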
I recently upgraded to 2012 SP1 CU5 and have found the SSDT GUI for SSIS to be almost unusable. I can't drag or resize items. Any time I try, they either automagically shrink to the tiniest possible size, shoot off to some extreme, or just shake uncontrollably. I didn't have these problems on previous versions (I don't remember which version it was).
I have a huge amount of data that I am loading from Excel into a database table. After loading 80 percent of the data, I get an error and the package fails. The package has lots of transformations and takes around 6 hours to process completely, so I don't want it to reload from the start. If I run it again, it should start from the next record after the point where I got the error.
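Resuming mid-load generally means remembering how far the previous run got. The Python sketch below shows that idea with a simple high-water mark; `resume_load`, the `state` dictionary and the sample rows are all hypothetical, and in a real package the mark would have to be persisted somewhere durable (for example a control table). Note that this sketch retries the row that failed rather than skipping it.

```python
from typing import Callable, Dict, Sequence


def resume_load(rows: Sequence[str], insert: Callable[[str], None],
                state: Dict[str, int]) -> None:
    """Insert rows starting at state['next_row'], advancing the mark after each success."""
    for index in range(state["next_row"], len(rows)):
        insert(rows[index])            # may raise; the mark then still points at this row
        state["next_row"] = index + 1  # only advanced once the insert succeeded


# Hypothetical demonstration: the first attempt fails on the fourth row.
rows = ["r0", "r1", "r2", "r3", "r4"]
target = []
state = {"next_row": 0}


def failing_insert(row: str) -> None:
    if row == "r3":
        raise ValueError("simulated bad row")
    target.append(row)


try:
    resume_load(rows, failing_insert, state)
except ValueError:
    pass  # the first three rows are loaded and state["next_row"] is 3

resume_load(rows, target.append, state)  # second run picks up at the failed row
```

After the second run every row is loaded exactly once, without repeating the rows that succeeded the first time.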
We have a single generic SSIS package that is used to import several hundred iSeries tables into SQL. I am not looking to rewrite the process. But I am looking for ways to improve performance.
I have tried Retain Same Connection, Maximum Insert Commit Size, Lock Table (TABLOCK), removing some large columns, and changing the log file location and size, and now I am working on tweaking DefaultBufferMaxRows.
To describe the data flow: there are six data flow tasks (DFTs) working at the same time. Each DFT has its own list of iSeries tables and columns and the corresponding generic SQL table names. Each DFT determines its list of tables based on the number of columns to import. So there is dft30 (the iSeries table has 1-30 columns to import), dft60 (the iSeries table has 31-60 columns to import), etc. The destination SQL tables are generically called Staging30, Staging60, etc. Each column in the generic Staging tables is varchar(100). Each DFT consists of an OLE DB Source and an OLE DB Destination.
The OLE DB Source uses a SQL Command from Variable to build a SELECT statement. The OLE DB Source uses a connection manager that uses the IBM iAccess IBMDA400 provider. The SQL Command ends up looking like this for dft30. This specific example imports from the iSeries table TDACLR, which has only two columns, so it will be copied to the Staging30 table.
select TCREAS AS C1,TCDESC AS C2,0 AS C3,0 AS C4,0 AS C5,0 AS C6,0 AS C7,0 AS C8,0 AS C9,0 AS C10,0 AS C11,0 AS C12,0 AS C13,0 AS C14,0 AS C15,0 AS C16,0 AS C17,0 AS C18,0 AS C19,0 AS C20,0 AS C21,0 AS C22,0 AS C23,0 AS C24,0 AS C25,0 AS C26,0 AS C27,0 AS C28,0 AS C29,0 AS C30,''TDACLR'' AS T0 from Store01.TDACLR
The OLE DB Source variable value looks like the following, but I am not showing the full 30 columns:
select cast(0 AS varchar(100)) AS C1,cast(0 AS varchar(100)) AS C2,cast(0 AS varchar(100)) AS C3,cast(0 AS varchar(100)) AS C4,cast(0 AS varchar(100)) AS C5, ... cast(0 AS varchar(100)) AS C30.
The OLE DB Destination uses OpenRowSet Using FastLoad From Variable. The insert into Staging30 ends up looking like this.
Of course we then copy and transform the Staging30 data to the SQL table that equals T0.
But back to DefaultBufferMaxRows. Previously the DFTs had default values of 10000 for DefaultBufferMaxRows and 10485760 for DefaultBufferSize. I added a SQL task to SUM the iSeries column sizes (TCREAS and TCDESC in this example) and set DefaultBufferMaxRows by dividing 10485760 by the SUM of the columns' max_length. But I did not see a performance improvement. Do you think that redefining the columns as varchar(100) for the insert is significant? Should I possibly SUM the actual number of columns (2) as 2x100, or SUM the 30x100?
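The two candidate calculations can be compared with simple arithmetic. The sketch below is only an estimate under stated assumptions: it treats each varchar(100) column as 100 bytes, ignores the T0 column and any per-row overhead, and the helper name `max_rows_for` is hypothetical. It does not settle which width the pipeline actually uses; it only shows how far apart the two answers are.

```python
DEFAULT_BUFFER_SIZE = 10_485_760  # bytes, the DefaultBufferSize value quoted in the post


def max_rows_for(row_width_bytes: int, buffer_size: int = DEFAULT_BUFFER_SIZE) -> int:
    """Whole rows of the given width that fit into one buffer."""
    if row_width_bytes <= 0:
        raise ValueError("row width must be positive")
    return buffer_size // row_width_bytes


# Assumption: each varchar(100) column counts as 100 bytes.
rows_if_two_columns = max_rows_for(2 * 100)      # only the real columns
rows_if_thirty_columns = max_rows_for(30 * 100)  # all columns declared in the flow
```

Under these assumptions the two estimates differ by a factor of fifteen (52428 rows versus 3495 rows), so which row width applies matters a great deal for the setting.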
Basically, I'm trying to create an SSIS workflow to download SharePoint list data to SQL Server on a schedule of some kind. Do we actually have to use the GAC install approach in order to get the SharePoint List Destination and SharePoint List Source entries to appear among the SSIS project workflow entities?
I need to populate [CreateDate] as rows flow from my Flat File Source into my OLE DB Destination (a SQL Server table). Should I do this with a variable in the SSIS package or with a Derived Column transformation in the data flow between the Flat File Source and the OLE DB Destination?
I would like to know if there is any way to combine third-party data mining packages with the SQL Server 2005 data mining algorithms, so that we can compare all of them and get the best results for training models.