Extract Specific Words From A Free Format String

Feb 27, 2008

I am required to send an XML file of our clients to head office in Belgium for comparison against a database of known undesirables. The data is in a legacy system with a custom database, so I have created an SSIS package that extracts the tables I need into SQL Server. I have also developed a program that reads from a text source, creates the XML and sends it by secure FTP to Hong Kong, who will handle it from there.

My problem lies in extracting enough data to avoid too many false positives. The scanning will check name, identity (passport number, etc.), town/city and country. We don't hold an identity number, and the town/city and country are buried in free format fields. A quick analysis of the 419,000 records shows that the spelling is terribly unreliable, too. In most cases the country has not been entered because the clients are local, and even when they are overseas, sometimes only the city has been entered. The city is often misspelt as well, e.g. Kuala Lumpar or Melboure.

The addresses are held in three equal-length fields called Address_1, Address_2 and Address_3. There's no guarantee that I will find the town/city or country in any particular one of these fields. In some cases, the street number and name are in Address_3 because the first two hold a company name and a C/O line.

So I'm not going to fret over the ones where the address information is nonsense or missing, but I would like to try to extract valid country names and town/city names where present, and this is where I get stuck. I'm from a COBOL programming background, and although I'm loving getting used to the power of SQL, I'm still a bit stumped when I come across a problem like this, probably because I keep thinking of the solution in procedural terms.

I have a feeling that the solution will be to create two separate reference tables, one of towns/cities and the other of countries. I would then somehow search the three address fields for those keywords and, if found, enter them in the appropriate part of the output text file to represent town/city and/or country. I did also think about splitting the fields into separate words, but that doesn't help where the name consists of two words, such as NEW ZEALAND.
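To make the reference-table idea concrete, here is a rough sketch of how I imagine it might look. It runs the SQL through Python's built-in sqlite3 module purely so that it can be tried without a server. The table and column names other than Address_1 to Address_3 (Clients, Client_Id, Ref_City, Ref_Country) and the sample rows are made up for illustration. The query joins the three address fields into one padded, upper-cased string and looks for each reference name as a whole phrase, so NEW ZEALAND is found without splitting into words. On SQL Server the `||` concatenation would become `+` (and CHARINDEX could replace LIKE). It does nothing about misspellings.

```python
import sqlite3

# In-memory database with hypothetical table names and sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Clients (
    Client_Id INTEGER PRIMARY KEY,
    Address_1 TEXT, Address_2 TEXT, Address_3 TEXT
);
CREATE TABLE Ref_City (City_Name TEXT PRIMARY KEY);
CREATE TABLE Ref_Country (Country_Name TEXT PRIMARY KEY);

INSERT INTO Ref_City VALUES ('AUCKLAND'), ('KUALA LUMPUR'), ('MELBOURNE');
INSERT INTO Ref_Country VALUES ('NEW ZEALAND'), ('MALAYSIA'), ('AUSTRALIA');

INSERT INTO Clients VALUES
    (1, 'Acme Ltd', 'C/O J Smith', '12 Queen St, Auckland, New Zealand'),
    (2, '5 Jalan Ampang', 'Kuala Lumpur', ''),
    (3, '1 High St', NULL, NULL);
""")

MATCH_SQL = """
WITH Addr AS (
    -- Combine the three fields, turn commas into spaces, upper-case,
    -- and pad with spaces so every name is surrounded by blanks.
    SELECT Client_Id,
           ' ' || UPPER(REPLACE(
               COALESCE(Address_1, '') || ' ' ||
               COALESCE(Address_2, '') || ' ' ||
               COALESCE(Address_3, ''), ',', ' ')) || ' ' AS Full_Address
    FROM Clients
)
SELECT a.Client_Id,
       -- MIN() picks the alphabetically first name if several match.
       (SELECT MIN(City_Name) FROM Ref_City
         WHERE a.Full_Address LIKE '% ' || City_Name || ' %') AS City,
       (SELECT MIN(Country_Name) FROM Ref_Country
         WHERE a.Full_Address LIKE '% ' || Country_Name || ' %') AS Country
FROM Addr a
ORDER BY a.Client_Id;
"""

rows = conn.execute(MATCH_SQL).fetchall()
for row in rows:
    print(row)
```

Client 1 should come back with both a city and a country, client 2 with a city only, and client 3 with neither. Padding with spaces is there to stop a short name matching inside a longer word, but only commas are treated as separators here, so other punctuation would need the same treatment.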

I would love to hear from anyone who has dealt with a similar problem and has a neat solution to this using SQL.

Extract Specific Word From String

Transact SQL :: Extract ID And Specific Part Of A String From Column

String Index Function (substring / Charindex) - Extract Specific Characters From Data

Repost!: Extract Data Meeting Specific Criteria.

Transact SQL :: Sequence Of Characters - Extract Some Specific Values

GETALLWORDS Inserts The Words From A String Into T

Efficiently Searching Multiple Words In A String

Extract Data In Insert Into... Statement Format

Extract File From Database (Stored In BLOB Format)

Extract INT From String?

Free User-Defined String Functions Transact-SQL

Date In String Format Has To Be Changed Datetime Format

Transact SQL :: How To Format A String In A Format Coming From A Table

Extract Value From Middle Of String

String Extract From Text

How To Extract Part Of A String

How To Extract Part Of String

Search String And Extract To Another Table

How To Extract Random Word In A String

Extract A String In A Stored Procedure

Transact SQL :: Extract String With Delimiter

SQL 2012 :: Output In Specific Format

Exporting File Name In Specific Format

Get Recent Event In Specific Format

SQL Server 2008 :: How To Extract Part Of A String

How To Extract Part Of A String In Column Results

Extract Numbers Or Letters From Mixed String

Extract Substring From String(Regular Expressions)

Extract Data From Middle Of String In SQL Server

T-SQL (SS2K8) :: How To Use Substring And Charindex To Extract Desired String

T-SQL (SS2K8) :: Extract String - Variable Sizes With Breaks?