DataTune – the first data cleansing software
Author: Michael Haephrati
TargetData created its first data cleansing system in 1998. Michael Haephrati developed the first “TargetData” software, as well as ThiS (Targeted Human Intelligent Scoring), a system used to evaluate the financial strength of people based on statistical geo-data.
The TargetData software initially had a different purpose: displaying statistical and demographic data about the State of Israel.
During 2001-2003 Target Data developed DataTune, a data cleansing system, under the brand name TargetData, which is part of Target Eye Limited.
Among their clients were government institutes, large corporations, Microsoft Israel, Bezeq Call, and People and Computers.
The Target Data services include:
- Data enhancement
- Filtering redundancies (duplicates)
- Data conversion
- Unification and filtering of records (including between various databases)
- Location of potential customers (for database marketing)
- Integration of various information systems
- Generation of automatic reports
- System analysis
- Development of efficient tools for normal and correct entry of new data
- Location of updated information about private individuals and companies (including investigations, sending people into the field, etc., that is, in case there is no updated information)
- Development of customized software for execution of current activities on the customer’s premises, without the need for external help
- Improvement of information systems performance
- Development of software and tools based on Excel
- Training and assimilation
Information systems and the importance of maintaining their serviceability
An organization’s information systems are among its most important assets, if not the most important. Most of the organization’s information is currently stored in its computer network. In most cases a number of information systems are in use, often quite different from one another, which creates the need to synchronize and update items of information between the various databases.
Furthermore, since a significant proportion of the information is entered manually, a large number of errors arise, causing duplication and loss of information. As a result, it becomes difficult to locate items of information at a later stage, systems become overloaded, mail is returned, and malfunctions related to customers or suppliers occur.
Major importance is attached to the current updating of parts of the information without affecting the information systems in general. Most of the information items comprising the information systems require updating at some given frequency.
In fact, the frequency of updating, which differs from one item to another, also represents part of the problem. Some of the information items are related to the organization’s work:
- Customer databank.
An additional part, over which there is even less control, includes updating of:
- Addresses, for example: a street name that has been changed.
- Phone numbers, for example: changes in area codes or changes in the first few digits of the phone number.
All these subjects require complex processes for the current enhancement of the information systems. In most cases it is necessary to occasionally or periodically refresh all the information systems, and to formulate procedures for maintaining the information systems constantly serviceable.
General data enhancement
The concept of data enhancement refers to a series of operations intended to significantly improve the quality and efficiency of the organization’s information systems.
The following is a description of these operations:
- Separation of fields
In some information systems, a number of fields are kept as single fields. This makes it difficult to perform updating, retrieval, and current maintenance.
For example, saving a customer’s address as a single field (e.g. “1020 Main St. Apt #5”) makes it difficult to locate all the customers living on a given street (e.g. Main St.), and equally difficult to update the street name, if it is changed, in the records of all customers living on that street.
Target Data has developed a unique method for splitting fields into their components. This method comprises separating full addresses, street addresses, names of people and names of companies. In addition, this method may be used to perform further manipulations of data structures of all kinds, at the customer’s request.
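Target Data’s actual splitting method is not described here. As a rough illustration only, the following sketch splits a single-field address of the shape used in the example above into house number, street, and apartment. The regular expression and the function name `split_address` are assumptions made for this illustration, and real address data would need far more patterns.

```python
import re

# Hypothetical pattern for addresses shaped like "1020 Main St. Apt #5".
ADDRESS_RE = re.compile(
    r"^\s*(?P<house>\d+)\s+"   # leading house number
    r"(?P<street>.+?)"         # street name (shortest possible match)
    r"(?:\s+(?:Apt|Appt)\.?\s*#?\s*(?P<apartment>\w+))?\s*$",  # optional apartment
    re.IGNORECASE,
)


def split_address(raw):
    """Return the address components as a dict, or None if the pattern does not fit."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    return {
        "house": match.group("house"),
        "street": match.group("street").strip(),
        "apartment": match.group("apartment"),  # None when no apartment is given
    }
```

Once the street is held in its own field, finding every customer on Main St. or renaming the street becomes a simple query on that field.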
- Updating addresses
From time to time street names are changed. Furthermore, most information systems contain different versions of street and town names, as well as different syntax for the address format.
Using Target Data’s exclusive method, a process has been developed to attach to each address a unique identifier, based on the coding system employed by the Ministry of the Interior in each country, giving each town and each street its unique numeric code.
The process consists of interpretation of the address in the information system and its conversion into town code, street code, house number, entrance number, apartment number, and floor number.
These data permit immediate attachment of the precise, official address (town name, street name, Zip Code, etc.) whenever the street name is changed in any town. This process also permits easy and effective attachment of the postal code to every address.
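The idea of storing codes rather than names can be sketched as follows. All tables, code values, and function names below are invented for this illustration; real tables would come from the official registry of each country.

```python
from dataclasses import dataclass

# Hypothetical code tables; the numeric values are made up for this sketch.
TOWN_CODES = {"springfield": 1001}
STREET_CODES = {(1001, "main st"): 57, (1001, "main street"): 57}
OFFICIAL_TOWN_NAMES = {1001: "Springfield"}
OFFICIAL_STREET_NAMES = {(1001, 57): "Main Street"}


@dataclass(frozen=True)
class CodedAddress:
    town_code: int
    street_code: int
    house_number: int


def encode(town, street, house_number):
    """Convert free-text town and street names into their numeric codes."""
    town_code = TOWN_CODES[town.strip().lower()]
    street_key = street.strip().lower().rstrip(".")
    street_code = STREET_CODES[(town_code, street_key)]
    return CodedAddress(town_code, street_code, house_number)


def official_address(coded):
    """Rebuild the current official address text from the stored codes."""
    street = OFFICIAL_STREET_NAMES[(coded.town_code, coded.street_code)]
    town = OFFICIAL_TOWN_NAMES[coded.town_code]
    return f"{coded.house_number} {street}, {town}"
```

Because each record holds only codes, renaming a street means changing one entry in `OFFICIAL_STREET_NAMES`; every coded address then resolves to the new name.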
- Location of addresses containing the names of well-known places instead of the street name
In many cases the address field contains the name of a well-known place instead of the street name. For example, DATATECH, a firm situated in the Central Bus Station might mistakenly give its official address as follows: DATATECH, Central Bus Station, floor 5.
We can identify a wide variety of places in major European towns, such as bus and railway stations, airports, cultural centers, squares, and beaches (which in certain cases form the address of hotels), as well as sites that do not belong to a town or a local authority (such as highway projects). We can also identify shopping malls, commercial centers, industrial zones, etc.
- Interpretation of house numbers
When the street address has not been separated into fields, we can interpret and locate the house number in it, using a variety of methods. In general, an address may contain a house number in several forms: house number/apartment number, apartment number/house number, and other combinations, such as floor number and entrance number. The most important of these data (an error in which may cause the mail to be returned) is the house number. We cross-check this number against the known range of numbers for that street. For example, Barnauer St. in Berlin should not contain house numbers greater than 200. In addition, we check the apartment number and verify that it is plausible by comparing it with the number of floors usual in that district. For example, it is impossible to find an address on the tenth floor in a neighborhood of cottages.
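A minimal sketch of the range cross-check described above follows. The range table and both function names are hypothetical; the limit of 200 simply mirrors the example in the text.

```python
# Hypothetical table of known house-number ranges per (town, street).
HOUSE_NUMBER_RANGES = {("berlin", "barnauer st."): (1, 200)}


def house_number_is_plausible(town, street, house_number):
    """Return True or False for a known street, None when the street is unknown."""
    key = (town.lower(), street.lower())
    if key not in HOUSE_NUMBER_RANGES:
        return None
    low, high = HOUSE_NUMBER_RANGES[key]
    return low <= house_number <= high


def pick_house_number(town, street, first, second):
    """For an ambiguous 'a/b' pair, return the one value that fits the range.

    Returns None when neither value fits, both fit, or the street is unknown,
    so the record can be passed on to other checks or to manual review.
    """
    candidates = [
        number for number in (first, second)
        if house_number_is_plausible(town, street, number)
    ]
    return candidates[0] if len(candidates) == 1 else None
```

For a field such as “350/12” on this street, only 12 lies inside the known range, so it is taken as the house number and 350 as the apartment.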
- Data enhancement using an Error Bank
The Error Bank is our collection of common errors and misprints found in databases, built up over our years in this business. This bank permits us to identify common misprints and relate them to the correct name, which may be:
– Name of a street
– Name of a company
– Name of a town.
This is all based on the assumption that errors frequently repeat themselves.
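In its simplest form, an error bank can be thought of as a lookup table from known misprints to correct names. The entries and the function `correct_name` below are invented for this sketch and say nothing about how Target Data’s Error Bank is actually stored.

```python
# Hypothetical error bank: known misprint (lower case) -> correct name.
ERROR_BANK = {
    "mian street": "Main Street",
    "main stret": "Main Street",
    "new co": "NewCo",
}


def correct_name(raw):
    """Return the corrected name if the misprint is known, else the input unchanged."""
    key = " ".join(raw.lower().split())  # ignore case and repeated spaces
    return ERROR_BANK.get(key, raw)
```

Each misprint identified during a project can be added to the table, so the same error is corrected automatically the next time it appears.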
- Data enhancement using Soundex-based algorithms
In order to identify unfamiliar misprints, we make use of Soundex-based algorithms that help us to identify misprints according to the phonetic root of the names. According to this method, Meerdevoor street in Den Haag can be misprinted as Merdevur, Mardevoor, etc. and yet receive an identical root. This method helps to locate and correct difficult and strange misprints of names of companies, streets, and towns, as well as of people’s names, and to enhance the data automatically, rapidly, and to a high standard.
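The text does not say which Soundex variant is used. The sketch below implements the classic American Soundex, which is only one member of that family, to show how the three spellings in the example end up with the same code.

```python
# Classic American Soundex letter groups.
_GROUPS = (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
)
_CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and unknown letters have no code
        if code and code != previous:
            digits.append(code)
        previous = code
    return (first + "".join(digits) + "000")[:4]


print(soundex("Meerdevoor"), soundex("Merdevur"), soundex("Mardevoor"))
# prints: M631 M631 M631
```

Records whose names share a code become candidates for the same corrected name, which can then be confirmed against the official street, town, or company tables.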
- Attaching postal (zip) codes and other codes
We attach / verify the updated postal code to every address in the customer’s database in accordance with the updated table of postal codes of the postal authority of each country. We also attach the codes for streets and towns in accordance with the updated table of codes of the Ministry of the Interior in each country. This enables us to update the changes to street names published by the Ministry of the Interior from time to time.
- Enhancing names of companies and/or private individuals
This process is based on the technique of separating fields, developed by Target Data. At the end of the process an updated database is created permitting access to each element in the customer’s name (individual or company). This process permits locating duplications, locating family relations from the customer database and, if necessary, combining households. The process permits creating a single record for each company, and combining all the contact people with whom the organization is in contact into the main record. In this way the need is removed for updating separately details of information common to a company (address, fax number, Internet website), while at the same time access is retained to details of information specific to each contact person (extension number, or direct phone number, email address, etc.).
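The single-record-per-company structure described above might be modeled along the following lines. The class names, field names, and the `consolidate` function are assumptions chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    extension: str = ""
    email: str = ""


@dataclass
class Company:
    name: str
    address: str = ""
    fax: str = ""
    contacts: list = field(default_factory=list)


def consolidate(rows):
    """Group flat rows (one per contact person) into one Company per company name."""
    companies = {}
    for row in rows:
        company = companies.setdefault(
            row["company"],
            Company(row["company"], row.get("address", ""), row.get("fax", "")),
        )
        company.contacts.append(
            Contact(row["contact"], row.get("extension", ""), row.get("email", ""))
        )
    return companies
```

With this layout the shared details (address, fax number) are updated once on the company record, while each contact keeps its own extension and email address.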
- Locating updated addresses and phone numbers of companies and private individuals
In our experience the best way to locate the most up-to-date address and phone number of the customer is to use Bezeq’s records. The file of records of residents is actually updated only after the citizen informs the Ministry of the Interior, and is not always up-to-date. In contrast, since most people own a telephone today (even if they are renting their homes), because of the reduced prices of phone lines, we have found that Bezeq’s records are more up-to-date.
We can update and verify addresses and phone numbers of private individuals and companies using this method, by querying several sources of information until we locate the most up-to-date address and phone number.
- Filtering redundancies
As part of the data enhancement services we offer marking of redundant records, followed by filtering. This service is implemented after data enhancement, since only at this stage is it possible to identify duplicate records which originally contained apparently different data because of errors. Only the enhancement process permits identification and correction of redundancies at a later stage.
For example, the firm of “NewCo” appears in the database as “New Co” because of a mistake in data entry, and also under its correct name. Only after enhancement of the company’s name, i.e. after correcting “New Co” to “NewCo” will the redundancy be identified.
After identification of redundant records, a report is sent to the customer, offering several options. The first is to unify the information from each of the redundant records into a single record (i.e. merging the details that appear in the separate redundant records, so that no important information is lost).
Another option is to ignore and delete one or more of the redundant appearances of the record.
The third option is to leave the records unchanged. (This option is intended for customers who prefer to delete these records by themselves at a later stage.) It should be noted that, even then, data enhancement permits fast and easy identification of redundancies by the customer himself, by making simple queries.
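The order of operations, enhancement first and redundancy detection second, can be sketched as follows using the “NewCo” example. The correction table and the function names are hypothetical, and `unify` illustrates only the first of the three options (merging into a single record).

```python
from collections import defaultdict

# Hypothetical enhancement table: known misprint (lower case) -> correct name.
CORRECTIONS = {"new co": "NewCo"}


def enhance(name):
    """Return the corrected company name, or the trimmed input if no fix is known."""
    key = " ".join(name.lower().split())
    return CORRECTIONS.get(key, name.strip())


def find_redundancies(records):
    """Group records by enhanced name and keep only groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        clean = enhance(record["name"])
        groups[clean].append({**record, "name": clean})
    return {name: rows for name, rows in groups.items() if len(rows) > 1}


def unify(rows):
    """Merge redundant rows, keeping the first non-empty value found for each field."""
    merged = {}
    for row in rows:
        for key, value in row.items():
            if value and not merged.get(key):
                merged[key] = value
    return merged
```

Comparing the raw names would treat “New Co” and “NewCo” as two different companies; grouping by the enhanced name is what exposes the redundancy.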
- Location and enhancement of names of company branches
One of the most frequently encountered problems in data enhancement is that of enhancing records containing names and addresses of company branches. In most cases the official publications and most of the databases contain a record of the company management and/or the head office only. On the other hand, a record giving details of one of the branches may appear in the customer’s databank.
For example, the record headed “Bank of America” may actually describe one of the bank’s branches, since that branch is the company’s customer. If another branch is later added to the customer database, we do not want this addition to be identified as a redundancy but as an additional record, preferably with a link to Bank of America as the main name for each of the branch records.
We are prepared to handle data related to the branches of:
- Bank branches
We can attach the number of the bank and of the branch in each European country, the address, phone number, and fax number. We will also give a unique Sort Code to each branch.
- Schools
We will attach the school network name (e.g., Ort), the classification (e.g., primary school), the address, phone number, and fax number.
- Health fund branches
We will attach the name of the fund, the branch name, the manager’s name, address, phone number, and fax number.
- Income tax branches
We will attach the name of the branch, address, phone number, and fax number.
- National Insurance branches
We will attach the name of the branch, address, phone number, and fax number.
- Government ministries
We will attach most of the details for branches of government ministries, such as the Ministry of the Interior, the Licensing Offices, etc.
- Large companies
In the case of records containing information about one or more branches of a large company, we will attach the company’s official name to every record, but will store the information for every branch.
- Marketing chains
We will identify the branch and the chain to which it belongs.
- Improvement of information system performance
Target Data has unique methods for improving information system performance. We can improve the processes of retrieval, generation of reports, queries, and transfer of data from one database to another. All these processes may be made more efficient and rapid. This is done by analyzing the most and least frequent processes in the organization, and by setting priorities through trade-offs between the following criteria:
– Storage area and speed. (This is done by constructing search keys as required or deleting keys no longer required.)
– Saving copies of items of information separately, in order to provide fast access to them, as compared to saving pointers to these items of information, in order to save storage space at the expense of the response time required to retrieve them.
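The storage-versus-speed trade-off behind search keys can be illustrated with a small sketch. The function names and record layout are assumptions; a real database would use its own index facilities rather than Python dictionaries.

```python
def scan(records, field, value):
    """No extra storage: examine every record on every query."""
    return [record for record in records if record[field] == value]


def build_index(records, field):
    """Extra storage: a search key mapping each value to the positions that hold it."""
    index = {}
    for position, record in enumerate(records):
        index.setdefault(record[field], []).append(position)
    return index


def lookup(records, index, value):
    """Fast retrieval through the search key instead of a full scan."""
    return [records[position] for position in index.get(value, [])]
```

Both approaches return the same records. The index costs memory and must be rebuilt or maintained when data change, so it pays off only for queries that run frequently, which is why the frequency analysis comes first.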
We specialize in providing support for companies and organizations that are just starting, or are in the middle of the process of, assimilation of a CRM system.
- Data conversion
It is frequently necessary to convert the contents of one database to the format of another, either once or on an ongoing basis. In many cases the company receives a file of data from a customer or from another source and these data must be imported into the organization’s database. In other cases a number of applications or databases are in use and data must be imported or exported between different formats.
Target Data offers one-time services for data conversion between different systems, providing support for most information systems including databases, off-the-shelf applications such as Access, Excel, Office, dedicated applications such as WinFax, and applications supporting organizers, such as Palm, Visor, etc.
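As a generic illustration of a one-time conversion, the sketch below renames and reorders the columns of CSV data using Python’s standard `csv` module. The field mapping and the function name `convert` are invented for this example and do not reflect the formats of any of the products named above.

```python
import csv
import io

# Hypothetical mapping: source column name -> target column name.
FIELD_MAP = {"Name": "full_name", "Tel": "phone"}


def convert(source_text, field_map=FIELD_MAP):
    """Convert CSV text from the source column layout to the target layout."""
    reader = csv.DictReader(io.StringIO(source_text))
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=list(field_map.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Columns missing from the source are written as empty values.
        writer.writerow({new: row.get(old, "") for old, new in field_map.items()})
    return output.getvalue()
```

Columns that are not listed in the mapping are dropped, so the mapping doubles as a record of exactly which data are carried over.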
- Constant synchronization and updating
In addition, we offer organizations an application for ongoing synchronization between different systems and platforms. This is a more complex process and involves an initial investment of resources, but it removes the future need for conversion, import, and export processes. A synchronization system such as Target Data’s MultiSync™ system is installed on the company’s computers and constantly checks which data have been updated and in which information system. If necessary, automatic updating and conversion are performed on the relevant data in the required information systems or software products. For example, in a computer in which the MultiSync system is installed, linking Palm and WinFax applications, the system will identify the addition of a new customer to the WinFax address book, and will automatically update the Palm address book accordingly, and vice versa.
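How MultiSync works internally is not described here. As a general illustration of two-way synchronization, the sketch below keeps two address books in step by letting the more recently updated entry win. The data layout (a dictionary of `(timestamp, data)` pairs) and the function name are assumptions, and deletions are deliberately not handled.

```python
def synchronize(book_a, book_b):
    """Two-way sync of {key: (timestamp, data)} dictionaries, modified in place.

    An entry missing from one book is copied from the other; when both books
    hold the entry, the one with the newer timestamp overwrites the older.
    """
    for key in set(book_a) | set(book_b):
        a, b = book_a.get(key), book_b.get(key)
        if a is None or (b is not None and b[0] > a[0]):
            book_a[key] = b
        elif b is None or a[0] > b[0]:
            book_b[key] = a
```

In the scenario from the text, a customer added to one address book is absent from the other, so the first sync pass copies it across, and the same holds in the opposite direction.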
- Excel applications
One unique type of application in which we specialize is tools installed as add-ins to Excel. These tools automatically execute a variety of operations, such as conversion of accounts between various formats. For example, a customer may need to submit accounts in a new format recently specified by one of his major customers, so hundreds of accounts must be converted from the old format to the new one. The tool performs this format conversion automatically.
DataTune was provided until 2004 and is now discontinued.
About the Author
Michael Haephrati was born in 1964. He is an inventor, Hi-Tech specialist, music composer, and father.