In the last installment, we looked at the general approach to locating, converting and completing a quality direct marketing house list. Here, I'm going to show you the steps for handling a simple list. In the next installment, we'll dig into something more complex and more useful.

The goal of today's exercise is to take an online directory of nearly 2,000 contacts located two levels beneath a home page, and turn it into a database- or contact manager-ready file. We're going to use two tools to do this: Acrobat 7 (the latest release) and Word 2002.

Our goal will be to create a well-targeted mailing list in Access format with 2,000 names in under 20 minutes (not counting background activities)—starting from scratch.

Our list will be ready to go after spending 45 seconds per name, which I consider to be a reasonable time to find and document a potential lead.

This article and the next are very procedural. To help, I've created a ZIP file you can download containing a set of files that you can use to get a more concrete idea of what I'm doing. Those files are the following:

  1. First—this is a 535 KB PDF file showing what the initial download page contains.

  2. Full—this is a 2.5 MB PDF file containing the entire list as I capture it from the web.

  3. Sample—this is a 150 KB PDF file containing only about 10 pages from that full list.

  4. Initial RTF—this is a 4.6 MB Word document containing the pre-processed list.

  5. Sample Initial RTF—this is a 140 KB file containing the first few pages of Initial RTF, for purposes of small download review.

  6. Finished RTF—this is a 1.8 MB RTF document containing the end result.

  7. Finished TXT—this is a 625 KB TXT file, and this is the file you'll eventually import into your database.

You're encouraged to use these files to execute the steps I describe here.

Read the Rules

The lists we're going to work with today, and in the next installment as well, are real lists. Before you do the same things we're going to do here with lists of relevance to you, make sure you read all terms and conditions of use posted on the site you're gathering from. Some sites specifically disallow what we're about to do. If a site says I can't use their information this way, I don't.

OK. Let's start the 20-minute timer.

Step One—Find The List

I chose an industry at random to use for the examples: I imagined myself a manufacturer of seals and rings (e.g., washers, O-rings and so on) for machinery. My marketplace became industrial equipment manufacturers. My Google search was:

directories industrial manufacturers

and one of the first items on the list was perfect for me: Industrial Quick Search.

Take a look at the Web page—you'll see a list of around 180 categories. Click on any one and you'll see the company information—that's what we're after. We have five pieces of information: name, city, state, phone, and a description of the company (plus, if we wanted, a general category description). This is perfect for a general telemarketing campaign (not enough information for a targeted or personalized approach—but we'll deal with all that in the next installment).

Step Two—Define the Correct Acrobat Settings

We're going to capture that home page and convert it to a PDF file—then from within Acrobat we're going to download the rest of the information. In order to ensure we get the best, most efficient results, we need to adjust the default web capture settings in Acrobat.

Acrobat is going to want to try and capture the web site as faithfully as possible—and you have to stop it from doing that since it means you'll have far too much formatting to contend with. You don't want to maintain table structures, graphics, colors and the like.

Here are the basic Acrobat steps:

1. From File > Create PDF > From Web Page, you'll see this dialog box (your settings need to match the examples here exactly):

2. Hit the Settings button, and you'll see a dialog box with two tabs. Set the General tab as you see here.

In the box above, the only thing you're concerned about is that Acrobat creates PDF tags for the document, to maintain a workable format when we convert the PDF file in a little while.

3. Select HTML under File Type Settings (this is optional) and click the Settings button.

The only thing I recommend here is that you deselect Convert Images, so that the download is faster (we only care about the text anyway). Click OK when you're done.

4. Open the Page Layout Tab

On this box, I've increased the size of the created page, to avoid having relevant information span more than one page.

We also need to tell Acrobat that we want those pages to load in the PDF document we're creating, and not in a browser window. Select Edit > Preferences and then the Web Capture settings at the bottom of the Preferences box—make sure your preferences look like this:

Step Three—Download the File

Copy and Paste the web site URL into the Create PDF from Web Page dialog box. There are other ways to capture web pages (Acrobat 7, for instance, adds buttons on IE that allow you to do this)—but the copy and paste method is available in earlier versions of Acrobat so we'll use that here.

Click the "Create" button. A status box opens letting you know the progress of your download.

When it completes, you'll have a single-page document that looks like the file named First in the abovementioned downloadable ZIP file.

Step Four—Completing the PDF

The links in the center of the page are our target. If we click on them the specific web page will load.

There are a number of ways to capture these pages. You can click on each one individually—but that takes a long time and you have to sit, wait and watch. You can also use the Advanced > Web Capture > Append All Links on Page command. This is more convenient but it will download every link on the page—and as you can see there are some links that you don't want—often many (banner ads for instance).

A more selective approach is to pick only the links you want. To do this, select Advanced > Web Capture > View Web Links.

This lists every link on the page that has not already been downloaded (not available in Acrobat 5 and below):

Hit the Select All button. Then, as I've done here, deselect all the irrelevant links (in this case, all those links that do not point to a listing page) by holding down the Control (or Command) key and clicking each one with your mouse.

Hit the Download button and the highlighted pages will be added to the document.
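The same keep-or-drop decision can be written down as a rule. The sketch below is only an illustration of the idea, not something Acrobat does internally: the host name, the `/listings/` path prefix, and every URL in it are made-up placeholders, and `is_listing_link` is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_listing_link(url: str, host: str, path_prefix: str) -> bool:
    """Return True when a link points at a listing page on the directory site."""
    parts = urlparse(url)
    return parts.netloc == host and parts.path.startswith(path_prefix)


# Placeholder links standing in for what View Web Links might show.
links = [
    "https://www.example.com/listings/air-cylinders",
    "https://www.example.com/listings/ball-valves",
    "https://ads.example.net/banner?id=1",
    "https://www.example.com/about",
]

# Keep only the listing pages; banner ads and site pages are dropped.
selected = [u for u in links if is_listing_link(u, "www.example.com", "/listings/")]
print(selected)
```

Writing the rule out this way is mostly useful as a checklist: if you can't say in one sentence which links belong, you'll have trouble deselecting them consistently by hand.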

In this case it takes about 10 seconds (at my bandwidth) to download each page—so you've got a few minutes wait (but remember, that doesn't get added to our total time, since you can do something else while this is happening).

Look at Full or the much shorter Sample file to see what the file looks like when all the pages are downloaded.

About the File Structure

Each listing contains five relevant components:

  1. A company name hyperlinked to the company web site
  2. The city
  3. The state
  4. The phone number
  5. A description of what the company does

This will form the foundation of our prospect database—using only this information you can make a telephone call and know something about what the company does when you have that conversation. You'll still have to ask for "the person in charge of the supply chain" or whatever—fixing that is what we'll do in the next installment.

Our goal is to turn that into a database of 2,000 names that looks like this:

BECO Manufacturing Co., Inc. Laguna Hills CA 800-926-2326 BECO is a leading manufacturer of air cylinders, pneumatic valves, hand sprays, tanks and a variety of other products. BECO's PVC air cylinders are noncorrosive, inexpensive, durable and lightweight. At BECO, we are happy to discuss customizing our cylinders to fit your company's needs. Call us today!

And we want to do it in under 20 minutes of work time.

So far we've spent about two of those minutes.

We have 18 left.
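To make the target concrete, here is a minimal sketch of the five-field record described above, along with a reader for the tab-delimited file we're working toward. The class name `Listing` and the function `read_listings` are my own illustrative choices, and the sample description is shortened; your database or contact manager will have its own import routine.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Listing:
    name: str
    city: str
    state: str
    phone: str
    description: str


def read_listings(tab_delimited: str) -> list[Listing]:
    """Parse tab-delimited text into records, skipping rows without five fields."""
    reader = csv.reader(
        io.StringIO(tab_delimited), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        Listing(*(field.strip() for field in row))
        for row in reader
        if len(row) == 5
    ]


sample = (
    "BECO Manufacturing Co., Inc.\tLaguna Hills\tCA\t800-926-2326\t"
    "BECO is a leading manufacturer of air cylinders.\n"
)
listings = read_listings(sample)
print(listings[0].name, "|", listings[0].city, "|", listings[0].phone)
```

Quoting is switched off because company descriptions often contain quotation marks and apostrophes that should be kept as plain text.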

Step Five—Import Into Word

Our next step is to get this into Word—we'll use the RTF format, since it is common to earlier versions of Acrobat.

Select File > Save As and then the RTF option beneath the filenames. Hit the Settings button. Acrobat wants to recreate the file exactly as it was on the web page, and we don't want that. We only want a clean, single column list, with the URLs for each web site intact.

This one's easy: just Deselect everything.

Save the RTF file, then open it in Word (might be a good time for lunch, this could take a while depending on the number of entries in the file). You can see what it looks like by checking out the Initial RTF file. (Make sure to display all formatting marks from the Tools > Options > View menu by checking the All box in the Formatting Marks area.)

Only 16 minutes left to go.

Codes and Wildcards

One thing we have to do is use character codes and wildcards in find and replace specifications. Let's look at what we need to use here:



^p Paragraph
^t Tab
^# Any digit
^& (replace field only) Use the characters specified in the Find area as the Replace string

* Any character
@ Any number of the preceding character

b A space (used in this article only; in the Find and Replace box, just use the spacebar)
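If you're more at home with regular expressions, the codes above have rough counterparts in Python's `re` module. The mapping below is an approximation I'm assuming for illustration only (Word and Python differ in their matching details), and the sample string is made up around a phone number from the directory.

```python
import re

# Rough Python equivalents of the Word codes listed above.
# Assumption: each Word paragraph mark corresponds to one newline.
EQUIVALENTS = {
    "^p": r"\n",     # paragraph mark
    "^t": r"\t",     # tab
    "^#": r"\d",     # any digit
    "^&": r"\g<0>",  # replacement side only: the text that was found
    "*@": r".*?",    # any characters, any number of them (shortest match)
}

sample = "Phone 714-847-9560 \n"

# "Any four digits followed by a space", replaced with itself plus a tab.
result = re.sub(r"\d\d\d\d ", "\\g<0>\t", sample)
print(repr(result))
```

Only the last four digits of the phone number are followed by a space, so that is the one place the tab gets added.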

Step Six—Clean up and Format the List

The key to working through the mess of extraneous information you'll see is to find and replace information based on patterns—repetitive phrases, font conventions, number of carriage returns used consistently to separate one item from another, and things like that. Every list is different, every list has its own characteristics. But they all have patterns.

The most obvious issue here involves the tables that precede each category listing of companies. We need to get rid of them. (There are no border lines in the tables so they don't pop right out at you, but if you search for in Word you'll find the first one.)

There is one area where Wildcard specifications create a little problem for us: Word won't recognize the paragraph code (^p) in Wildcard mode, so we have to replace it with an unusual character (one that doesn't appear anywhere else). Actually, in this case, we have to replace a string of 11 paragraph marks in a row, which is the number of returns that follow the extraneous tabular information.

We'll use the tilde (~) as the replacement character since it is seldom used and is not a Wildcard character itself:

1. Open Edit > Replace.

With Use Wildcards deselected, enter the following find and replace specification (copy and paste it from here):

Find what: ^p^p^p^p^p^p^p^p^p^p^p

Replace with: ~

The table is no longer followed by a lot of paragraph symbols, but instead by a single ~.

Now let's get rid of the useless information.

You'll see, for instance, that at the end of every category listing the phrase "Additional air-cylinders Listings - Page 2" or something similar appears. That's followed by general contact information for the directory, the tabular category listing, and then the tilde. That pattern is always the same and predictable; within the table itself, however, there is unique content describing each category.

Bet we can get rid of all this with a single find and replace, using Wildcards.

2. In the Find and Replace dialog box, click on the More button.

With Use Wildcards selected, enter the following find and replace specification:

Find what: Additional*@~

Replace with: (leave blank)

The phrase (leave blank), of course, means to leave the Replace field empty. Here we've told Word to find any run of characters that begins with the word Additional and ends with a tilde. We don't care what content is in between that word and that character.

3. Click Replace All.

4. There is a little extraneous text left over at the start and at the end of the document. Delete it.

5. You might also have some rogue tildes still hanging around. Replace those with nothing:

Find what: ~

Replace with: (leave blank)

Once that is done we have a pretty nice looking document.
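For comparison, the three cleanup passes in this step can be sketched outside Word. This is a loose Python rendering under stated assumptions, not a replacement for the procedure: it assumes one newline per paragraph mark, the function name `strip_category_chrome` is my own, and the sample records (Acme Co., Next Co., the 555 numbers) are invented.

```python
import re


def strip_category_chrome(text: str) -> str:
    """Remove the page-break clutter between category listings."""
    # Pass 1: mark the end of each table by turning 11 paragraph marks into a tilde.
    text = text.replace("\n" * 11, "~")
    # Pass 2: delete everything from "Additional" through the next tilde.
    text = re.sub(r"Additional.*?~", "", text, flags=re.DOTALL)
    # Pass 3: remove any rogue tildes that are left.
    return text.replace("~", "")


sample = (
    "Acme Co.   Springfield , IL  555-010-0000 \n\nMakes widgets.\n\n"
    "Additional air-cylinders Listings - Page 2\n"
    "Directory contact info\nCategory table"
    + "\n" * 11
    + "Next Co.   Dayton , OH  555-010-0001 \n\nMakes gears.\n"
)

cleaned = strip_category_chrome(sample)
print(cleaned)
```

The `.*?` is deliberately the shortest possible match, so each deletion stops at the first tilde it meets rather than swallowing the listings that follow.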

So far we've used up around 7 of our 20 minutes. Only 13 left to go.

Step Seven—Convert to a Table

In preparation for conversion into a database table, we want to separate each of our column fields from the others by a tab (or a comma, but tabs are far simpler with this kind of activity).

If we look at a couple of records, we'll see what needs to be done:

California Compressor, Inc.   Huntington Beach , CA  714-847-9560

Compressed Air Professional, including Gas, Scroll, Medical, Rotary, Reciprocating, Oil-Free, Breathing Air and Vacuum Systems. Sales, Parts, Service, Installation and Engineering. Authorized Distributor: Powerex Compressor, Kaeser Compressor, Ultrafilter-Donaldson Filtration and much more.

Gas Equipment Engineering Corporation   Milford , CT  800-227-9389

Gas Equipment Engineering is a leading manufacturer of high-pressure multistage reciprocating compressors — air and gas compressors. Our units are designed and built to provide many years of reliable service even under severe operating conditions. Our company was founded in 1921.

The patterns are clear:

1. Deselect Use Wildcards and enter the following find and replace specification:

Find what: ®

Replace with: (leave blank)

(This gets rid of the registration symbols.)

2. The company name is separated from the city by three spaces. Replace this with a tab.

Find what: bbb

Replace with: ^t

3. City is separated from state by a space-comma-space pattern. Replace this with a tab.

Find what: b,b

Replace with: ^t

4. The state is separated from the phone number by two spaces. Replace this with a tab.

Find what: bb

Replace with: ^t

5. The phone number is separated from the description by a space and a paragraph marker.

To do this we're going to direct that Word find all instances where any four numerals are followed by a space. We'll then direct that Word replace that string with itself (four numerals and a space), followed by a tab.

Find what: ^#^#^#^#b

Replace with: ^&^t

6. Now we have one last string to clean up—the tab and two paragraph markers that follow the phone number need to be cleaned up.

We fix this by, first, disabling Use wildcards (remember, paragraph markers are not allowed in Wildcard mode), and then entering:

Find what: ^t^p^p

Replace with: ^t

7. And there is also a little bit of slop—an occasional instance where a record is separated by a paragraph marker, a space, and a paragraph marker.

We know how to fix that by now, don't we?

Find what: ^pb^p

Replace with: ^p

You should see a cleaned up file that looks like the Finished RTF file.

The records are separated internally by Tabs, and from one another by a paragraph.

8. There is one last anomaly.

A lot of the records contain tabs as the last field. Since every tab will become a table column, this means you will have empty columns in the table to deal with—this happens almost every time. Easy to fix.

Find what: ^t^p

Replace with: ^p

Repeat this until Word reports that it has found and replaced 0 occurrences.

9. Now select the entire file and use Table > Convert > Text to Table.

We do this just to make sure the information is clean: if it looks OK in a Word table, it will convert fine to a database.

You'll notice a few entries in the description column that did not start a new row. Fix those by hand.

10. Once you've done that, convert the table back to text.

You now have a tab-delimited file, ready to use as a source of telemarketing information for your database or contact management system, containing 1,848 records.

And we did it in 17 minutes.
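The replacements in this step translate fairly directly into a short script. What follows is a sketch under assumptions rather than the method used above: it assumes one newline per paragraph mark, the function names `to_tab_delimited` and `short_rows` are hypothetical, and the sample uses two records from the directory with shortened descriptions.

```python
import re


def to_tab_delimited(text: str) -> str:
    """Apply the Step Seven replacements in order."""
    text = text.replace("\u00ae", "")           # 1. drop registration symbols
    text = text.replace("   ", "\t")            # 2. three spaces: name / city
    text = text.replace(" , ", "\t")            # 3. space-comma-space: city / state
    text = text.replace("  ", "\t")             # 4. two spaces: state / phone
    text = re.sub(r"\d{4} ", "\\g<0>\t", text)  # 5. add a tab after the phone number
    text = text.replace("\t\n\n", "\t")         # 6. pull the description onto the record line
    text = text.replace("\n \n", "\n")          # 7. paragraph, space, paragraph
    while "\t\n" in text:                       # 8. strip trailing tabs until none remain
        text = text.replace("\t\n", "\n")
    return text


def short_rows(text: str, expected: int = 5) -> list[str]:
    """Return the rows that do not split into the expected number of columns."""
    return [
        line for line in text.splitlines()
        if line and len(line.split("\t")) != expected
    ]


sample = (
    "California Compressor, Inc.   Huntington Beach , CA  714-847-9560 \n\n"
    "Compressed Air Professional.\n"
    "Gas Equipment Engineering Corporation   Milford , CT  800-227-9389 \n\n"
    "Gas Equipment Engineering is a leading manufacturer.\n"
)

table = to_tab_delimited(sample)
rows = [line.split("\t") for line in table.splitlines()]
print(rows[0])
print("rows needing a fix by hand:", short_rows(table))
```

The `short_rows` check plays the same role as the Text to Table conversion: anything it returns is a record to fix by hand before importing.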

What's Next

There is much that can be done to make this file more useful yet. In this case, for instance, we still need to add the name of the key contacts and their addresses.

Also, there are much more complex documents you'll create that require a much broader understanding of how to use Wildcards and other Word functions. We'll touch on both next time.

Michael Fischler is founder and principal coach and consultant of Markitek, which for over a decade has provided marketing consulting and coaching services to companies around the world, from startups and SMEs to giants like Kodak and Pirelli.