Cleaning "Dirty" Data

Ron de bruin and Norman Harker, Last update 28-March-2005

Go to the Excel tips page

 


To
1. Remove excess spaces,
2. Substitute " " for all CHAR(160), and non printing characters (especially CHAR(10)) and,
3. Remove ".", "?" and "!" from end of text:

These are formulas or derivations from them are especially, but not solely,
used for preliminary handling of "bad" text entries and "bad" imported data.

Use:
=TRIM(CLEAN(SUBSTITUTE(LEFT(TRIM(A7),LEN(TRIM(A7))-OR(RIGHT(TRIM(A7))={"?","!","."})),CHAR(160)," ")))

Formula Comments:
Wow! 7 different Text functions and a coerced logical formula using an array structure in a calculation (of length)!
Look Mum! No VBA!
The above formula is a composite of the two formulas below. We've further explained the construction of the second formula.
Note that if used on a large data range, the result will be increased size of workbook and slower re-calculation.
In most cases this formula will be used once only and after use you should use Copy > Paste Special > Values > OK.
The source data can then be deleted or you can copy result over source data.

To just Remove Excess Spaces and Substitute " " for CHAR(160) and non-printing characters (especially CHAR(10))
Use:
=TRIM(CLEAN(SUBSTITUTE(A7,CHAR(160)," ")))

Formula Comments:
The formula replaces the non breaking character space CHAR(160) by a space.
This often causes trouble with data imported from HTML sources.
CLEAN removes all non printed characters with the most common "culprit" in Excel being CHAR(10).
CHAR(10) is inserted when you use Alt-Enter to force a line wrap. It is not
"seen" in the cell it is entered in but will appear as a box in cells that reference it.
The logic of the formula is that we TRIM after CLEANing after SUBSTITUTE of CHAR(160).

Just Remove Terminating Punctuation and TRIM
Use:
=LEFT(TRIM(A7),LEN(TRIM(A7))-OR(RIGHT(TRIM(A7))={"?","!","."}))

Formula Comments:
First look at how we determine if there are terminating punctuation characters:
=-OR(RIGHT(TRIM(A7))={"?","!","."})
Note the "-" before the OR. This forces a return of -1 for TRUE and 0 for FALSE.
We've used an internal array within the OR function to check for existence
of "?", "!" or "." in the TRIMmed target cell.
We have to TRIM the target cell in this OR function and elsewhere in the
main formula just in case some darned fool has put a space after the punctuation.
The use of the internal array structure allows us to cycle through the options
efficiently and without a long OR function that tests different values of the same parameter.
The OR function will normally return TRUE or FALSE but we negate the function and force to return -1 or 0.
Forcing to -1 or 0 allows us to use this element to calculate the LENgth of the TRIMmed formula for use by the LEFT function.
This forcing return of logical expressions to is commonly used where
addition, subtraction or multiplications by 1 or zero can serve a given objective.
 


Download a workbook showing these formulas in operation that allows you to
insert test data for "cleaning".
 


Here's a screen shot from the workbook:




Acknowledgements: Dave McRitchie's website at
http://www.mvps.org/dmcritchie/excel/strings.htm as a general resource and
information repository on string manipulation.