Fun with MS Word Wildcards

It’s been a while since I’ve done a geeky post (see my April post about setting up WordPress using the latest themes and blocks).

But every so often (maybe once a month or two) I end up confronting a technical issue which I absolutely must solve, and I cannot find the answer on the web. People say that half of being an IT guy is being able to use search engines to find the information you need. I generally agree, but sometimes the problem is very odd, or you don’t know exactly what problem you are seeing, or you are using the wrong terminology. Also, search engines are not as helpful for advanced geeks. Almost all the problems you encounter are complex and have an unusual set of circumstances that no forum thread is going to replicate exactly.

Partly the problem is having too many sources of information and being unable to winnow the relevant information. Sometimes postings on user forums are just not relevant. Sometimes you just lack enough information or smarts to understand the answer when it is staring right at you. Yes, I admit it: personal stupidity is often the primary barrier to a solution.

Describing the Problem: A Messy OCR Conversion

Recently I have been using a book scanner (CZUR ET18Pro) which scans the page and then uses the ABBYY FineReader engine (but NOT the actual software) to handle OCR conversions. It generally works great for books, but apparently not so great for scans of journal articles. The scan can be converted into text pretty accurately, but you have two more issues. First, CZUR puts the text content for each page into a separate frame object. Second, some pages (such as pages of a journal) don’t have enough space between paragraphs, causing CZUR to insert paragraph marks at the end of EVERY line. (So if a paragraph has 6 lines, instead of having 1 paragraph mark at the end, it has 6 paragraph marks.)

About the frames problem, I was quickly able to find a solution from the search engines. You could manually remove each frame, or you could run a simple VB script to remove all the frames globally (like here or here).

The second problem, paragraph marks after every line of text, is a pretty hard one. I spent days figuring it out. I have two possible solutions: either I manually adjust all the line breaks, or I come up with a repeatable way to clean up these OCR documents.

Manually adjusting the line breaks is tedious, but not terribly hard. I just have lots of pages to do; ideally I would have a way to globally clean up these documents with MS Word’s Find and Replace. I’ve certainly used Find and Replace on documentation before, even for complex substitutions. But I need to write some find/replaces that won’t cause any further damage, and do it in a way that is easy to repeat.

Ultimately I know that I’m going to have to search for a single paragraph mark and replace it with nothing. That is the Big Substitution, and also potentially the most destructive one. The trick is protecting the paragraph marks which you want to keep. But how? You convert these instances to a temporary value which is immune to the Big Substitution; then, after the substitutions are complete, you convert this temporary value back to a real paragraph mark.
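
The same protect, strip, restore idea can be sketched outside of Word. This is only a minimal illustration in Python, assuming the OCR text has been saved as plain text with a newline at the end of every scanned line; the function name and the "capital letter means new paragraph" rule are my simplifications for the sketch, not something Word provides.

```python
import re

PLACEHOLDER = "$realparagraph$"  # the temporary marker used in this post


def protect_strip_restore(text: str) -> str:
    """Join OCR line fragments while keeping genuine paragraph breaks."""
    # 1. Protect: treat a line break followed by a capital letter as a real
    #    paragraph start (a crude heuristic) and swap it for the placeholder.
    text = re.sub(r"\n(?=[A-Z])", lambda m: PLACEHOLDER, text)
    # 2. The Big Substitution: every line break still left is just the end
    #    of a scanned line, so turn it into a space.
    text = text.replace("\n", " ")
    # 3. Restore: turn the placeholder back into a real line break.
    return text.replace(PLACEHOLDER, "\n")


sample = "The quick brown\nfox jumped.\nA new paragraph\nstarts here."
print(protect_strip_restore(sample))
```

The placeholder only works if it is a string that never occurs in the real text, which is why something ugly like $realparagraph$ is a reasonable choice.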

Also, the order is important. If you get the order wrong, you may create extra work.

Microsoft Office offers some powerful functionality for Search/Replace. Indeed Word offers a limited number of preset search parameters. The first question: Do I “enable wildcards” in the search box?

I quickly realized that enabling wildcards offered more options for text processing. It includes a lot of syntax that resembles regular expressions. It is still complicated to write your replace strings, but the big stumbling block is: how do you search for paragraph marks when wildcards are enabled?

Curiously, this is not covered in the MS Word documentation. Here is the magical voodoo answer from this great article:

You may wish to identify a character string by means of a paragraph mark ¶. The normal search string for this would be ^p. ^p does not work in wildcard search strings! It must, however, be used in replace strings, but when searching, you must use the substitute code ^13.

Finding and replacing characters using wildcards By Graham Mayor: (mirrored here)

You read that right: when searching with wildcards enabled, you must use ^13 in the Find box. But when you reintroduce paragraph marks into the MS Word file, you must use ^p in the Replace box. (In the steps below I also turn wildcards off for that part.)

Another gotcha I noticed is that if you remove too many paragraph returns, it can corrupt the file. Instead of getting a clean substitution, the MS Word file is totally blank except for a giant misshapen blob of dark lines and dots.

Here is the series of steps I devised.

  1. Correct the obvious scanning errors, most of which will be highlighted. Usually they are hyphenated words which are broken up into two separate lines.
  2. With wildcards turned on, search for optional hyphens that fall at the end of a line.
    • Goal: remove optional hyphenation
    • Find: ^-^13
    • Replace (leave blank)
    • Manually or Globally? Globally. (this operation works pretty accurately)
    • Comment: Do this first, because leftover hyphens mess up the later substitutions. Warning: this will not catch every hyphenated word that was broken across two lines.
  3. Remove MS page breaks (some of these were accidentally added). Make sure to uncheck Use Wildcards for this find/replace. If it is easier, you can just copy the code from the Find line below.
    • Goal: remove all page breaks
    • Find: ^m
    • Replace (leave blank)
    • Manually or Globally? Globally
    • Comment: Uncheck Wildcards and choose Page Breaks from the dropdown under “Special”
  4. Remove MS section breaks.
    • Goal: remove all section breaks
    • Find: ^b
    • Replace (leave blank)
    • Manually or Globally? Globally
    • Comment: Uncheck Wildcards and choose Section Breaks from the dropdown under “Special”
  5. Page through the entire chapter and add a para mark + $sectionbreak$ wherever there should be a thematic section break.
  6. Manually page through the chapter and convert para marks followed by a capital I. (It often happens that the word I starts a line without being the start of a new paragraph.) Use Find/Replace one match at a time so that you replace only when the I begins a new paragraph. Substitute these instances with $realparagraph$I. Yes, that is an actual capital I after the $ sign.
    • Goal: find/replace all new para + I
    • Find: ^13I
    • Replace: $realparagraph$I
    • Manually or Globally? Manually
    • Comment:
  7. Now search for every time a para mark is followed by a capital letter. Note that the capital I is not included in this range of values
    • Goal: search for all new line paragraphs except capital Is
    • Find: (^13)([ABCDEFGHJKLMNOPQRSTUVWXYZ"“])
    • Replace: $realparagraph$ \2
    • Manually or globally? Globally. It will still change some things wrongly, but you can fix these afterwards.
    • Comment: Find groups the two terms with parentheses, which lets you reference them in the Replace statement. (Note the \2 with the space before the backslash.) I included a regular quote and an opening smart quote in the Find statement, but I think only one of them actually worked in my document. There are a number of false positives, so if you have the patience, doing this manually is safer.
  8. Now for the Big Substitution. Eliminate all single instances of a paragraph return at the end of a line.
    • Goal: Eliminate all single instances of a para return
    • Find: ([!^13])^13([!^13])
    • Replace: \1 \2
    • Manually or Globally? Globally (but have that Ctrl-Z ready!)
    • Comment: This searches for three elements where the middle element is a paragraph return and neither the preceding nor the following element is a paragraph return. (The ! means “not”; the parentheses group things into element 1 and element 2, which are referenced in the Replace statement.) This step is very tricky. The Find statement prevents MS Word from removing every paragraph return; it removes only the single ones at the end of a line. I can’t understand why the Find statement requires 3 (instead of 2) elements, but with only 2 it never seems to yield satisfying results.
  9. Now you will reintroduce paragraph marks where they are supposed to be
    • Goal: Substitute the placeholder $realparagraph$ with an actual paragraph return. You must turn off Wildcards!
    • Find: $realparagraph$
    • Replace: ^p
    • Manually or Globally? Globally
    • Comment:
  10. Now you will add extra space (in the form of paragraph returns) around the section breaks.
    • Goal: Add extra space around section breaks. You must turn off Wildcards!
    • Find: $sectionbreak$
    • Replace: ^p$sectionbreak$^p
    • Manually or Globally? Globally
    • Comment: When I have finished cleaning the MS Word version, I actually use a cut-paste operation to import it into DocBook XML using the Author mode of the Oxygen XML Editor. I will be converting the $sectionbreak$ markers into custom HTML code later, so I don’t need to convert them here. I just want it to be easy to see where the section breaks are.
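
For anyone who thinks more easily in regular expressions, the automated steps above (2, 3, 4, 7, 8, 9 and 10) can be approximated on plain text. This is only a rough sketch in Python, not what Word does internally. It assumes the document was exported so that each paragraph mark is a carriage return (character code 13), and it uses a soft hyphen and a form feed as stand-ins for Word’s optional hyphen and page/section breaks. Those stand-ins and the function name are my assumptions for illustration. The manual steps (5 and 6) are assumed to have been done already, so the markers are in the text.

```python
import re

# Assumptions about the plain-text export (not taken from Word itself):
PARA = "\r"             # paragraph mark, character code 13 (what ^13 matches)
SOFT_HYPHEN = "\u00ad"  # stand-in for Word's optional hyphen (^-)
BREAK = "\f"            # stand-in for page and section breaks (^m, ^b)
REAL_PARA = "$realparagraph$"
SECTION = "$sectionbreak$"


def clean_ocr_text(text: str) -> str:
    # Step 2: drop optional hyphens that sit right before a paragraph mark.
    text = text.replace(SOFT_HYPHEN + PARA, "")
    # Steps 3-4: drop page breaks and section breaks.
    text = text.replace(BREAK, "")
    # Step 7: a paragraph mark followed by a capital letter (except I) or an
    # opening quote is taken to be a real paragraph, so protect it.
    # The space before the captured character mirrors "$realparagraph$ \2".
    text = re.sub(r'(\r)([A-HJ-Z"“])',
                  lambda m: REAL_PARA + " " + m.group(2), text)
    # Step 8, the Big Substitution: a single paragraph mark with ordinary
    # characters on both sides becomes a space.
    text = re.sub(r"([^\r])\r([^\r])", r"\1 \2", text)
    # Step 9: put the real paragraph marks back.
    text = text.replace(REAL_PARA, PARA)
    # Step 10: add paragraph marks around each section-break marker.
    return text.replace(SECTION, PARA + SECTION + PARA)


raw = (
    "It was a dark and\rstormy night.$realparagraph$I went out in the "
    "tor\u00ad\rrents.\r\fThe rain fell\rall day.\r$sectionbreak$\r"
    "Morning came\rat last."
)
print(repr(clean_ocr_text(raw)))
```

As in Word, the order matters: if step 8 ran before step 7, every paragraph in the sample would collapse into one.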

There you have it. I tested this thoroughly on my test file. Over the next week I’ll do it with different files. Maybe I’ll uncover some anomalies or extra steps, but this is probably enough (I’m guessing).

