Topic: Other

How To Produce An "error File" Using Spellcheck

Posted by telemoxie on 500 Points
I remember reading that you can extract from a document a list containing technical jargon and unusual words by running a spellcheck program and looking at an "error file."

Such a list could be useful, for example, when creating a glossary.

Does anyone know how to produce such a file? Thanks in advance.


  • Posted by Jay Hamilton-Roth on Accepted
  • Posted by telemoxie on Author
    Jay: thanks, those posts are headed in the right direction, but I'm trying to find a bit of an easier way.

    As I think more about it, I remember the procedure (using some word processor from the last century) as being that you run the spellcheck, add the words to your personal list of approved spellings, and then print out that list.

    I'm dealing with a text file which has grown to over 120,000 lines of text. I'm thinking I might need to learn Python or something similar. I plan to keep looking, and I'll post updates as I find them. Thanks for the help so far.
  • Posted by steven.alker on Accepted
    Which word-processing program are you using? Or, more pertinently, which spellchecker, as they can be independent now? For example, Grammarly keeps exceptions to its dictionary in a separate file, but you have to accept them first. Likewise, Word builds your own custom dictionary, but you first have to accept every unrecognized term.

    If you want to flag every unrecognized term in a fresh document presented to a fresh word processor for the first time (capturing everything with a red line under it?), then that is a matter of programming. You can do that within Word, but you need to develop a macro which recognizes the code for {I have a squiggly red line under me}. (Believe it or not, there is a code for that!)

    So, are you asking for a spellchecker to give up its file of all the terms you have had to accept? (Relatively easy: it is in dictionary?.doc. I can't remember the exact name.)

    Or are you looking to chuck a text file at a spellchecker and hope that it rejects all your jargon?

    If you tell me which it is, or if you had something else in mind, it would clarify the question for me.
  • Posted by telemoxie on Author
    Thanks for the info. Yes, doing it automatically would seem to require some programming.

    I've been doing some research for a book, placing comments, notes, and references in a text file, and the text file has grown to well over 100,000 lines. There are lots of basic manipulations I would like to do, but they seem to require either programming or purchasing software which looks both expensive and complicated to me.

    I'm considering learning more about Python, or possibly batch programming.

    Regarding my word processor, I'm using PolyEdit, which I really like. I really don't know if the guy is still in business, but I find it to be very powerful software.

    Fortunately, the information that I was looking for is in all capital letters, so I was able to create a list of those terms using a series of search-and-replace operations, as well as the sort and select capability of PolyEdit.

    Thanks again for your assistance. Take care.
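
For anyone who would rather script the all-capitals extraction described in the post above than do it with search-and-replace, here is a minimal Python sketch. It is not the procedure used in PolyEdit. The pattern (two or more capital letters, optionally continued with capitals, digits, or hyphens) and the name `all_caps_terms` are assumptions made for illustration, and the pattern would need adjusting to however the terms actually appear in the notes.

```python
import re
from collections import Counter

# Assumed definition of a "term": two or more capital letters,
# optionally continued with capitals, digits, or hyphens.
ALL_CAPS = re.compile(r"\b[A-Z]{2,}[A-Z0-9-]*\b")


def all_caps_terms(lines):
    """Count every all-capitals term found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ALL_CAPS.findall(line))
    return counts


# Made-up sample lines standing in for the real notes file.
sample = [
    "The API returns JSON over HTTP.",
    "See the HTTP notes; the api docs are elsewhere.",
]

# Print the terms alphabetically with the number of times each appears.
for term, count in sorted(all_caps_terms(sample).items()):
    print(term, count)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `sample`, so a file of 120,000 lines is read one line at a time rather than loaded whole.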
  • Posted by steven.alker on Member
    Hi Telemoxie: Thanks for the details - very useful. The task seems to be simpler than I thought, depending on where the dictionary for PolyEdit is kept and in what form. If it is in text then you have a simple programming job which you could inexpensively contract out.

    Depending on the line length, 120,000 lines could hold up to 1,200,000 words or strings (at about ten words per line).
    The task is then to compare each of the 1,200,000 words against a dictionary of probably about 30,000 words at most, more likely 10,000. Google is hopeless, as it has a dictionary as big as Google Search, or about 640,000 words!

    Then you test word one against all 30,000 in the dictionary and, if it is not there, mark it.
    Then look for other instances of it in the text string and delete them.
    Then take the next word and test that.
    Same routine.
    It gets faster as you progress, so the entire job might only take about 30 minutes.

    Now all you need is a text version of that dictionary and someone with an hour spare to code a string/exception and list routine.

    Apart from locating a coder with the skills and the time, you are nearly home and dry.

    Access to that fresh dictionary in text form seems to me to be the biggest hurdle, but you can download free dictionaries in text form for other programmes.

    I have run this past a programming buddy and he says that it is solid, but that he is fully booked.
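
The compare-against-a-dictionary routine described in the post above can be sketched in Python along these lines. This is a sketch under stated assumptions, not a finished tool. It assumes the dictionary is available as plain text with one word per line, and it swaps the word-by-word scan and delete step for a set lookup plus a `Counter`, so each word is checked without walking the whole dictionary and repeats are tallied rather than deleted. The names `load_wordlist` and `unknown_words`, the word pattern, and the sample data are all made up for illustration.

```python
import re
from collections import Counter

# Assumed definition of a "word": letters, with an optional internal apostrophe.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def load_wordlist(lines):
    """Build a lowercase set from a one-word-per-line dictionary."""
    return {line.strip().lower() for line in lines if line.strip()}


def unknown_words(text_lines, known):
    """Count each word in the text whose lowercase form is not in `known`."""
    counts = Counter()
    for line in text_lines:
        for word in WORD.findall(line):
            lowered = word.lower()
            if lowered not in known:  # set membership, no scan of the dictionary
                counts[lowered] += 1
    return counts


# Made-up stand-ins for the dictionary file and the notes file.
dictionary = load_wordlist(["the", "signal", "is", "sent", "over", "a", "link"])
notes = [
    "The signal is sent over a VSAT link.",
    "A vsat link is a satellite link.",
]

# Most frequent unrecognized words first: candidate glossary entries.
for word, count in unknown_words(notes, dictionary).most_common():
    print(word, count)
```

Both functions take any iterable of lines, so open file objects for the dictionary and the notes can be passed in place of the sample lists.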

