I have tried the following two scripts and found neither of them very impressive.
html2text (Python script)
Html2text (Perl script)
The Python script converts an HTML page into Markdown (a text-to-HTML format), which I don't want. I want text only.
The Perl script requires the input to be "normalized" by a program such as sgmlnorm before it can process it. Apart from this, the script doesn't work well for all documents. It handles only certain tags and has to be modified to work with others. Text between tags that are not handled just vanishes from the output. Let's see if I can modify it to work for my documents at least.
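To make the "text only" goal concrete, here is a minimal sketch of what I'm after, written against Python's standard html.parser module rather than either of the scripts above. The names TextExtractor and html_to_text are made up for illustration, and the sets of skipped and block-level tags are my own assumptions, not a complete list. The point is that text inside tags the code doesn't know about is kept instead of dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores markup (illustrative sketch)."""

    # Assumption: content of these tags is never wanted in the output.
    SKIP = {"script", "style"}
    # Assumption: these tags should start a new line in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; into & etc.
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")
        # Any other tag is ignored, but its text still reaches handle_data.

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>Hello <blink>odd</blink> &amp; world</p>"
    print(html_to_text(sample))
```

Run on the sample, this prints "Title" on one line and "Hello odd & world" on the next. The text inside the unrecognised blink tag survives. It does nothing clever with tables, links or lists, so it is only a starting point.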