Looking for a regular expression to strip HTML

Hi, I’m looking for a way (or a regular expression) to strip an HTML area.

I’m using the text in another application, where only <b>test</b> or <i>test</i> are allowed.

So all the html, head and font tags must be stripped.

Also, things like &nbsp; are not allowed.

So in short, only the shown text and the bold and italic tags are allowed.

Anyone that can help?

Hi Harjo,

So in short you want all HTML tags stripped except bold and italic.
What about entities for things like &lt; (<) or &gt; (>), or high-bit characters like the euro sign or TM, etc.? Should those be stripped as well, or should they be decoded?

You’re a Mac guy right? How about BBEdit or the free version (TextWrangler)? If you need to use the regular expression within Servoy (presumably) then that will at least head you in the right direction. I know regular expressions pretty well but not HTML. BBEdit though is especially used by web designers, etc. and I’m sure will get you going.

Hi Robert,

No, they should not be stripped, but decoded!

Hi Harjo,

But should they stay untouched so they still render correctly in HTML, or do you want them as plain text?

yes plain text! :-)

Nice challenge, huh? :wink:

I found something here:

http://www.jamescrooke.co.uk/articles/r … trip-tags/

but I can’t translate that to javascript. :(

It’s a nice one, because there is an argument, which tags you want to allow!

(oh yeah, the <br> is allowed also! ;-) )

That code would translate to this:

/*    
   =============================================================================== 

   Based on stripHTML (ASP script) by James Crooke 
   http://www.jamescrooke.co.uk/articles/regular-expression-asp-strip-tags/ 

   Adapted for Servoy by Robert J.C. Ivens, ROCLASI Software Solutions 

   ===============================================================================    
*/ 
var sHTML       = arguments[0], 
    sAllowTags  = arguments[1], 
    aMatches    = null, 
    sTagName    = ""; 

sAllowTags      = ("," + utils.stringReplace(sAllowTags, " ", "") + ",").toLowerCase(); 
aMatches        = sHTML.match(/<(.|\n)+?>/g); 

if( aMatches ) 
{ 
    for ( var i = 0 ; i < aMatches.length ; i++ ) 
    { 
        sTagName   = aMatches[i].replace(/<(\/?)(\w+)[^>]*>/,"$2"); 
        sTagName   = "," + sTagName.toLowerCase() + ","; 
        
        if ( utils.stringPatternCount(sAllowTags, sTagName) == 0 ) 
        { 
            sHTML = utils.stringReplace(sHTML, aMatches[i], ""); 
        } 
    } 
} 
return sHTML;

Call this method like so:

```
sHTML = myMethodName(sHTML, "b,i,br");
```

But this still doesn't solve your HTML entity de-encoding problems. Also embedded CSS stylesheets are not filtered out.
Maybe for a next version ;)

Hope this helps.
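
For anyone who wants to try the idea outside Servoy, here is a minimal sketch of the same allow-list approach in plain JavaScript. The function name `stripTags` and the sample string are made up for illustration, and it swaps the `utils.*` calls for a single `replace` with a callback, so treat it as a rough equivalent rather than the method above.

```javascript
// Hypothetical helper, not part of Servoy: removes every tag whose name
// is not in the comma-separated allow list.
function stripTags(html, allowedTags) {
    // Normalise "b, i, br" to ",b,i,br," so only whole tag names match
    var allowed = "," + allowedTags.replace(/\s+/g, "").toLowerCase() + ",";

    // [\s\S] also matches line breaks, so tags spanning several lines are found
    return html.replace(/<[\s\S]+?>/g, function (tag) {
        var m = /^<\/?(\w+)/.exec(tag);
        // Comments and doctypes have no tag name here, so they are dropped
        if (!m) {
            return "";
        }
        return allowed.indexOf("," + m[1].toLowerCase() + ",") > -1 ? tag : "";
    });
}

var result = stripTags('<p><font size="2">Hello <b>bold</b> and <i>italic</i></font><br>bye</p>', "b, i, br");
// result: "Hello <b>bold</b> and <i>italic</i><br>bye"
```

Like the Servoy version, this only removes the tags themselves; whatever sits between a stripped opening and closing tag stays in the text.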

Okay, if you want to filter out the complete header (title, meta tags, etc.) of a webpage, plus any embedded scripts and stylesheets, add the following code right after the variable declarations.

sHTML			= sHTML.replace(/<head>.+<\/head>/,"");        // Strip the whole header
sHTML			= sHTML.replace(/<script.*>.+<\/script>/g,""); // Strip any embedded JavaScript
sHTML			= sHTML.replace(/<style.*>.+<\/style>/g,"");   // Strip any embedded StyleSheets

Hope this helps.

Maybe getAsPlainText can help as described at:
http://www.servoymagazine.com/home/2005 … on_ge.html

Dean Westover

Heh, now you tell me ;).

That would indeed solve the whole de-encoding issue but it also strips out any bold and italic codes.
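
If the bold and italic tags have to survive, the entities could be decoded separately after the tags are stripped. Below is a minimal sketch in plain JavaScript, under a few assumptions of mine: the name `decodeEntities` is hypothetical, the table of named entities is a small hand-picked set (not the full HTML list), `&nbsp;` is turned into an ordinary space, and numeric entities above U+FFFF are not handled.

```javascript
// Hypothetical helper: decodes numeric entities and a small, hand-picked
// set of named ones. Anything it does not recognise is left untouched.
function decodeEntities(text) {
    // Assumption: &nbsp; becomes a plain space because the target wants plain text
    var named = { nbsp: " ", lt: "<", gt: ">", quot: '"', euro: "\u20AC", trade: "\u2122", amp: "&" };

    // The trailing semicolon is optional so that a bare "&nbsp" is caught too
    return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);?/gi, function (match, body) {
        if (body.charAt(0) === "#") {
            var isHex = body.charAt(1).toLowerCase() === "x";
            var code = parseInt(body.substring(isHex ? 2 : 1), isHex ? 16 : 10);
            return String.fromCharCode(code); // fine for code points up to U+FFFF
        }
        return named.hasOwnProperty(body) ? named[body] : match;
    });
}

var decoded = decodeEntities("5&nbsp;&euro; &lt; 10&#8364; &amp; &#x2122;");
// decoded: "5 \u20AC < 10\u20AC & \u2122"
```

Because everything is replaced in one pass, something like `&amp;lt;` comes out as the text `&lt;` and is not decoded a second time. Run it after the tag stripping, otherwise a decoded `<` could be mistaken for the start of a tag.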

thanks Robert,

this will give me a jumpstart for sure! :D
I think I have a solution for the encoding problem.

As soon as I have tested it, I will post it here!

It seems my old method doesn’t work that well with multi-line matches.
Here is a method that does work.

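var sSource      = arguments[0],   // text to search
    sFrom        = arguments[1],   // start marker, e.g. "<head>"
    sTo          = arguments[2],   // end marker, e.g. "</head>"
    sReplaceWith = arguments[3],   // text put in place of each removed block
    nStart       = 0,
    nEnd         = 0,
    sTmp         = "";

nStart = sSource.indexOf(sFrom);
nEnd   = sSource.indexOf(sTo, nStart + sFrom.length);   // only accept an end marker after the start marker
while (nStart > -1 && nEnd > -1) {

    // Text before the start marker, then the replacement
    sTmp  = utils.stringLeft(sSource, nStart) + sReplaceWith;
    // Text after the end marker (utils.stringMiddle is 1-based)
    sTmp += utils.stringMiddle(sSource, nEnd + sTo.length + 1, sSource.length - nEnd);
    sSource = sTmp;

    // Resume after the replacement so it is never scanned again
    nStart = sSource.indexOf(sFrom, nStart + sReplaceWith.length);
    nEnd   = sSource.indexOf(sTo, nStart + sFrom.length);

}
return sSource;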

Just call that like so:

sHTML = myReplace(sHTML,"<head>","</head>","");
sHTML = myReplace(sHTML,"<script","</script>","");

Hope this helps.
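
For completeness, the regular expression route can also cope with multi-line blocks if `.` is replaced by `[\s\S]` and the match is made lazy. This is only a sketch in plain JavaScript; the name `stripBlocks` and the sample page are invented for the example, and it has not been tried inside Servoy.

```javascript
// Hypothetical helper: removes <head>, <script> and <style> blocks,
// including blocks that span several lines.
function stripBlocks(html) {
    var names = ["head", "script", "style"];
    for (var i = 0; i < names.length; i++) {
        // [^>]* allows attributes, [\s\S]*? is a lazy match across line breaks,
        // \b keeps "<head" from matching "<header"
        var re = new RegExp("<" + names[i] + "\\b[^>]*>[\\s\\S]*?<\\/" + names[i] + "\\s*>", "gi");
        html = html.replace(re, "");
    }
    return html;
}

var page = "<html><head>\n<title>x</title>\n</head><body><SCRIPT type=\"text/javascript\">\nalert(1);\n</SCRIPT>Hi <b>there</b></body></html>";
var cleaned = stripBlocks(page);
// cleaned: "<html><body>Hi <b>there</b></body></html>"
```

The `i` flag makes the match case-insensitive, which the `indexOf` based method above is not, so `<HEAD>` or `<Script>` are caught as well.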