looking for regular expression to strip html

Harjo · March 3, 2007, 6:14pm

Hi, I’m looking for a way (or a regular expression) to strip the HTML from an HTML area.

I’m using the text in another application, where only <b>test</b> or <i>test</i> is allowed.

So all the html, head, font tags, etc. must be stripped.

Also, things like &nbsp; are not allowed.

So in short, only the shown text and the bold and italic tags are allowed.

Anyone that can help?

ROCLASI · March 3, 2007, 6:21pm

Hi Harjo,

So in short, you want all HTML tags stripped except bold and italic.
What about entities for things like < (&lt;) or > (&gt;), or high-bit characters like the euro sign or TM etc.? Should those be stripped as well, or should they be decoded?

john.allen · March 3, 2007, 6:22pm

You’re a Mac guy right? How about BBEdit or the free version (TextWrangler)? If you need to use the regular expression within Servoy (presumably) then that will at least head you in the right direction. I know regular expressions pretty well but not HTML. BBEdit though is especially used by web designers, etc. and I’m sure will get you going.

Harjo · March 3, 2007, 6:22pm

Hi Robert,

no they should not be stripped, but encoded!

ROCLASI · March 3, 2007, 6:25pm

Hi Harjo,

But should they stay untouched so they still render correctly in HTML, or do you want them as plain text?

Harjo · March 3, 2007, 6:25pm

yes plain text! :-)

nice challenge, huh? ;-)

Harjo · March 3, 2007, 6:35pm

I found something here:

http://www.jamescrooke.co.uk/articles/r … trip-tags/

but I can’t translate that to javascript.

It’s a nice one, because there is an argument, which tags you want to allow!

(oh yeah, the <br> is allowed also! ;-) )

ROCLASI · March 3, 2007, 8:00pm

That code would translate to this:

```javascript
/*
   ===============================================================================

   Based on stripHTML (ASP script) by James Crooke
   http://www.jamescrooke.co.uk/articles/regular-expression-asp-strip-tags/

   Adapted for Servoy by Robert J.C. Ivens, ROCLASI Software Solutions

   ===============================================================================
*/
var sHTML       = arguments[0],   // the HTML to clean up
    sAllowTags  = arguments[1],   // comma-separated list of allowed tags, e.g. "b,i,br"
    aMatches    = null,
    sTagName    = "";

// Normalize the list to ",b,i,br," so each tag name can be looked up as ",name,"
sAllowTags      = ("," + utils.stringReplace(sAllowTags, " ", "") + ",").toLowerCase();
// Collect every tag in the HTML, including tags that span several lines
aMatches        = sHTML.match(/<(.|\n)+?>/g);

if ( aMatches )
{
    for ( var i = 0 ; i < aMatches.length ; i++ )
    {
        // Reduce the tag to its bare name, e.g. "</font>" becomes "font"
        sTagName   = aMatches[i].replace(/<(\/?)(\w+)[^>]*>/,"$2");
        sTagName   = "," + sTagName.toLowerCase() + ",";

        // Remove the tag everywhere unless its name is in the allow list
        if ( utils.stringPatternCount(sAllowTags, sTagName) == 0 )
        {
            sHTML = utils.stringReplace(sHTML, aMatches[i], "");
        }
    }
}
return sHTML;
```

Call this method like so:

```javascript
sHTML = myMethodName(sHTML, "b,i,br");
```

But this still doesn't solve your HTML entity de-encoding problems. Also embedded CSS stylesheets are not filtered out.
Maybe for a next version ;)

Hope this helps.
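
For anyone who wants to try the allow-list idea outside Servoy: it can probably also be done in a single pass with a replace callback, without the utils functions. This is only a minimal sketch in plain JavaScript; the function name stripTags is made up for this example and is not part of the method above or of Servoy.

```javascript
// Hypothetical helper (name invented for this sketch), plain JavaScript only.
function stripTags(sHTML, sAllowTags) {
    // Build a lookup of allowed tag names: "b, i,br" -> { b: true, i: true, br: true }
    var oAllowed = {};
    var aNames = (sAllowTags || "").toLowerCase().split(",");
    for (var i = 0; i < aNames.length; i++) {
        var sName = aNames[i].replace(/\s+/g, "");
        if (sName) {
            oAllowed[sName] = true;
        }
    }

    // [^>]+ also matches line breaks, so tags spanning several lines are found.
    return sHTML.replace(/<[^>]+>/g, function (sTag) {
        // Extract the tag name; comments and doctypes have none and get stripped.
        var aName = /^<\/?(\w+)/.exec(sTag);
        var sTagName = aName ? aName[1].toLowerCase() : "";
        return oAllowed[sTagName] === true ? sTag : "";
    });
}
```

It would be called the same way, e.g. `sHTML = stripTags(sHTML, "b,i,br");`. Like the method above, it leaves the text between the tags alone, so entities and the contents of script or style blocks still need separate handling.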

ROCLASI · March 3, 2007, 10:13pm

Okay, if you want to filter out the complete header (title, meta tags, etc.) of a webpage, plus any embedded scripts and stylesheets, add the following code right after the variable declarations.

```javascript
sHTML = sHTML.replace(/<head>.+<\/head>/,"");        // Strip the whole header
sHTML = sHTML.replace(/<script.*>.+<\/script>/g,""); // Strip any embedded JavaScript
sHTML = sHTML.replace(/<style.*>.+<\/style>/g,"");   // Strip any embedded StyleSheets
```

Hope this helps.
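
One thing to watch with these three patterns: in JavaScript the dot does not match line breaks, and `.+` is greedy, so they only behave when a block sits on a single line and occurs only once on that line. A possible variant, as a sketch only, uses `[\s\S]*?` to match lazily across lines and the `i` flag to catch upper-case tags. The function name stripHeadScriptStyle is hypothetical.

```javascript
// Hypothetical helper (name invented for this sketch), plain JavaScript only.
function stripHeadScriptStyle(sHTML) {
    return sHTML
        // \b keeps <head ...> from also matching <header>
        .replace(/<head\b[^>]*>[\s\S]*?<\/head>/gi, "")
        // [\s\S]*? matches any character including line breaks, as few as possible
        .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "")
        .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, "");
}
```

Because the match is lazy, two script blocks in the same document are removed separately instead of everything between the first opening tag and the last closing tag.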

Westy · March 3, 2007, 11:13pm

Maybe getAsPlainText can help as described at:
http://www.servoymagazine.com/home/2005 … on_ge.html

Dean Westover

ROCLASI · March 3, 2007, 11:19pm

Heh, now you tell me.

That would indeed solve the whole decoding issue, but it also strips out any bold and italic tags.
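
If getAsPlainText is not an option because of the bold and italic tags, the entities could also be decoded by hand after the unwanted tags have been stripped. A rough sketch in plain JavaScript: decodeEntities and its small entity table are made up for this example, the table would need extending for real data, and code points above 0xFFFF are not handled.

```javascript
// Hypothetical helper (name invented for this sketch), plain JavaScript only.
function decodeEntities(sText) {
    // Deliberately small table; entity names are case-sensitive.
    var oNamed = {
        nbsp: " ",
        lt: "<",
        gt: ">",
        quot: "\"",
        apos: "'",
        euro: "\u20AC",
        trade: "\u2122",
        copy: "\u00A9"
    };

    return sText
        // Numeric entities: &#8364; (decimal) or &#x20AC; (hexadecimal)
        .replace(/&#(x[0-9a-f]+|\d+);/gi, function (sMatch, sCode) {
            var nCode = /^x/i.test(sCode) ? parseInt(sCode.substring(1), 16) : parseInt(sCode, 10);
            return String.fromCharCode(nCode);
        })
        // Named entities from the table; unknown names are left untouched
        .replace(/&([a-zA-Z]+);/g, function (sMatch, sName) {
            return oNamed.hasOwnProperty(sName) ? oNamed[sName] : sMatch;
        })
        // &amp; goes last so that "&amp;lt;" ends up as "&lt;" and not "<"
        .replace(/&amp;/g, "&");
}
```

Note that &nbsp; is turned into a normal space here, which is a choice made for this sketch. Once &lt; and &gt; are decoded they can no longer be told apart from the < and > of the remaining bold and italic tags, so this step should run last.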

Harjo · March 3, 2007, 11:36pm

thanks Robert,

this will give me a jumpstart for sure!
I think I have a solution for the encoding problem.

As soon as I have tested it, I will post it here!

ROCLASI · February 1, 2008, 11:02am

It seems my old method doesn’t work that well with multi-line matches.
Here is a method that does work.

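```javascript
var sSource      = arguments[0],   // text to process
    sFrom        = arguments[1],   // start marker, e.g. "<head>"
    sTo          = arguments[2],   // end marker, e.g. "</head>"
    sReplaceWith = arguments[3],   // text that replaces each removed block
    nStart       = 0,
    nEnd         = 0,
    sTmp         = "";

nStart = sSource.indexOf(sFrom);
// Look for the end marker only after the start marker
nEnd   = sSource.indexOf(sTo, nStart + sFrom.length);
while (nStart > -1 && nEnd > -1) {

    // Keep everything before the start marker, add the replacement,
    // then keep everything after the end marker
    sTmp  = utils.stringLeft(sSource, nStart) + sReplaceWith;
    sTmp += utils.stringMiddle(sSource, nEnd + sTo.length + 1, sSource.length - nEnd);
    sSource = sTmp;

    // Continue searching after the text that was just inserted
    nStart = sSource.indexOf(sFrom, nStart + sReplaceWith.length);
    nEnd   = sSource.indexOf(sTo, nStart + sFrom.length);

}
return sSource;
```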

Just call that like so:

```javascript
sHTML = myReplace(sHTML,"<head>","</head>","");
sHTML = myReplace(sHTML,"<script","</script>","");
```

Hope this helps.
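
One caveat: indexOf is case-sensitive, so markers like `<HEAD>` or `<SCRIPT` in older HTML would not be found. A possible case-insensitive variant in plain JavaScript, as a sketch only: the name replaceBetween is hypothetical, and it assumes that lower-casing does not change the length of the string, which should hold for ordinary HTML markup.

```javascript
// Hypothetical helper (name invented for this sketch), plain JavaScript only.
function replaceBetween(sSource, sFrom, sTo, sReplaceWith) {
    // Search in a lower-cased copy, but cut pieces from the original text.
    var sLower     = sSource.toLowerCase(),
        sFromLower = sFrom.toLowerCase(),
        sToLower   = sTo.toLowerCase(),
        sResult    = "",
        nPos       = 0,
        nStart     = sLower.indexOf(sFromLower),
        nEnd;

    while (nStart > -1) {
        // The end marker must come after the start marker
        nEnd = sLower.indexOf(sToLower, nStart + sFromLower.length);
        if (nEnd === -1) {
            break; // no closing marker: leave the rest untouched
        }
        sResult += sSource.substring(nPos, nStart) + sReplaceWith;
        nPos = nEnd + sToLower.length;
        nStart = sLower.indexOf(sFromLower, nPos);
    }
    // Append whatever follows the last removed block
    return sResult + sSource.substring(nPos);
}
```

The call has the same shape as above, e.g. `sHTML = replaceBetween(sHTML, "<script", "</script>", "");`.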
