WordPress: Output Clean and Valid HTML Content

I’ve found that, more often than not, clients generate a lot of garbage HTML in the post content: div, p, a and other tags that haven’t been opened or closed properly. This makes the website fail W3C validation, which, among other things, could affect the site’s ranking and its accessibility. Of course, asking clients to switch to the HTML view to clean up the content is out of the question, so a solution is needed. Since this can happen on any site, it is something we should account for up front when building a professional website.

→ Clean The Content Function: ↓ Download

After some research on the subject I found htmLawed, a PHP script that purifies and filters HTML and works incredibly well. Props to the htmLawed team for their awesome work! To merge it with WordPress, I wrote a little function that automatically filters the post content. The results are amazing: htmLawed cleans up the bad markup and makes the pages validate again right away!

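	include_once ( TEMPLATEPATH . '/htmLawed.php' ); // htmLawed.php must reside in the theme folder.

	// Strips empty and presentational tags, then hands the rest to htmLawed for repair.
	function clean_the_content( $content )
	{
		// Empty <p>, <a> and <span> elements, plus opening and closing <font> tags.
		// \b keeps <p from matching <pre>, \s* allows any amount of whitespace, i ignores case.
		$szRemoveFilter = array( '~<p\b[^>]*>\s*</p>~i', '~<a\b[^>]*>\s*</a>~i', '~<font\b[^>]*>~i', '~</font>~i', '~<span\b[^>]*>\s*</span>~i' );
		$szPostContent = preg_replace( $szRemoveFilter, '', $content );

		// htmLawed filters and repairs the remaining markup.
		return htmLawed( $szPostContent );
	}

	// Run the cleanup every time post content is output.
	add_filter( 'the_content', 'clean_the_content' );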
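
To get a feel for what the regex step does before htmLawed takes over, here is a minimal standalone sketch. It needs neither WordPress nor htmLawed. The function name `strip_empty_tags_demo` and the sample markup are made up for illustration, and the patterns are a slightly tightened variant (word boundaries, `\s*`, case-insensitive), not necessarily what you have in your theme.

```php
<?php
// Hypothetical helper, for illustration only: a variant of the regex step
// in clean_the_content() that runs without WordPress or htmLawed.
function strip_empty_tags_demo( $html )
{
	$patterns = array(
		'~<p\b[^>]*>\s*</p>~i',       // empty paragraphs
		'~<a\b[^>]*>\s*</a>~i',       // empty links
		'~</?font\b[^>]*>~i',         // <font> tags (the text inside is kept)
		'~<span\b[^>]*>\s*</span>~i', // empty spans
	);

	// The patterns are applied one after another to the whole string.
	return preg_replace( $patterns, '', $html );
}

// Made-up sample of the kind of markup a visual editor can leave behind.
$dirty = '<p></p><p>Hello <font color="red">world</font><span> </span></p><a href="#"></a>';

echo strip_empty_tags_demo( $dirty ); // prints: <p>Hello world</p>
```

Note that this step only removes clutter. Unclosed or badly nested tags are left as they are, which is why the result still goes through htmLawed afterwards.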

To use this function, simply drop the two files into your theme directory and you are set to go! Hope you find this post useful. Cheers!

Friday, October 30th, 2009 WordPress

10 Responses to “WordPress: Output Clean and Valid HTML Content”

  1. This sounds perfect! I will check it out immediately!

    Connie on November 1, 2009.

  2. Nice idea. I find it quite funny how the default theme doesn’t validate yet they put a link to the validator in the blogroll :)

    Ben on November 1, 2009.

  3. Hi, nice solution guys. This is something that really annoys me with content management systems: you work on a site perfecting every detail, then you hand it over to the client and they wreck it with rubbish content. Obviously part of the designer’s brief is to ensure an end-to-end solution, but sometimes it doesn’t work out that way, so this could be a real time saver in educating the client.

    Don’t suppose anyone has found a similar solution for Drupal? Drupal is even worse than WordPress for this sort of problem.

    Nice work!

    Shane Giffiths on November 2, 2009.

  4. Wow, thanks for the solution … this really helps me :D

    FoO Iskandar on November 2, 2009.

  5. @Shane,

    it’s not only the clients who wreck the valid code; unfortunately a lot of modules do that as well, and there this nifty tool won’t help…

    Must I name Joomla!? Those thousand shitty modules with spelling errors, code mistakes, wrong character sets? ;=(

    Connie on November 2, 2009.

  6. Great post. But most of the time I see errors for ampersands in URLs. Is it possible to also replace these unencoded ampersands with this code?

    Thanks,
    Pierre

    Pierre on November 4, 2009.

About Me

Matt Varone - Matias Varone - sksmatt
Hi there,

I'm a freelance creative web developer, UI designer and hobbyist musician.
