Missing < > & characters from article content


Postby adq on Mon Oct 06, 2008 12:01

This isn't a support request, just a problem I've tracked down. I updated my server recently and suddenly noticed that tt-rss was stripping HTML characters (<, >, &) from feeds, seemingly at random. A bit of hunting showed it was a problem with libxml2 and PHP. The bug at http://bugs.php.net/bug.php?id=45996 describes exactly what I am seeing.

I downgraded my server to libxml2 2.6.32 and it started working properly again.

I also tried upgrading to libxml2 2.7.1 and rebuilding PHP against that (in case it had some compile-time detection logic), but the problem recurred.

I'm happy with an older version of libxml2, and obviously this isn't a tt-rss issue; just thought I'd mention it in case someone else has the same problem.
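
For anyone wanting to see which libxml2 their PHP is linked against, here is a minimal sketch. The helper name libxml_may_strip_entities is made up for illustration and is not part of tt-rss. The 2.7.0 cut-off is an assumption: this thread only confirms that 2.6.32 works and 2.7.1 does not. The check also says nothing about whether a given PHP build already contains a fix.

```php
<?php
// Hypothetical helper, not part of tt-rss: report whether the libxml2 version
// falls in the range this thread found problematic.
// Assumption: 2.7.0 is the first affected release; the thread only confirms
// that 2.6.32 works and 2.7.1 does not.
function libxml_may_strip_entities($version = LIBXML_DOTTED_VERSION)
{
    return version_compare($version, '2.7.0', '>=');
}

// LIBXML_DOTTED_VERSION is defined by PHP's libxml extension.
printf(
    "libxml2 %s: %s\n",
    LIBXML_DOTTED_VERSION,
    libxml_may_strip_entities() ? 'possibly affected' : 'probably fine'
);
```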
adq
 
Posts: 1
Joined: Mon Oct 06, 2008 11:55

Re: Missing < > & characters from article content

Postby fox on Mon Oct 06, 2008 12:25

Eh, I never encountered this problem. Thanks for reporting!
fox
Site Admin
 
Posts: 1204
Joined: Sat Aug 27, 2005 21:53
Location: Saint-Petersburg, Russia

Re: Missing < > & characters from article content

Postby Qwark on Mon Oct 06, 2008 22:14

I have (had) the same problem as well.

I modified simplepie.inc to replace the problematic entities with their numeric equivalents just before parsing. The XML parser leaves them alone and everything works again as expected (for me).

(Remember to set ENABLE_SIMPLEPIE to true in config.php to make tt-rss actually use SimplePie.) I guess this works just as well in Magpie if you paste these three lines in the right place.



Code:
diff -Naur --exclude .backups --exclude icons --exclude '*.png*' tt-rss-20080919.org/simplepie/simplepie.inc tt-rss-20080919.alex/simplepie/simplepie.inc
--- tt-rss-20080919.org/simplepie/simplepie.inc 2008-09-19 02:00:03.000000000 +0200
+++ tt-rss-20080919.alex/simplepie/simplepie.inc    2008-09-21 01:29:13.000000000 +0200
@@ -12761,6 +12761,10 @@
        xml_set_character_data_handler($xml, 'cdata');
        xml_set_element_handler($xml, 'tag_open', 'tag_close');

+       $data=str_replace("&lt;","&#60;",$data);
+       $data=str_replace("&gt;","&#62;",$data);
+       $data=str_replace("&amp;","&#38;",$data);
+
        // Parse!
        if (!xml_parse($xml, $data, true))
        {
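
The same three replacements from the patch above can be tried outside SimplePie. This is only a sketch: the function name numeric_entities is hypothetical, and it does the replacements with a single array-based str_replace() call, which applies them in the same order as the patch.

```php
<?php
// Hypothetical helper name; the three replacements are the ones from the patch.
// str_replace() with arrays applies the pairs in order: &lt;, then &gt;, then &amp;.
function numeric_entities($data)
{
    return str_replace(
        array('&lt;', '&gt;', '&amp;'),
        array('&#60;', '&#62;', '&#38;'),
        $data
    );
}

echo numeric_entities('<title>a &lt; b &amp;&amp; c &gt; d</title>'), "\n";
// prints: <title>a &#60; b &#38;&#38; c &#62; d</title>
```

Other entities such as &quot; are left untouched, and a double-escaped &amp;lt; becomes &#38;lt;, which still decodes to the text "&lt;" after parsing.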
Qwark
 
Posts: 3
Joined: Tue Sep 23, 2008 1:28

Re: Missing < > & characters from article content

Postby thecount on Thu Oct 16, 2008 17:02

Same problem, Qwark's fix works for me.
thecount
 
Posts: 8
Joined: Mon Mar 20, 2006 22:26

Re: Missing < > & characters from article content

Postby fox on Tue Oct 21, 2008 12:52

I'm not sure I can merge the fix into trunk, though - I think it might break things for those with non-broken libxml.

I'll keep this thread sticky as a reference for people whose systems experience this problem.

Edit: The patch above also solves the problem with missing & (&amp;) in article links.
fox
Site Admin
 
Posts: 1204
Joined: Sat Aug 27, 2005 21:53
Location: Saint-Petersburg, Russia

Re: Missing < > & characters from article content

Postby padde on Tue Oct 28, 2008 18:33

Same problem here... php 5.2.6 + libxml2 2.7.2 (Gentoo).
padde
 
Posts: 7
Joined: Tue Oct 28, 2008 18:30

Re: Missing < > & characters from article content

Postby candrews on Tue Oct 28, 2008 23:00

I didn't notice this thread, so I reported a bug here: http://tt-rss.org/trac/ticket/224
candrews
 
Posts: 1
Joined: Tue Oct 28, 2008 22:59

Re: Missing < > & characters from article content

Postby fox on Tue Oct 28, 2008 23:23

I've added the entry to the FAQ which links to this thread and the ticket you created.
fox
Site Admin
 
Posts: 1204
Joined: Sat Aug 27, 2005 21:53
Location: Saint-Petersburg, Russia

Re: Missing < > & characters from article content

Postby padde on Wed Oct 29, 2008 10:35

Umm, could somebody provide a patch for magpie?

I'm trying to package tt-rss for Gentoo, but this is a showstopper (as the bug shows up with the versions of php/libxml2 that are being shipped with Gentoo currently). Until the problems in php/libxml2/wherever are fixed, I'll automatically apply the patch(es) during installation to provide a working tt-rss to Gentoo users.
padde
 
Posts: 7
Joined: Tue Oct 28, 2008 18:30

Re: Missing < > & characters from article content

Postby fox on Wed Oct 29, 2008 11:32

Try adding the same three str_replace() calls after magpierss/rss_parse.inc:158, e.g.

Code:
    xml_set_character_data_handler( $this->parser, 'feed_cdata' );

    // add these three lines
    $source=str_replace("&lt;","&#60;",$source);
    $source=str_replace("&gt;","&#62;",$source);
    $source=str_replace("&amp;","&#38;",$source);

    // the existing xml_parse() call and the rest of the function follow here
    ...
fox
Site Admin
 
Posts: 1204
Joined: Sat Aug 27, 2005 21:53
Location: Saint-Petersburg, Russia

Re: Missing < > & characters from article content

Postby padde on Wed Oct 29, 2008 15:19

Great, that worked. Thanks :)

I attached the two patches (in -Nur format).
Attachments: patches.tar.gz (569 Bytes)
padde
 
Posts: 7
Joined: Tue Oct 28, 2008 18:30

Re: Missing < > & characters from article content

Postby fox on Wed Oct 29, 2008 15:53

I'll try to check how unbroken libxml operates with those tomorrow.
fox
Site Admin
 
Posts: 1204
Joined: Sat Aug 27, 2005 21:53
Location: Saint-Petersburg, Russia

Re: Missing < > & characters from article content

Postby paulproteus on Mon Dec 01, 2008 0:18

Howdy Fox,

Any updates on this?

It seems to me that the suggested patches should have zero impact on a working libxml2.
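
One way to check that claim on a given machine is to parse the same text once with named entities and once with numeric ones and compare the results. This is a rough sketch only: collect_text is a hypothetical helper, not something from tt-rss, SimplePie or Magpie, and the anonymous function needs a PHP version that supports closures.

```php
<?php
// Hypothetical test helper: parse $xml with PHP's xml extension and return
// all character data concatenated, or null if parsing fails.
function collect_text($xml)
{
    $text = '';
    $parser = xml_parser_create('UTF-8');
    // The handler may be called several times per text node, so append.
    xml_set_character_data_handler($parser, function ($p, $chunk) use (&$text) {
        $text .= $chunk;
    });
    $ok = xml_parse($parser, $xml, true);
    return $ok ? $text : null;
}

$named   = collect_text('<t>a &lt; b &amp; c &gt; d</t>');
$numeric = collect_text('<t>a &#60; b &#38; c &#62; d</t>');

// On a correctly working parser both forms should decode to the same text.
var_dump($named === $numeric);
```

If the two results differ, the named form is probably being mangled in the way this thread describes.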
paulproteus
 
Posts: 1
Joined: Mon Dec 01, 2008 0:16

Re: Missing < > & characters from article content

Postby fox on Mon Dec 01, 2008 11:20

Oh crap, I forgot all about it. I'll merge the patches into trunk and see whether it breaks stuff for me.
fox
Site Admin
 
Posts: 1204
Joined: Sat Aug 27, 2005 21:53
Location: Saint-Petersburg, Russia

Re: Missing < > & characters from article content

Postby Bernd on Wed Dec 03, 2008 12:32

There seem to be some more problems with special characters within (image) links.

The source
Code:
<p><img class="alignnone size-full wp-image-957" title="bestandene Diplompr&#252;fungen an Fachhochschulen" src="http://blog.bernd-distler.net/wp-content/uploads/2008/12/vdi_diplompruefungen.png" alt="" width="480" height="320" /></p>

becomes (attached screenshot: umlauts.png, 4.09 KiB)


I have this problem with different feeds, not only with my own :|
Bernd
 
Posts: 22
Joined: Tue May 06, 2008 14:46
