Importing of Google Reader Cached Feeds


Importing of Google Reader Cached Feeds

Postby gbcox » 30 Apr 2013, 16:36

Google Reader allows you to extract feed history, up to a limit of 1000 articles.

Here are two articles which explain how to extract the data:
http://googlesystem.blogspot.com.br/200 ... oogle.html
http://ashleyangell.com/2011/01/export- ... le-reader/

It would be nice to be able to import this information into the tt-rss database. I've seen that you can import starred or shared items, but that isn't what I'm talking about. I'm interested in importing an entire feed history (my understanding is that this is limited to 1000 articles, but that is sufficient for most purposes) into the active database so I can search for information. The end result would be that each imported feed starts with 1000 articles from the Google cache, and new articles are then added through the normal update procedure.

This could possibly also be adapted for people who want to switch from RSSOwl, etc. to tt-rss without losing their feed history.
gbcox
Bear Rating Master
 
Posts: 145
Joined: 25 Apr 2013, 00:52

Re: Importing of Google Reader Cached Feeds

Postby KestL » 03 May 2013, 07:39

Hi!
Here you can find a guide on how to extract more than 1000 items.
KestL
Bear Rating Trainee
 
Posts: 2
Joined: 03 May 2013, 07:35

Re: Importing of Google Reader Cached Feeds

Postby robinmarlow » 21 May 2013, 10:01

From a brief look at the XML produced by these methods, it looks as if it wouldn't be too hard to mangle it so that the XML import plugin would cope with it.
I'll try and dust off my perl / look at the plugin this weekend.

R
robinmarlow
Bear Rating Trainee
 
Posts: 11
Joined: 21 May 2013, 09:58

Re: Importing of Google Reader Cached Feeds

Postby lotrfan » 21 May 2013, 20:18

A quick Perl script, just to download the feeds... parsing will have to be done later.
https://gist.github.com/lotrfan/28c4a266468bb7658e95

To use (on Linux...):
1. Download the gist, and save it somewhere on your hard drive
2. Run
Code:
./reader.pl <prefix> <feed url>

e.g.,
Code:
./reader.pl 'google_blog' 'http://googleblog.blogspot.com/feeds/posts/default'

3. Log in... if you use two-factor authentication, either turn it off, create an app-specific password, or just create a new Google account.
Don't blame me if Google gets mad at you for this.
4. When you're done, you might want to delete the app-specific password (or the throwaway account) you created.

The example above will create 'google_blog-01.xml', 'google_blog-02.xml', etc. containing all of the articles.
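
If you're curious what it actually does, the core of the script boils down to something like this (a rough PHP sketch rather than the actual Perl in the gist; the ClientLogin flow, the /reader/atom/feed/ endpoint, and the continuation-token handling are from memory, so treat those details as assumptions):
Code:
<?php
// Rough sketch of the download loop (assumed endpoints: Google's old
// ClientLogin service and the /reader/atom/feed/ endpoint with "c"
// continuation tokens; credentials below are placeholders).
$prefix  = $argv[1];   // e.g. 'google_blog'
$feedUrl = $argv[2];   // e.g. 'http://googleblog.blogspot.com/feeds/posts/default'

// 1. Log in with ClientLogin to get an auth token.
$post = http_build_query(array(
   'accountType' => 'HOSTED_OR_GOOGLE',
   'Email'       => 'you@example.com',
   'Passwd'      => 'app-specific-password',
   'service'     => 'reader',
));
$resp = file_get_contents('https://www.google.com/accounts/ClientLogin', false,
   stream_context_create(array('http' => array('method' => 'POST', 'content' => $post,
      'header' => "Content-Type: application/x-www-form-urlencoded\r\n"))));
preg_match('/^Auth=(\S+)/m', $resp, $m);
$auth = $m[1];

// 2. Page through the feed history, 1000 items at a time, following the
//    <gr:continuation> token until there is none left.
$continuation = '';
for ($page = 1; ; $page++) {
   $url = 'http://www.google.com/reader/atom/feed/' . urlencode($feedUrl) . '?n=1000'
      . ($continuation ? '&c=' . urlencode($continuation) : '');
   $xml = file_get_contents($url, false, stream_context_create(array('http' =>
      array('header' => "Authorization: GoogleLogin auth=$auth\r\n"))));
   file_put_contents(sprintf('%s-%02d.xml', $prefix, $page), $xml);
   if (!preg_match('|<gr:continuation>([^<]+)</gr:continuation>|', $xml, $m)) break;
   $continuation = $m[1];
}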

Some possible issues:
  • Running on Windows... I might decide to make a PHP version of this, so it could run on everything TT-RSS runs on
  • On one of the links mentioned above, someone said that the method used doesn't always work; this didn't seem to be an issue when I was testing it, though.
  • You might be missing some of the perl modules used (they should all be on CPAN)
  • It seems like, for the example above, not all of the articles are returned, but they go back to 2007ish.

Feel free to comment/suggest improvements.
lotrfan
Bear Rating Disaster
 
Posts: 71
Joined: 18 Mar 2013, 00:42

Re: Importing of Google Reader Cached Feeds

Postby robinmarlow » 21 May 2013, 21:55

Crikey nice job! That would have taken me a while to code up.
My only suggestion for improvement would be to let it parse the subscriptions.xml that Google Takeout produces to get the feed names... but that isn't really needed unless you've got a lot of feeds.

Further poking around in the XML import plugin (plugins/googlereaderimport/init.php) suggests it shouldn't need much modification to parse these files.
robinmarlow
Bear Rating Trainee
 
Posts: 11
Joined: 21 May 2013, 09:58

Re: Importing of Google Reader Cached Feeds

Postby robinmarlow » 23 May 2013, 22:21

Here is a parsing hack:
https://gist.github.com/robinmarlow/b2189510b5dea45a9597

This will take the XML from either lotrfan's reader.pl or the manual method gbcox mentioned and turn it into an XML file that the import/export plugin can use.

Run it like this (Linux again ;o):

php readerXML_to_ttrssXML.php input.xml > output.xml

You may need to gzip the output to get the plugin to load it.

Works for the three feeds I've tried so far, but there could be others that don't play nice.
Let me know how you get on.
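
For anyone who'd rather roll their own, the conversion boils down to roughly this (a rough sketch, not the gist itself: apart from content, label_cache and the schema value noted further down the thread, the element and attribute names are guesses at the import/export format, so compare them against an export from your own tt-rss instance):
Code:
<?php
// Rough sketch: read one of the Reader atom dumps and emit something shaped
// like the import/export plugin's XML. Element names other than content and
// label_cache are guesses; check them against an export from your instance.
$atomNs = 'http://www.w3.org/2005/Atom';

$in = new DOMDocument();
$in->load($argv[1]);                            // e.g. google_blog-01.xml

$out  = new DOMDocument('1.0', 'UTF-8');
$root = $out->appendChild($out->createElement('articles'));
$root->setAttribute('schema-version', '121');   // 121 = tt-rss 1.8 (mentioned later in the thread)

foreach ($in->getElementsByTagNameNS($atomNs, 'entry') as $entry) {
   $get = function ($name) use ($entry, $atomNs) {
      $list = $entry->getElementsByTagNameNS($atomNs, $name);
      return $list->length ? $list->item(0)->textContent : '';
   };
   $fields = array(
      'guid'        => $get('id'),
      'title'       => $get('title'),
      'updated'     => date('Y-m-d H:i:s', strtotime($get('published')) ?: time()),
      'content'     => $get('content'),
      'label_cache' => '',
   );
   $article = $root->appendChild($out->createElement('article'));
   foreach ($fields as $tag => $value) {
      $article->appendChild($out->createElement($tag))
              ->appendChild($out->createTextNode($value));
   }
}
echo $out->saveXML();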

Robin
robinmarlow
Bear Rating Trainee
 
Posts: 11
Joined: 21 May 2013, 09:58

Re: Importing of Google Reader Cached Feeds

Postby gbcox » 26 May 2013, 19:20

I tried the parsing hack; the text is displaying, but none of the photos...
gbcox
Bear Rating Master
 
Posts: 145
Joined: 25 Apr 2013, 00:52

Re: Importing of Google Reader Cached Feeds

Postby robinmarlow » 26 May 2013, 21:34

Halfway there then ;o)

What is the feed address? I'll take a look and see what's going wrong.

R
robinmarlow
Bear Rating Trainee
 
Posts: 11
Joined: 21 May 2013, 09:58

Re: Importing of Google Reader Cached Feeds

Postby gbcox » 27 May 2013, 01:46

gbcox
Bear Rating Master
 
Posts: 145
Joined: 25 Apr 2013, 00:52

Re: Importing of Google Reader Cached Feeds

Postby lotrfan » 27 May 2013, 06:43

It looks like the plugin strips all HTML tags from the XML before inserting it into the database. A quick fix (the change is also on my GitHub account):
In plugins/import_export/init.php, change (around line 238):
Code:
if ($child->nodeName != 'label_cache')
   $article[$child->nodeName] = db_escape_string($child->nodeValue);
else
   $article[$child->nodeName] = $child->nodeValue;

to
Code:
switch ($child->nodeName) {
case 'label_cache':
   $article[$child->nodeName] = $child->nodeValue;
   break;
case 'content':
   // passing false as the second argument keeps the HTML in the content
   $article[$child->nodeName] = db_escape_string($child->nodeValue, false);
   break;
default:
   $article[$child->nodeName] = db_escape_string($child->nodeValue);
}

or
Code:
// same as the original, but tags are now kept in every field
if ($child->nodeName != 'label_cache')
   $article[$child->nodeName] = db_escape_string($child->nodeValue, false);
else
   $article[$child->nodeName] = $child->nodeValue;


The first way will only leave tags in the content; the second will leave them in everything. It probably doesn't really matter which one you pick.
lotrfan
Bear Rating Disaster
 
Posts: 71
Joined: 18 Mar 2013, 00:42

Re: Importing of Google Reader Cached Feeds

Postby lotrfan » 27 May 2013, 06:49

By the way, nice job on the parser, robinmarlow! I've only tried it on the above feed (on which it works beautifully), as I haven't decided if I'm going to import all of my old feeds... My TT-RSS server isn't the most powerful machine; I'm not sure it can handle the tens of thousands of additional articles, as some queries already take a while...
lotrfan
Bear Rating Disaster
 
Posts: 71
Joined: 18 Mar 2013, 00:42

Re: Importing of Google Reader Cached Feeds

Postby gbcox » 27 May 2013, 17:47

@lotrfan - Thanks. I went to the plugins/import_export directory, renamed init.php to init.php.dist, and replaced it with your version of init.php. Tested it on three feeds and all work fine now. Checked the error log and there are no errors there either. @Fox, do you think this change could be folded into the trunk, or do you think it would be better to create a separate plugin?
gbcox
Bear Rating Master
 
Posts: 145
Joined: 25 Apr 2013, 00:52

Re: Importing of Google Reader Cached Feeds

Postby recognitium » 02 Jul 2013, 08:44

@lotrfan and @robinmarlow, thanks for these great efforts. It seems I could use them to do something similar that I wanted to do.

You see, I downloaded all my tagged items as some form of XML (one XML file per tag) using the feed-archive tool from http://readerisdead.com/

One of the files I obtained from the tool above was:

http://www.imperialismoedependencia.org/archivedtags/labelado2.xml

My idea would be to import them, since it's my own old curated news. My first try was to upload them and subscribe to the "dead" feeds. Even though I set purge = 0 and a 45000-hour limit for new items (I am starting this user from scratch, and I want this properly set up before adding "live" feeds), none of the articles show up.

After that I found your posts. I changed init.php as indicated, and tried using @robinmarlow's parsing hack on the files I downloaded. No luck either; the parser just generates 42 B files.

You can laugh at me... obviously I am totally lost in this jungle, and, unfortunately, my computer and web skills are pretty low. Meanwhile I'll try to understand enough PHP to hack my own parser, but any help I could get would be great.
recognitium
Bear Rating Trainee
 
Posts: 14
Joined: 01 Jul 2013, 21:35

Re: Importing of Google Reader Cached Feeds

Postby robinmarlow » 02 Jul 2013, 10:34

Looking at your XML, it seems to have "atom:" prefixed to all the tags, which confuses the parser.

As a quick hack I tried removing all the "atom:" prefixes using
Code:
sed -e 's/atom://g' labelado2.xml > test.xml

(or a text editor search for "atom:", replacing it with "", would do just as well)

I've updated the gist (https://gist.github.com/robinmarlow/b2189510b5dea45a9597) to use schema value 121 for version 1.8.

That seems to work and imports 21 articles for that feed, but it is a quick and dirty hack!

Robin
robinmarlow
Bear Rating Trainee
 
Posts: 11
Joined: 21 May 2013, 09:58

Re: Importing of Google Reader Cached Feeds

Postby recognitium » 02 Jul 2013, 14:57

Thanks a lot. I will try it now

---- UPDATE ---

As far as I could see, it worked like a charm!

I had 166 archived tags downloaded like that, and almost all of them were properly imported. I just made a rudimentary bash script to run the sed command and the hack on all the files (roughly what the sketch below does). Thank you so much @robinmarlow!!
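
For anyone else with a directory full of these archived-tag files, the whole batch boils down to something like this (a rough sketch with made-up paths; my actual script was a couple of lines of bash doing the same thing):
Code:
<?php
// Rough sketch (hypothetical paths): strip the "atom:" prefixes from every
// archived-tag file, then run robinmarlow's converter on each cleaned copy.
foreach (glob('archivedtags/*.xml') as $in) {
   $clean = preg_replace('/\.xml$/', '-clean.xml', $in);
   $ttrss = preg_replace('/\.xml$/', '-ttrss.xml', $in);
   file_put_contents($clean, str_replace('atom:', '', file_get_contents($in)));
   shell_exec('php readerXML_to_ttrssXML.php ' . escapeshellarg($clean)
      . ' > ' . escapeshellarg($ttrss));
}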

The only few that had an error were big ones (exceeding several MB). How exactly should I gzip them? Noob that I am, I tried to upload the archives directly as "tar.gz", and, of course, the import/export plugin couldn't read the XML.
recognitium
Bear Rating Trainee
 
Posts: 14
Joined: 01 Jul 2013, 21:35
