DB Article Duplication

TSM
Bear Rating Trainee
Posts: 13
Joined: 03 Nov 2015, 16:20

DB Article Duplication

Post by TSM » 03 Nov 2015, 16:56

I have recently migrated from Feed-On-Feeds, which served our purposes for a long time; we had made many updates to it to support our needs and fix some performance issues.
One of the things I found fairly useful was that articles were not duplicated within the database if many users were subscribed to the same feed.
It seems that TTRSS does duplicate them, even though through the user_entities it has the possibility of linking back to the same original article. Is this an oversight/problem, or has the code simply not reached that point yet?

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: DB Article Duplication

Post by fox » 03 Nov 2015, 17:57

deduplication is done based on article guids

so basically depends on your feeds

TSM
Bear Rating Trainee
Posts: 13
Joined: 03 Nov 2015, 16:20

Re: DB Article Duplication

Post by TSM » 03 Nov 2015, 19:20

Looking at line 616 in rssfuncs.php, it seems that the final $entry_guid is built from $owner_uid and the item's own GUID. This would make the article entry unique per user even if several users had the same feed.
Maybe instead it should be like this.

Original

Code:

$entry_guid = $item->get_id();
if (!$entry_guid) $entry_guid = $item->get_link();
if (!$entry_guid) $entry_guid = make_guid_from_title($item->get_title());
if (!$entry_guid) continue;

$entry_guid = "$owner_uid,$entry_guid";

$entry_guid_hashed = db_escape_string('SHA1:' . sha1($entry_guid));


Proposed: the owner is only concatenated with the GUID if $entry_guid was derived from the title, since a title-derived GUID shared across users could generate false positives.

Code:

$entry_guid = $item->get_id();
if (!$entry_guid) $entry_guid = $item->get_link();
if (!$entry_guid) {
   // No id or link: fall back to a title-derived GUID, scoped to the owner
   $entry_guid = make_guid_from_title($item->get_title());
   if (!$entry_guid) continue;
   $entry_guid = "$owner_uid,$entry_guid";
}

$entry_guid_hashed = db_escape_string('SHA1:' . sha1($entry_guid));


P.S. The raw GUID is not stored in the DB, so working this out retrospectively is more difficult.
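
To illustrate the effect described above, here is a minimal sketch of the hashing step in isolation. The helper hash_guid() and the example GUID are made up for illustration and are not part of tt-rss; only the "$owner_uid,$entry_guid" prefix and the 'SHA1:' . sha1(...) step come from the code quoted above.

```php
<?php
// Hypothetical helper mirroring the hashing step quoted above;
// hash_guid() is not a tt-rss function.
function hash_guid(string $guid, ?int $owner_uid = null): string {
    // With an owner, prefix the GUID the same way as "$owner_uid,$entry_guid".
    if ($owner_uid !== null) {
        $guid = "$owner_uid,$guid";
    }
    return 'SHA1:' . sha1($guid);
}

$guid = 'http://example.com/posts/42'; // made-up item GUID

// Owner-prefixed: the same item hashes differently for users 1 and 2.
$per_user_a = hash_guid($guid, 1);
$per_user_b = hash_guid($guid, 2);

// No owner prefix: every user ends up with the same hash.
$shared_a = hash_guid($guid);
$shared_b = hash_guid($guid);

var_dump($per_user_a === $per_user_b); // bool(false)
var_dump($shared_a === $shared_b);     // bool(true)
```

Because only the hash is kept, an owner-prefixed hash cannot be turned back into a shared one without the raw GUID.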

JustAMacUser
Bear Rating Overlord
Posts: 373
Joined: 20 Aug 2013, 23:13

Re: DB Article Duplication

Post by JustAMacUser » 03 Nov 2015, 19:40

Do keep in mind that different users can apply different plugins which could result in different article content. One user might use the Readability plugin, another might not; therefore the content would be different for the same article. This type of thing needs to be taken into consideration.
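
A rough sketch of why this matters, assuming a simplified model in which each user's plugins are plain callables applied in order. The function apply_user_plugins() and the strip-tags "plugin" are hypothetical stand-ins, not real tt-rss plugin hooks.

```php
<?php
// Hypothetical stand-in for a per-user article filter plugin;
// this is not a real tt-rss plugin hook.
$strip_tags_plugin = function (string $content): string {
    return strip_tags($content);
};

// Run the incoming content through whatever filters a user has enabled.
function apply_user_plugins(string $content, array $plugins): string {
    foreach ($plugins as $plugin) {
        $content = $plugin($content);
    }
    return $content;
}

$feed_content = '<p>Hello <b>world</b></p>'; // same item from the same feed

$user_a = apply_user_plugins($feed_content, []);                   // no plugins
$user_b = apply_user_plugins($feed_content, [$strip_tags_plugin]); // one plugin

var_dump($user_a === $user_b); // bool(false)
echo $user_b, "\n";            // Hello world
```

The same feed item ends up as two different stored articles, so a single shared row could not serve both users.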

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: DB Article Duplication

Post by fox » 03 Nov 2015, 19:49

your patch is bad, op

e: you can obviously change whatever just pls don't ask me to merge your broken stuff in trunk

TSM
Bear Rating Trainee
Posts: 13
Joined: 03 Nov 2015, 16:20

Re: DB Article Duplication

Post by TSM » 03 Nov 2015, 20:02

JustAMacUser wrote: Do keep in mind that different users can apply different plugins which could result in different article content. One user might use the Readability plugin, another might not; therefore the content would be different for the same article. This type of thing needs to be taken into consideration.


ahh

Could the content_hash or $entry_plugin_data be used to work out whether a user is applying filters, and then only match articles if everything matches? Or there could be a flag indicating that the article has been altered from stock, which would then drive the deduplication: if a user has custom filters it's an 'altered' article; otherwise it's an 'original' article and can be deduped.

AND/OR

A flag in the preferences which can disable all article filters and allow dedupe to work across the whole system.
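
A minimal sketch of the 'altered' versus 'original' idea, for discussion only. The function dedupe_key() and its parameters are hypothetical and do not exist in tt-rss; it assumes the stock feed content is still available at the point where the stored content is hashed.

```php
<?php
// Hypothetical sketch of the "altered vs. original" idea; none of these
// names exist in tt-rss.
function dedupe_key(string $guid, int $owner_uid, string $stock, string $stored): string {
    // An article counts as altered when filters changed its content.
    $altered = sha1($stock) !== sha1($stored);
    // Altered articles stay scoped to their owner; untouched ones share a key.
    $scope = $altered ? "$owner_uid," : '';
    return 'SHA1:' . sha1($scope . $guid);
}

$guid  = 'http://example.com/posts/42'; // made-up item GUID
$stock = '<p>Hello <b>world</b></p>';   // content as delivered by the feed

$k1 = dedupe_key($guid, 1, $stock, $stock);             // user 1, unfiltered
$k2 = dedupe_key($guid, 2, $stock, $stock);             // user 2, unfiltered
$k3 = dedupe_key($guid, 3, $stock, strip_tags($stock)); // user 3, filtered

var_dump($k1 === $k2); // bool(true)
var_dump($k1 === $k3); // bool(false)
```

Users 1 and 2 would share one article row, while user 3 keeps a private copy.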

TSM
Bear Rating Trainee
Posts: 13
Joined: 03 Nov 2015, 16:20

Re: DB Article Duplication

Post by TSM » 03 Nov 2015, 20:02

fox wrote: your patch is bad, op

e: you can obviously change whatever just pls don't ask me to merge your broken stuff in trunk


It was not a patch, just a proposal for discussion.

JustAMacUser
Bear Rating Overlord
Posts: 373
Joined: 20 Aug 2013, 23:13

Re: DB Article Duplication

Post by JustAMacUser » 03 Nov 2015, 20:12

TSM wrote: Could the content_hash or $entry_plugin_data be used to work out whether a user is applying filters, and then only match articles if everything matches? [...] and allow dedupe to work across the whole system.


Honestly, that seems like a lot of extra code and processing for something that isn't that much of an issue. You're trading CPU cycles for SQL storage. One way or another a resource is going to be used. Since the current method works flawlessly, why not keep it?

(Plugins are one of the great features of TT-RSS and the ability to manipulate articles as they come in is a feature many use; therefore, the extra code wouldn't yield any improvements for many users.)

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: DB Article Duplication

Post by fox » 03 Nov 2015, 20:44

TSM wrote:
JustAMacUser wrote: Do keep in mind that different users can apply different plugins which could result in different article content. One user might use the Readability plugin, another might not; therefore the content would be different for the same article. This type of thing needs to be taken into consideration.


ahh

Could the content_hash or $entry_plugin_data be used to work out whether a user is applying filters, and then only match articles if everything matches? Or there could be a flag indicating that the article has been altered from stock, which would then drive the deduplication: if a user has custom filters it's an 'altered' article; otherwise it's an 'original' article and can be deduped.

AND/OR

A flag in the preferences which can disable all article filters and allow dedupe to work across the whole system.


please don't post anymore of your ideas they make my head hurt real bad

it's not 1998 anymore, storage is cheap, anyway listen to JustAMacUser and stop posting thanks

TSM
Bear Rating Trainee
Posts: 13
Joined: 03 Nov 2015, 16:20

Re: DB Article Duplication

Post by TSM » 03 Nov 2015, 20:47

JustAMacUser wrote:
TSM wrote: Could the content_hash or $entry_plugin_data be used to work out whether a user is applying filters, and then only match articles if everything matches? [...] and allow dedupe to work across the whole system.


Honestly, that seems like a lot of extra code and processing for something that isn't that much of an issue. You're trading CPU cycles for SQL storage. One way or another a resource is going to be used. Since the current method works flawlessly, why not keep it?

(Plugins are one of the great features of TT-RSS and the ability to manipulate articles as they come in is a feature many use; therefore, the extra code wouldn't yield any improvements for many users.)


I agree that filters are a useful feature.

In certain scenarios the lack of site-wide de-duplication could be worrying.
Our current FOF DB runs at about 5GB for about 8 users (with de-dupe). We store about 1 year's worth of feeds, and many people have the same feeds, so on that assumption we could be looking at around 30-40GB of storage in TT-RSS; large growth can happen.
This could also have a knock-on effect on Sphinx, with long index times and large index sizes (I've already posted a patch to allow it to do deltas more efficiently).
The Sphinx index query would need to be modified to suit.
The purge facility would need to work both per user and globally.
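
As a rough check of the storage estimate above, the arithmetic can be sketched as follows. It assumes every article takes about the same space and ignores per-user metadata; estimate_size_gb() is a made-up helper, and reading the 30-40GB range as an average of 6 to 8 subscribers per article is an interpretation of the figures, not a measurement.

```php
<?php
// Rough arithmetic only; the inputs are the approximate figures from the post.
$dedup_size_gb = 5; // FOF database size with de-duplication
$users         = 8;

// Without de-duplication each article is stored once per subscriber, so the
// total is roughly the de-duplicated size times the average number of
// subscribers per article.
function estimate_size_gb(float $dedup_size_gb, float $avg_subscribers): float {
    return $dedup_size_gb * $avg_subscribers;
}

echo estimate_size_gb($dedup_size_gb, 6), "\n";      // 30
echo estimate_size_gb($dedup_size_gb, $users), "\n"; // 40
```

The upper figure is the worst case where all 8 users subscribe to exactly the same feeds.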

I can look into the code required to make this all possible.

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: DB Article Duplication

Post by fox » 03 Nov 2015, 20:48

> we store about 1 years worth of feeds,

lmao

> I can look into the code required to make this all possible.

no thanks

e: lol at 40 gigs of articles for eight users oh god

