Commit eb9fddb7 authored by Romain Courteaud's avatar Romain Courteaud

erp5_core: do one catalog call only

Catalog seems to correctly handle the OR parameters.

Do not lose the possible external `query` parameter.
parent 6c606cf3
from Products.ZSQLCatalog.SQLCatalog import Query, ComplexQuery from Products.ZSQLCatalog.SQLCatalog import Query, ComplexQuery
kw.pop('relative_url', None) kw.pop('relative_url', None)
kw.pop('follow_up_uid', None) kw.pop('follow_up_uid', None)
portal_catalog=context.getPortalObject().portal_catalog
follow_up_related_document_list = portal_catalog(
portal_type=portal_type,
follow_up_uid=context.getUid(), **kw)
kw['query'] = Query(relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')) portal = context.getPortalObject()
if follow_up_related_document_list: portal_catalog = portal.portal_catalog
kw['query'] = ComplexQuery(
kw['query'], if portal_type is None:
Query(uid=[x.getUid() for x in follow_up_related_document_list]), portal_type = portal.getPortalDocumentTypeList() + portal.getPortalEmbeddedDocumentTypeList()
logical_operator='or')
return portal_catalog(portal_type=portal_type, limit=limit, **kw) document_query = ComplexQuery(
Query(follow_up_uid=context.getUid()),
Query(relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')),
logical_operator='or'
)
if query is None:
query = document_query
else:
query = ComplexQuery(
query,
document_query,
logical_operator='and'
)
return portal_catalog(
portal_type=portal_type,
query=query,
**kw)
...@@ -50,7 +50,7 @@ ...@@ -50,7 +50,7 @@
</item> </item>
<item> <item>
<key> <string>_params</string> </key> <key> <string>_params</string> </key>
<value> <string>portal_type=(), limit=None, **kw</string> </value> <value> <string>portal_type=None, query=None, **kw</string> </value>
</item> </item>
<item> <item>
<key> <string>id</string> </key> <key> <string>id</string> </key>
......
  • @romain not that you initially added this from scratch, but the created query still seems slow in large instances. While in the same context Metadata tab would be fast. So I wonder if we could simplify here like:

    kw['portal_type'] = context.getPortalObject().getPortalDocumentTypeList()
    context.Base_getRelatedObjectList(**kw)
  • But this not not the same query anymore in such case, and it will not return the same results.

    Anyway, do you know if this script was also slow before my change? If no, it means we have to again split it into 2 queries, while keeping the query parameter support.

  • But this not not the same query anymore in such case, and it will not return the same results.

    But for me semantically it makes sense that Base_getRelatedDocumentList returns what Base_getRelatedObjectList would with an extra filter on portal type. I do not understand why would we need different approach here.

    Anyway, do you know if this script was also slow before my change?

    No, but it is a good idea to check, I will do some benchmarking soon

  • Because Base_getRelatedDocumentList also returns the embedded documents, which Base_getRelatedObjectList does not.

    Maybe what disturb you is the script naming. But changing it seems a bit risky for compatibility reasons.

  • I will do some benchmarking soon

    In my large instance I created also old version, i.e. the version of this script before you your commit (taken from history from b6f4d46d) as Base_getRelatedDocumentListOld. Then I did this code:

    portal = context.getPortalObject()
    person_value = portal.person_module[...] # Here a custom URL of a Person for which this query should return nothing
    
    start_time = DateTime()
    document_list = person_value.Base_getRelatedDocumentList()
    print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400)
    print 'Number of document found %s' % len(document_list)
    
    start_time = DateTime()
    document_list = person_value.Base_getRelatedDocumentListOld()
    print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400)
    print 'Number of document found %s' % len(document_list)
    
    
    start_time = DateTime()
    document_list = person_value.Base_getRelatedDocumentListOld(portal_type=portal.getPortalDocumentTypeList())
    print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400)
    print 'Number of document found %s' % len(document_list)
    
    return printed

    This outputs (tried several times, so times in seconds here are indicative, I get similar results each time):

    Query took in sec: 6.832277
    Number of document found 0
    Query took in sec: 0.009946
    Number of document found 10
    Query took in sec: 0.007806
    Number of document found 0

    So:

    • New code does look many times slower
    • Old code, returns 10 documents that are the sub-objects no matter the type. This I find wrong for a script name Base_getRelatedDocumentList. I checked and Base_viewDocumentList/listbox has a TALES on portal_type:
    python: [[t, t] for t in list(context.getPortalDocumentTypeList())+list(context.getPortalEmbeddedDocumentTypeList())]

    so it works in this case, but for me it does seem wrong

    • Old code with portal_type filtering is also faster.

    Because Base_getRelatedDocumentList also returns the embedded documents, which Base_getRelatedObjectList does not.

    OK, so it is the sub-objects we care mostly, I see this.

    Maybe what disturb you is the script naming. But changing it seems a bit risky for compatibility reasons.

    Yes, but I see the worry and it was not added in your commit here so OK for this.

  • Hi.

    Did you check the above? W is your opinion? I would change it on my fork (most probably put previous version) if we want to keep this version in generic

  • Can you try to do 2 queries instead of 1 in the new Base_getRelatedDocumentList logic?

  • Something like the below maybe (plus removing query from the arguments of the scipt)?

    It is fast in my testing (though still I have testing only on Person with no related documents):

    kw.pop('relative_url', None)
    kw.pop('follow_up_uid', None)
    
    portal = context.getPortalObject()
    portal_catalog = portal.portal_catalog
    
    related_document_list = portal_catalog(
      portal_type=portal.getPortalDocumentTypeList() if portal_type is None else portal_type,
      follow_up_uid=context.getUid(),
      **kw
    )
    
    embedded_document_list = portal.portal_catalog(
      portal_type=portal.getPortalEmbeddedDocumentTypeList() if portal_type is None else portal_type,
      relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_'),
      **kw
    )
    return [x for x in related_document_list] + [x for x in embedded_document_list]

    EDIT: typo

    Edited by Georgios Dagkakis
  • No. I need the query argument. The query argument allows to filter the list of the returned documents.

    And we can't concatenate related_document_list and embedded_document_list, because it will not respect the limit parameter.

  • And we can't concatenate related_document_list and embedded_document_list, because it will not respect the limit parameter.

    True, and sorting will not work either.

    So we would need second query to use the results of first with an or, similar to old version but also keeping the Do not lose the possible external query parameter..

    Something like the below?

    from Products.ZSQLCatalog.SQLCatalog import Query, ComplexQuery
    kw.pop('relative_url', None)
    kw.pop('follow_up_uid', None)
    portal = context.getPortalObject()
    portal_catalog = portal.portal_catalog
    if portal_type is None:
      portal_type = portal.getPortalDocumentTypeList() + portal.getPortalEmbeddedDocumentTypeList()
    
    follow_up_related_document_list = portal_catalog(
               portal_type=portal_type,
               follow_up_uid=context.getUid(), **kw)
    
    if follow_up_related_document_list:
      document_query = ComplexQuery(
        Query(relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')),
        Query(uid=[x.getUid() for x in follow_up_related_document_list]),
        logical_operator='or'
      )
    else:
      document_query = Query(relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_'))
    
    if query is None:
      query = document_query
    else:
      query = ComplexQuery(
        query,
        document_query,
        logical_operator='and'
      )
    
    return portal_catalog(portal_type=portal_type, limit=limit, **kw)
  • Yes, it looks like what I have in mind.

    Did you check the performance with this version?

  • I think that the case of documents being contained as sub-documents is very rare these days, in most cases we only have documents as follow up, so maybe we should first search for relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_') and only combine the queries if this yields something.

    Also please check this comment wendelin!81 (comment 128200) brain.getUid() is a "coding crime".

  • I think that the case of documents being contained as sub-documents is very rare these days, in most cases we only have documents as follow up, so maybe we should first search for relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_') and only combine the queries if this yields something.

    If this is the case indeed

    Also please check this comment wendelin!81 (comment 128200) brain.getUid() is a "coding crime".

    Good catch thanks. Then:

    Query(uid=[x.uid for x in ...])
  • I think that the case of documents being contained as sub-documents is very rare these days, in most cases we only have documents as follow up, so maybe we should first search for relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_') and only combine the queries if this yields something.

    If this is the case indeed

    Mmm, strangely enough this does not work. I have a document with Embedded File so the first query:

    embedded_document_list = portal_catalog(
      relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_'),
      portal_type=portal_type,
      **kw
    )

    returns a document with uid 70056

    Then the second query, having no query in kw is contructed like:

    portal_catalog(
      portal_type=portal_type,
      query=ComplexQuery(
        Query(follow_up_uid=context.getUid()),
        Query(uid=[x.uid for x in embedded_document_list]),
        logical_operator='or'
      ),
      **kw
    )

    stripping security_uids etc looks like:

    SELECT DISTINCT
       catalog.path,   catalog.uid    
    FROM
          (
        catalog AS `catalog`
      INNER JOIN
        category AS `related_follow_up_uid_1_category`
      ON
        related_follow_up_uid_1_category.base_category_uid = 1896 AND
    related_follow_up_uid_1_category.uid = catalog.uid
      )
      
    WHERE 
      1 = 1 
      AND (`catalog`.`portal_type` IN ('Drawing', 'Discussion Thread', 'Spreadsheet', 'File', 'Image', 'PDF', 'Web Page', 'Presentation', 'Sound', 'Test Page', 'Text', 'Video', 'External Video', 'Web Illustration', 'Web Manifest', 'Web Script', 'Web Style', 'Web Table', 'Embedded File')
      AND (`related_follow_up_uid_1_category`.`category_uid` = 89017
      OR `catalog`.`uid` = 70056));

    which returns nothing. To simplify, the following:

    SELECT 
      count(*)
    FROM
          (
        catalog AS `catalog`
      INNER JOIN
        category AS `related_follow_up_uid_1_category`
      ON
        related_follow_up_uid_1_category.base_category_uid = 1896 AND
    related_follow_up_uid_1_category.uid = catalog.uid
      )
      
    WHERE 
      `catalog`.`uid` = 70056;

    gives 0

    While the uid exists, i.e.:

    SELECT 
      count(*)
    FROM
      catalog
    where uid = 70056;

    gives 1.

    So it is the join on follow_up that breaks it?

    If I have the first query on follow_up and second on sub-documents, I do not get this issue

  • In this case, we should maybe use left_join_list=['follow_up_uid']

  • BTW, I find that current version we have in master also has the issue I noted above, Embedded Documents are not found.

    In this case, we should maybe use left_join_list=['follow_up_uid']

    You mean like:

    if embedded_document_list:
      document_query = ComplexQuery(
        Query(follow_up_uid=context.getUid()),
        Query(uid=[x.uid for x in embedded_document_list]),
        logical_operator='or'
      )
      kw['left_join_list']=['follow_up_uid']

    because this seems to return too many documents

  • because this seems to return too many documents

    Bah, sorry I was getting the length of the prited query...

    So it works like above, but it is slower under circumstances. I have:

    • Base_getRelatedDocumentListV2 : Search based on follow_up first
    • Base_getRelatedDocumentListV1 : Search based on relative_url first
    • Base_getRelatedDocumentList : current master

    And my code is:

    person_value = portal.person_module[...]
    
    start_time = DateTime()
    document_list = person_value.Base_getRelatedDocumentListV2()
    print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400)
    print 'Number of document found %s' % len(document_list)
    
    start_time = DateTime()
    document_list = person_value.Base_getRelatedDocumentListV1()
    print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400)
    print 'Number of document found %s' % len(document_list)
    
    start_time = DateTime()
    document_list = person_value.Base_getRelatedDocumentList()
    print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400)
    print 'Number of document found %s' % len(document_list)

    Ran on a Person with 1 Embedded File:

    Query took in sec: 0.014952
    Number of document found 1
    Query took in sec: 8.285919
    Number of document found 1
    Query took in sec: 7.622406
    Number of document found 0

    Ran on a Person with 1 follow_up File:

    Query took in sec: 0.023403
    Number of document found 1
    Query took in sec: 0.009755
    Number of document found 1
    Query took in sec: 7.64729
    Number of document found 1

    So Base_getRelatedDocumentListV1 seems slow when we have Embedded File and we do the complet query. Base_getRelatedDocumentListV2 seems more robust in both cases.

  • Ah yes, with left join it is much slower. I also tried something like

      document_query = ComplexQuery(
        Query(follow_up_uid=context.getUid()),
        ComplexQuery(
           Query(uid=[x.uid for x in embedded_document_list]),
           Query(follow_up_uid=None),
           logical_operator='and',
        ),
        logical_operator='or',
      )
      kw['left_join_list'] = ['follow_up_uid']

    but it's not better, so checking first for embedded_document_list was not a good idea and checking first for follow_up_related_document_list is better. Thanks for trying !

  • mentioned in merge request !1373 (merged)

    Toggle commit list
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment