- 20 May, 2002 15 commits
-
Tim Peters authored
well check it in. This yields an overall 133% speedup on a "hot" search for 'python' in my python-dev archive (a word that appears in all but 2 documents). For those who read the email: it turned out to be a significant speedup to iterate over an IIBTree's items rather than to materialize the items into an explicit list first. This is now within 20% of simply doing "IIBucket(the_IIBTree)" (i.e., no arithmetic at all), so little room remains for speeding up the inner score loop.
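The iteration speedup described above can be sketched in a few lines. This is an illustrative example, not the ZCTextIndex code: a plain dict stands in for the IIBTree (it exposes the same `.items()` protocol), and the function names are hypothetical.

```python
# Hypothetical sketch of the change: score by iterating the mapping's
# items directly instead of materializing them into a list first.

def score_via_list(wordinfo, weight):
    # builds a full intermediate list of (docid, freq) pairs first
    result = {}
    for docid, freq in list(wordinfo.items()):
        result[docid] = freq * weight
    return result

def score_via_iteration(wordinfo, weight):
    # walks the items lazily; no temporary list is created
    result = {}
    for docid, freq in wordinfo.items():
        result[docid] = freq * weight
    return result
```

Both produce the same scores; the second avoids allocating and filling a throwaway list, which matters when the word appears in nearly every document.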
-
Guido van Rossum authored
creating it anonymously and then pulling it out of the zc_index object.
-
Guido van Rossum authored
once we have more than one on the menu.)
-
Guido van Rossum authored
in percentages; strip the percent sign to avoid a traceback calling int() when these variables are used.
-
Guido van Rossum authored
I'm unclear whether this is really the right thing, but at least this prevents crashes when nothing is entered in the search box.
-
Guido van Rossum authored
_fieldname; simply return 0 in this case.
-
Guido van Rossum authored
is *disabled*.
-
Guido van Rossum authored
-
Guido van Rossum authored
-
Guido van Rossum authored
Fix typo in docstring.
-
Guido van Rossum authored
- Rephrased the description of the grammar, pointing out that the lexicon decides on globbing syntax.
- Refactored term and atom parsing (moving atom parsing into a separate method). The previously checked-in version accidentally accepted some invalid forms like ``foo AND -bar''; this is fixed.

tests/testQueryParser.py:
- Each test is now in a separate method; this produces more output (alas) but makes pinpointing the errors much simpler.
- Added some tests catching ``foo AND -bar'' and similar.
- Added an explicit test class for the handling of stopwords. The "and/" test no longer has to check self.__class__.
- Some refactoring of the TestQueryParser class; the utility methods are now in a base class TestQueryParserBase, in a different order; compareParseTrees() now shows the parse tree it got when raising an exception. The parser is now self.parser instead of self.p (see below).

tests/testZCTextIndex.py:
- setUp() no longer needs to assign to self.p; the parser is consistently called self.parser now.
-
Guido van Rossum authored
:-)
-
Guido van Rossum authored
-
Guido van Rossum authored
ILexicon.py:
- Added parseTerms() and isGlob().
- Added get_word(), get_wid() (get_word() is old; get_wid() for symmetry).
- Reflowed some text.

IQueryParser.py:
- Expanded docs for parseQuery().
- Added getIgnored() and parseQueryEx().

IPipelineElement.py:
- Added processGlob().

Lexicon.py:
- Added parseTerms() and isGlob().
- Added get_wid().
- Some pipeline elements now support processGlob().

ParseTree.py:
- Clarified the error message for calling executeQuery() on a NotNode.

QueryParser.py (lots of changes):
- Change private names __tokens etc. into protected _tokens etc.
- Add getIgnored() and parseQueryEx() methods.
- The atom parser now uses the lexicon's parseTerms() and isGlob() methods.
- Query parts that consist only of stopwords (as determined by the lexicon), or of stopwords and negated terms, yield None instead of a parse tree node; the ignored term is added to self._ignored. None is ignored when combining terms for AND/OR/NOT operators, and when an operator has no non-None operands, the operator itself returns None. When this None percolates all the way to the top, the parser raises a ParseError exception.

tests/testQueryParser.py:
- Changed test expressions of the form "a AND b AND c" to "aa AND bb AND cc" so that the terms won't be considered stopwords.
- The test for "and/" can only work for the base class.

tests/testZCTextIndex.py:
- Added copyright notice.
- Refactor testStopWords() to have two helpers, one for success, one for failures.
- Change testStopWords() to require parser failure for those queries that have only stopwords or stopwords plus negated terms.
- Improve compareSet() to sort the sets of keys, and use a more direct way of extracting the keys. This wasn't strictly needed (nothing fails without this), but the old approach of copying the keys into a dict in a loop depends on the dict hashing to always return keys in the same order.
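The "None percolates to the top" rule in the QueryParser changes above can be illustrated with a toy sketch. This is not the real QueryParser; the names and the tuple parse-tree representation are made up for the example.

```python
# Toy sketch of the stopword rule: operators drop None operands
# (stop-word clauses), an operator with no operands left becomes
# None itself, and a final None at the top raises ParseError.

class ParseError(Exception):
    pass

def combine_and(nodes):
    # drop None operands; if nothing survives, the AND node is None too
    kept = [n for n in nodes if n is not None]
    if not kept:
        return None
    return ("AND", kept)

def parse(clauses):
    # top level: a query that reduced entirely to None is an error
    tree = combine_and(clauses)
    if tree is None:
        raise ParseError("query consists only of stopwords")
    return tree
```

So `parse(["aa", None])` quietly drops the stop-word clause, while `parse([None, None])` raises ParseError, matching the behavior the tests above now require.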
-
Matt Behrens authored
guido@. When/if merge day comes for the installer, this will make for less confusion :-)
-
- 19 May, 2002 6 commits
-
Tim Peters authored
display the search time in milliseconds too.
-
Tim Peters authored
for start and end of run. Show elapsed wall-clock time in minutes.
-
Tim Peters authored
msgs to display). Changed the module docstring to separate the index-generation args from the query args.
-
Tim Peters authored
-
Tim Peters authored
original doc text gets restored.
-
Guido van Rossum authored
-
- 18 May, 2002 5 commits
-
Tim Peters authored
went wrong if they fail.
-
Tim Peters authored
-
Tim Peters authored
me to call it braindead <wink>).
-
Tim Peters authored
PACK_INTERVAL.
-
Tim Peters authored
-
- 17 May, 2002 14 commits
-
Tim Peters authored
uncomment the test cases that were failing in these contexts. Read it and weep <wink>:

In an AND context, None is treated like the universal set, which jibes with the convenient fiction that stop words appear in every doc. However, in AND NOT and OR contexts, None is treated like the empty set, which doesn't jibe with anything except that we want

    real_word AND NOT stop_word

and

    real_word OR stop_word

to act like

    real_word

If we treated None as if it were the universal set, these results would be (respectively) the empty set and the universal set instead.

At a higher level, we *are* consistent with the notion that a query with a stop word acts the same as if the clause with the stop word weren't present. That's what really drives this schizophrenic (context-dependent) treatment of None.
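The context-dependent treatment of None described above can be written out as a small sketch. This is not the ZCTextIndex code: Python sets stand in for result sets, None marks a stop-word clause, and the function names are illustrative.

```python
# Sketch of the context-dependent None semantics: None acts like the
# universal set under AND, but like the empty set under AND NOT and OR.

def and_(left, right):
    # AND: a stop word "matches every doc", so the other side wins
    if left is None:
        return right
    if right is None:
        return left
    return left & right

def and_not(left, right):
    # AND NOT: a stop word on the right removes nothing
    if right is None:
        return left
    return left - right

def or_(left, right):
    # OR: a stop word contributes nothing
    if left is None:
        return right
    if right is None:
        return left
    return left | right
```

With `real = {1, 2, 3}` standing in for real_word's result set, `and_(real, None)`, `and_not(real, None)`, and `or_(real, None)` all return `real`, which is exactly the "acts like real_word" behavior the commit describes.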
-
Jeremy Hylton authored
-
Tim Peters authored
empty.
-
Tim Peters authored
-
Tim Peters authored
tests that currently fail are commented out. Key question: if someone does a search on a stopword, and nothing else is in the query, what do we want to do? Return all docs in a random order? Return no docs? Raise an exception? Second question: what about a query on rare_word AND NOT stop_word?
-
Jeremy Hylton authored
do the same query and work as ZCTextIndex would do. Produce a result set, pump it into NBest, and extract the 10 best.
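The "pump the result set into NBest and extract the 10 best" step above can be sketched with the standard library. This is a hedged stand-in for ZCTextIndex's NBest class, using `heapq.nlargest`; the function name is made up.

```python
# Stand-in for the NBest step: given a result set mapping
# docid -> score, keep only the 10 highest-scoring documents.
import heapq

def ten_best(result_set):
    # nlargest returns the pairs sorted by score, best first
    return heapq.nlargest(10, result_set.items(), key=lambda kv: kv[1])
```

An NBest-style accumulator can do this incrementally with bounded memory; `nlargest` shows the same selection in one call.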
-
Tim Peters authored
-
Jeremy Hylton authored
but works with a TextIndex Lexicon.
-
Tim Peters authored
the index knows about the doc and the wid. _del_wordinfo and _add_wordinfo: s/map/doc2score/g. map is a builtin function, and it's needlessly confusing to give a variable that name too.
-
Tim Peters authored
-
Jeremy Hylton authored
-
Jeremy Hylton authored
I think that the default Lexicon for TextIndex does not use a stop word list. For the comparison with ZCTextIndex, explicitly pass the default stop word dict from TextIndex to the lexicon.
-
Jeremy Hylton authored
-
Jeremy Hylton authored
In unindex_doc(), call _del_wordinfo() for each unique wid in the doc, not for each wid. Before we had WidCode and phrase searching, _docwords stored a list of the unique wids. The unindex code wasn't updated when _docwords started storing all the wids, even duplicates.

Replace the try/except around __getitem__ in _add_wordinfo() with a .get() call.

Add an XXX comment about the purpose of the try/excepts in _del_wordinfo(); I suspect they existed only because _del_wordinfo() was called repeatedly when a wid occurred more than once.
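The unique-wid fix above can be sketched in miniature. This is an assumed, simplified version of the data structures (plain dicts for _docwords and the per-wid doc2score mappings), not the real index code.

```python
# Simplified sketch of the fix: unindex each *unique* wid once, even
# though docwords now stores every wid occurrence, duplicates included.

def unindex_doc(docwords, wordinfo, docid):
    for wid in set(docwords[docid]):        # unique wids only
        doc2score = wordinfo.get(wid)       # .get() instead of try/except
        if doc2score is None:
            continue
        doc2score.pop(docid, None)          # forget this doc's score
        if not doc2score:                   # no docs left for this wid
            del wordinfo[wid]
    del docwords[docid]
```

Without the `set()`, a wid that occurs twice in the doc would be deleted twice, which is exactly the situation the old try/excepts papered over.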
-