engine
Pythonic wrapper around PyLucene search engine.
Provides high-level interfaces to indexes and documents,
abstracting away java lucene primitives.
indexers
Wrappers for lucene Index{Read,Search,Writ}ers.
The final Indexer class exposes a high-level Searcher and Writer.
TokenFilter
-
class lupyne.engine.indexers.TokenFilter(input)[source]
Create an iterable lucene TokenFilter from a TokenStream.
Subclass and override incrementToken().
Attributes are cached as properties to create a Token interface.
-
incrementToken()[source]
Advance to next token and return whether the stream is not empty.
-
offset[source]
Start and stop character offset.
-
payload[source]
Payload bytes.
-
positionIncrement[source]
Position relative to the previous token.
-
term[source]
Term text.
-
type[source]
Lexical type.
Analyzer
-
class lupyne.engine.indexers.Analyzer(tokenizer, *filters)[source]
Return a lucene Analyzer which chains together a tokenizer and filters.
Parameters: |
- tokenizer – lucene Tokenizer or Analyzer
- filters – lucene TokenFilters
|
-
parse(query, field='', op='', version='', parser=None, **attrs)[source]
Return parsed lucene Query.
Parameters: |
- query – query string
- field – default query field name, sequence of names, or boost mapping
- op – default query operator (‘or’, ‘and’)
- version – lucene Version string
- parser – custom PythonQueryParser class
- attrs – additional attributes to set on the parser
|
-
tokens(text, field=None)[source]
Return lucene TokenStream from text.
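As a rough illustration of what chaining a tokenizer and filters means, the sketch below builds the same kind of pipeline from plain Python generators. It does not use lucene or lupyne; the names chain, lowercase, and remove_stopwords are hypothetical and exist only for this example.

```python
def lowercase(tokens):
    """Filter stage: normalize every token to lower case."""
    for token in tokens:
        yield token.lower()


def remove_stopwords(tokens, stopwords=frozenset({"a", "an", "the"})):
    """Filter stage: drop common words."""
    for token in tokens:
        if token not in stopwords:
            yield token


def chain(tokenizer, *filters):
    """Return an analyzer function that feeds the tokenizer through each filter in order."""
    def analyze(text):
        stream = tokenizer(text)
        for token_filter in filters:
            stream = token_filter(stream)
        return stream
    return analyze


# str.split stands in for a whitespace tokenizer.
analyzer = chain(str.split, lowercase, remove_stopwords)
print(list(analyzer("The Quick brown Fox")))  # ['quick', 'brown', 'fox']
```

Filter order matters: placing remove_stopwords before lowercase would let "The" through, because the comparison would run before normalization.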
IndexReader
-
class lupyne.engine.indexers.IndexReader(reader)[source]
Delegated lucene IndexReader, with a mapping interface of ids to document objects.
Parameters: | reader – lucene IndexReader |
-
__len__()[source]
-
__contains__(id)[source]
-
__iter__()[source]
-
__getitem__(id)[source]
-
comparator(name, type='string', parser=None)[source]
Return cache of field values suitable for sorting.
Parsing values into an array optimizes for memory.
Mapping values into a list optimizes for speed.
Parameters: |
- name – field name
- type – type object or name compatible with FieldCache
- parser – lucene FieldCache.Parser or callable applied to field values
|
-
copy(dest, query=None, exclude=None, optimize=False)[source]
Copy the index to the destination directory.
Optimized to use hard links if the destination is a file system path.
Parameters: |
- dest – destination directory path or lucene Directory
- query – optional lucene Query to select documents
- exclude – optional lucene Query to exclude documents
- optimize – optionally optimize destination index
|
-
count(name, value)[source]
Return number of documents with given term.
-
directory[source]
reader’s lucene Directory
-
docs(name, value, counts=False)[source]
Generate doc ids which contain given term, optionally with frequency counts.
-
morelikethis(doc, *fields, **attrs)[source]
Return MoreLikeThis query for document.
Parameters: |
- doc – document id or text
- fields – document fields to use, optional for termvectors
- attrs – additional attributes to set on the morelikethis object
|
-
names(option='all')[source]
Return field names, given option description.
-
numbers(name, step=0, type=int, counts=False)[source]
Generate decoded numeric term values, optionally with frequency counts.
Parameters: |
- name – field name
- step – precision step to select terms
- type – int or float
- counts – include frequency counts
|
-
overlap(left, right)[source]
Return intersection count of cached filters.
-
positions(name, value, payloads=False)[source]
Generate doc ids and positions which contain given term, optionally only with payloads.
-
positionvector(id, field, offsets=False)[source]
Generate terms and positions for given doc id and field, optionally with character offsets.
-
segments[source]
segment filenames with document counts
-
spans(query, positions=False, payloads=False)[source]
Generate docs with occurrence counts for a span query.
Parameters: |
- query – lucene SpanQuery
- positions – optionally include slice positions instead of counts
- payloads – optionally only include slice positions with payloads
|
-
terms(name, value='', stop=None, counts=False, **fuzzy)[source]
Generate a slice of term values, optionally with frequency counts.
Supports a range of terms, wildcard terms, or fuzzy terms.
Parameters: |
- name – field name
- value – initial term text or wildcard
- stop – optional upper bound for simple terms
- counts – include frequency counts
- fuzzy – optional keyword arguments for fuzzy terms
|
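The range and wildcard modes of terms() can be pictured with a plain sorted list standing in for a field's term dictionary. This is only a sketch under that assumption: term_slice is a hypothetical helper, the upper bound is treated as exclusive, and fnmatch patterns are used as an approximation of lucene wildcards.

```python
import bisect
import fnmatch


def term_slice(terms, value="", stop=None):
    """Return terms from a sorted list, by wildcard pattern or by half-open range."""
    if "*" in value or "?" in value:
        # Wildcard mode: keep every term matching the pattern.
        return [term for term in terms if fnmatch.fnmatchcase(term, value)]
    # Range mode: binary search for the first term >= value and the first term >= stop.
    start = bisect.bisect_left(terms, value)
    end = len(terms) if stop is None else bisect.bisect_left(terms, stop)
    return terms[start:end]


terms = ["apple", "apricot", "banana", "cherry", "date"]
print(term_slice(terms, "b", "d"))  # ['banana', 'cherry']
print(term_slice(terms, "ap*"))     # ['apple', 'apricot']
```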
-
termvector(id, field, counts=False)[source]
Generate terms for given doc id and field, optionally with frequency counts.
-
timestamp[source]
timestamp of reader’s last commit
IndexSearcher
-
class lupyne.engine.indexers.IndexSearcher(directory, analyzer=None)[source]
Bases: IndexSearcher, lupyne.engine.indexers.IndexReader
Inherited lucene IndexSearcher, with a mixed-in IndexReader.
Parameters: |
- directory – directory path, lucene Directory, or lucene IndexReader
- analyzer – lucene Analyzer, default StandardAnalyzer
|
-
__getitem__(id)[source]
Return Document.
-
__del__()[source]
Closes index.
-
filters
Mapping of cached filters by field, which are used for facet counts.
-
sorters
Mapping of cached sorters by field and associated parsers.
-
spellcheckers
Mapping of cached spellcheckers by field.
-
comparator(field, type='string', parser=None)[source]
Return IndexReader.comparator() using a cached SortField if available.
-
correct(field, text, distance=2, minSimilarity=0.5)[source]
Generate potential words ordered by increasing edit distance and decreasing frequency.
For optimal performance only iterate the required slice size of corrections.
Parameters: |
- distance – the maximum edit distance to consider for enumeration
- minSimilarity – threshold for additional fuzzy terms after edits have been exhausted
|
-
count(*query, **options)[source]
Return number of hits for given query or term.
Parameters: |
- query – search() compatible query, or optimally a name and value
- options – additional search() options
|
-
distances(lng, lat, lngfield, latfield)[source]
Return distance comparator computed from cached lat/lng fields.
-
facets(query, *keys)[source]
Return mapping of document counts for the intersection with each facet.
Parameters: |
- query – query string, lucene Query, or lucene Filter
- keys – field names, term tuples, or any keys to previously cached filters
|
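Conceptually, a facet count is the size of the intersection between the documents matching the query and the documents accepted by each cached filter. The sketch below models both as Python sets of doc ids; facet_counts and the sample filters are hypothetical and do not reflect how lupyne stores its filters.

```python
def facet_counts(matches, filters):
    """Count how many matching doc ids fall inside each cached filter."""
    matches = set(matches)
    return {key: len(matches & ids) for key, ids in filters.items()}


# Hypothetical cached filters keyed by (field, value) term tuples.
filters = {
    ("lang", "en"): {0, 1, 2, 5},
    ("lang", "fr"): {3, 4},
    ("year", "2010"): {1, 3, 5},
}
print(facet_counts([1, 2, 3], filters))
# {('lang', 'en'): 2, ('lang', 'fr'): 1, ('year', '2010'): 2}
```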
-
get(id, *fields)[source]
Return Document with only selected fields loaded.
-
highlighter(query, field, **kwargs)[source]
Return a Highlighter or, if applicable, a FastVectorHighlighter specific to the searcher and query.
-
classmethod load(directory, analyzer=None)[source]
Open IndexSearcher with a lucene RAMDirectory, loading index into memory.
-
match(document, *queries)[source]
Generate scores for all queries against a given document mapping.
-
reopen(filters=False, sorters=False, spellcheckers=False)[source]
Return current IndexSearcher, only creating a new one if necessary.
Parameters: |
- filters – refresh cached facet filters
- sorters – refresh cached sorters with associated parsers
- spellcheckers – refresh cached spellcheckers
|
-
search(query=None, filter=None, count=None, sort=None, reverse=False, scores=False, maxscore=False, timeout=None, **parser)[source]
Run query and return Hits.
Parameters: |
- query – query string or lucene Query
- filter – lucene Filter
- count – maximum number of hits to retrieve
- sort – if count is given, lucene Sort parameters, else a callable key
- reverse – reverse flag used with sort
- scores – compute scores for candidate results when using lucene Sort
- maxscore – compute maximum score of all results when using lucene Sort
- timeout – stop search after elapsed number of seconds
- parser – Analyzer.parse() options
|
-
sorter(field, type='string', parser=None, reverse=False)[source]
Return SortField with cached attributes if available.
-
spellchecker(field)[source]
Return and cache spellchecker for given field.
-
suggest(field, prefix, count=None)[source]
Return ordered suggested words for prefix.
MultiSearcher
-
class lupyne.engine.indexers.MultiSearcher(reader, analyzer=None)[source]
Bases: lupyne.engine.indexers.IndexSearcher
IndexSearcher with underlying lucene MultiReader.
Parameters: |
- reader – directory paths, Directories, IndexReaders, or a single MultiReader
- analyzer – lucene Analyzer, default StandardAnalyzer
|
IndexWriter
-
class lupyne.engine.indexers.IndexWriter(directory=None, mode='a', analyzer=None, version=None)[source]
Bases: IndexWriter
Inherited lucene IndexWriter.
Supports setting fields parameters explicitly, so documents can be represented as dictionaries.
Parameters: |
- directory – directory path or lucene Directory, default RAMDirectory
- mode – file mode (rwa), except updating (+) is implied
- analyzer – lucene Analyzer, default StandardAnalyzer
- version – lucene Version argument passed to IndexWriterConfig or StandardAnalyzer, default is latest
|
-
fields
Mapping of assigned fields. May be used directly, instead of set() method, for further customization.
-
__del__()
Closes index.
-
__len__()
-
__iadd__(directory)[source]
Add directory (or reader, searcher, writer) to index.
-
add(document=(), **terms)[source]
Add document() to index with optional boost.
-
delete(*query, **options)[source]
Remove documents which match given query or term.
-
document(document=(), boost=1.0, **terms)[source]
Return lucene Document from mapping of field names to one or multiple values.
-
set(name, cls=lupyne.engine.documents.Field, **params)[source]
Assign parameters to field name.
Parameters: |
- name – registered name of field
- cls – optional Field constructor
- params – store,index,termvector options compatible with Field
|
-
snapshot(*args, **kwds)[source]
Return context manager of an index commit snapshot.
-
update(name, value='', document=(), **terms)[source]
Atomically delete documents which match given term and add the new document() with optional boost.
Indexer
-
class lupyne.engine.indexers.Indexer(directory=None, mode='a', analyzer=None, version=None, nrt=False)[source]
Bases: lupyne.engine.indexers.IndexWriter
An all-purpose interface to an index.
Creates an IndexWriter with a delegated IndexSearcher.
Parameters: | nrt – optionally use a near real-time searcher |
-
commit(expunge=False, optimize=False, **caches)[source]
Commit writes and refresh() searcher.
Parameters: |
- expunge – expunge deletes
- optimize – optimize index, optionally supply number of segments
|
-
refresh(**caches)[source]
Store refreshed searcher with IndexSearcher.reopen() caches.
documents
Wrappers for lucene Fields and Documents.
Document
-
class lupyne.engine.documents.Document(doc)[source]
Delegated lucene Document.
Provides mapping interface of field names to values, but duplicate field names are allowed.
-
__len__()[source]
-
__contains__(name)[source]
-
__iter__()[source]
-
__getitem__(name)[source]
-
dict(*names, **defaults)[source]
Return dict representation of document.
Parameters: |
- names – names of multi-valued fields to return as a list
- defaults – include only given fields, using default values as necessary
|
-
get(name, default=None)[source]
Return field value if present, else default.
-
getlist(name)[source]
Return list of all values for given field.
-
items()[source]
Generate name, value pairs for all fields.
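To make the "mapping with duplicate field names" behaviour concrete, here is a small pure-Python stand-in. MultiDocument is a hypothetical class written for this example; it mirrors the documented get(), getlist(), and dict() semantics as read from the descriptions above, not lupyne's actual implementation.

```python
class MultiDocument:
    """Hypothetical stand-in for a document whose field names may repeat."""

    def __init__(self, pairs):
        # Keep (name, value) pairs in insertion order; duplicates are allowed.
        self.pairs = list(pairs)

    def getlist(self, name):
        """Return every value stored under the given field name."""
        return [value for key, value in self.pairs if key == name]

    def get(self, name, default=None):
        """Return the first value if present, else the default."""
        values = self.getlist(name)
        return values[0] if values else default

    def dict(self, *names, **defaults):
        """Return a plain dict; fields listed in names are collected into lists."""
        result = dict(defaults)
        for name in names:
            result[name] = []
        selected = set(names) | set(defaults)
        for key, value in self.pairs:
            if defaults and key not in selected:
                continue  # defaults restrict the output to the selected fields
            if key in names:
                result[key].append(value)
            else:
                result[key] = value
        return result


doc = MultiDocument([("title", "Lucene"), ("tag", "search"), ("tag", "java")])
print(doc.getlist("tag"))  # ['search', 'java']
print(doc.dict("tag"))     # {'tag': ['search', 'java'], 'title': 'Lucene'}
```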
Hit
-
class lupyne.engine.documents.Hit(doc, id, score)[source]
Bases: lupyne.engine.documents.Document
A Document with an id and score, from a search result.
-
dict(*names, **defaults)[source]
Return dict representation of document with __id__ and __score__.
Hits
-
class lupyne.engine.documents.Hits(searcher, ids, scores, count=None, maxscore=None, fields=None)[source]
Search results: lazily evaluated and memory efficient.
Provides a read-only sequence interface to hit objects.
Parameters: |
- searcher – IndexSearcher which can retrieve documents
- ids – ordered doc ids
- scores – ordered doc scores
- count – total number of hits
- maxscore – maximum score
- fields – optional field selectors
|
-
__len__()[source]
-
__getitem__(index)[source]
-
groupby(func)[source]
Return ordered list of Hits grouped by value of function applied to doc ids.
-
items()[source]
Generate zipped ids and scores.
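The lazy, read-only sequence idea behind Hits can be sketched without lucene: only ids and scores are held, and a document is fetched when an index is accessed. LazyHits and its fetch callable are hypothetical names for this example; the real class takes a searcher rather than a function.

```python
class LazyHits:
    """Hypothetical read-only sequence that loads documents only on access."""

    def __init__(self, fetch, ids, scores):
        self.fetch = fetch  # callable mapping a doc id to a document
        self.ids = list(ids)
        self.scores = list(scores)

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view; no documents are fetched yet.
            return LazyHits(self.fetch, self.ids[index], self.scores[index])
        return self.fetch(self.ids[index]), self.scores[index]

    def items(self):
        """Generate zipped ids and scores without touching stored documents."""
        return zip(self.ids, self.scores)

    def groupby(self, func):
        """Return (key, LazyHits) pairs grouped by func(doc id), in first-seen order."""
        groups = {}
        for doc_id, score in self.items():
            ids, scores = groups.setdefault(func(doc_id), ([], []))
            ids.append(doc_id)
            scores.append(score)
        return [(key, LazyHits(self.fetch, ids, scores)) for key, (ids, scores) in groups.items()]


loaded = []  # records which documents were actually fetched


def fetch(doc_id):
    loaded.append(doc_id)
    return {"id": doc_id}


hits = LazyHits(fetch, [3, 1, 2], [0.9, 0.5, 0.1])
top = hits[:2]       # still nothing loaded
print(loaded)        # []
print(top[0])        # ({'id': 3}, 0.9)
```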
Field
-
class lupyne.engine.documents.Field(name, store=False, index='analyzed', termvector=False, analyzed=False, omitNorms=False, withPositions=False, withOffsets=False, **attrs)[source]
Saved parameters which can generate lucene Fields given values.
Parameters: |
- name – name of field
- store,index,termvector – field parameters, expressed as bools or strs, with lucene defaults
- analyzed,omitNorms – additional index boolean settings
- withPositions,withOffsets – additional termvector boolean settings
- attrs – additional attributes to set on the field
|
-
items(*values)[source]
Generate lucene Fields suitable for adding to a document.
NestedField
-
class lupyne.engine.documents.NestedField(name, sep='.', index=True, **kwargs)[source]
Bases: lupyne.engine.documents.Field
Field which indexes every component into its own field.
Original value may be stored for convenience.
Parameters: | sep – field separator used on name and values |
-
getname(index)[source]
Return prefix of field name.
-
items(*values)[source]
Generate indexed component fields.
-
join(words)[source]
Return text from separate words.
-
prefix(value)[source]
Return prefix query of the closest possible prefixed field.
-
range(start, stop, lower=True, upper=False)[source]
Return range query of the closest possible prefixed field.
-
split(value)[source]
Return sequence of words from name or value.
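One way to read "indexes every component into its own field" is that each prefix of the field name is paired with the same-length prefix of the value. The sketch below follows that reading; the helper names and the 'year-month-day' field are assumptions made for illustration, and the actual field naming in lupyne may differ.

```python
def prefixes(text, sep="."):
    """Return every leading run of components, shortest first."""
    words = text.split(sep)
    return [sep.join(words[:stop]) for stop in range(1, len(words) + 1)]


def component_fields(name, value, sep="."):
    """Pair each prefix of the field name with the same-length prefix of the value."""
    return list(zip(prefixes(name, sep), prefixes(value, sep)))


def closest_field(name, value, sep="."):
    """Pick the component field whose depth matches the value, capped at the deepest field."""
    names = prefixes(name, sep)
    depth = min(len(value.split(sep)), len(names))
    return names[depth - 1]


for field, text in component_fields("year-month-day", "1791-12-15", sep="-"):
    print(field, text)
# year 1791
# year-month 1791-12
# year-month-day 1791-12-15

print(closest_field("year-month-day", "1791-12", sep="-"))  # year-month
```

Under this scheme a prefix search for "1791-12" becomes an exact match on the year-month component, which is the kind of shortcut the prefix() and range() methods describe.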
NumericField
-
class lupyne.engine.documents.NumericField(name, step=None, store=False, index=True)[source]
Bases: lupyne.engine.documents.Field
Field which indexes numbers in a prefix tree.
Parameters: |
- name – name of field
- step – precision step
|
-
items(*values)[source]
Generate lucene NumericFields suitable for adding to a document.
-
range(start, stop, lower=True, upper=False)[source]
Return lucene NumericRangeQuery.
-
term(value)[source]
Return range query to match single term.
DateTimeField
-
class lupyne.engine.documents.DateTimeField(name, step=None, store=False, index=True)[source]
Bases: lupyne.engine.documents.NumericField
Field which indexes datetimes as a NumericField of timestamps.
Supports datetimes, dates, and any prefix of time tuples.
-
duration(date, days=0, **delta)[source]
Return date range query within time span of date.
Parameters: |
- date – origin date or tuple
- days,delta – timedelta parameters
|
-
items(*dates)[source]
Generate lucene NumericFields of timestamps.
-
prefix(date)[source]
Return range query which matches the date prefix.
-
range(start, stop, lower=True, upper=False)[source]
Return NumericRangeQuery of timestamps.
-
timestamp(date)[source]
Return utc timestamp from date or time tuple.
-
within(days=0, weeks=0, utc=True, **delta)[source]
Return date range query within current time and delta.
If the delta is an exact number of days, then dates will be used.
Parameters: |
- days,weeks – number of days to offset from today
- utc – optionally use utc instead of local time
- delta – additional timedelta parameters
|
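The timestamp and prefix ideas can be sketched with the standard library alone. The functions below are hypothetical stand-ins: timestamp() pads a time tuple prefix to a full UTC time, and prefix_range() returns the half-open interval that a prefix such as (1970, 12) covers. Naive datetimes are assumed to be UTC.

```python
import calendar
import datetime


def timestamp(date):
    """Return a UTC timestamp from a date, datetime, or time tuple prefix."""
    if isinstance(date, datetime.date):
        return calendar.timegm(date.timetuple())
    parts = tuple(date)
    # Pad (year,), (year, month), ... with month=1, day=1, hour=minute=second=0.
    full = parts + (1, 1, 0, 0, 0)[len(parts) - 1:]
    return calendar.timegm(full)


def prefix_range(prefix):
    """Return the half-open [start, stop) timestamp interval covered by a time tuple prefix."""
    prefix = tuple(prefix)
    upper = prefix[:-1] + (prefix[-1] + 1,)
    if len(upper) == 2 and upper[1] > 12:
        upper = (upper[0] + 1, 1)  # roll December over into the next year
    return timestamp(prefix), timestamp(upper)


print(timestamp((1970,)))                   # 0
print(timestamp(datetime.date(1970, 1, 2)))  # 86400
print(prefix_range((1970,)))                # (0, 31536000)
```

The half-open interval matches the lower=True, upper=False defaults of range(), so adjacent prefixes never overlap.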
queries
Query wrappers and search utilities.
Query
-
class lupyne.engine.queries.Query(base, *args)[source]
Inherited lucene Query, with dynamic base class acquisition.
Uses class methods and operator overloading for convenient query construction.
-
__and__(other)[source]
<BooleanQuery +self +other>
-
__or__(other)[source]
<BooleanQuery self other>
-
__sub__(other)[source]
<BooleanQuery self -other>
-
classmethod all(*queries, **terms)[source]
Return BooleanQuery (AND) from queries and terms.
-
classmethod any(*queries, **terms)[source]
Return BooleanQuery (OR) from queries and terms.
-
classmethod disjunct(multiplier, *queries, **terms)[source]
Return lucene DisjunctionMaxQuery from queries and terms.
-
filter(cache=True)[source]
Return lucene CachingWrapperFilter, optionally just QueryWrapperFilter.
-
classmethod fuzzy(name, value, minimumSimilarity=0.5, prefixLength=0)[source]
Return lucene FuzzyQuery.
-
classmethod multiphrase(name, *values)[source]
Return lucene MultiPhraseQuery. None may be used as a placeholder.
-
classmethod near(name, *values, **kwargs)[source]
Return SpanNearQuery from terms.
Term values which supply another field name will be masked.
-
classmethod phrase(name, *values)[source]
Return lucene PhraseQuery. None may be used as a placeholder.
-
classmethod prefix(name, value)[source]
Return lucene PrefixQuery.
-
classmethod range(name, start, stop, lower=True, upper=False)[source]
Return lucene RangeQuery, by default with a half-open interval.
-
classmethod span(*term)[source]
Return SpanQuery from term name and value or a MultiTermQuery.
-
classmethod term(name, value)[source]
Return lucene TermQuery.
-
terms()[source]
Generate set of query term items.
-
classmethod wildcard(name, value)[source]
Return lucene WildcardQuery.
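The operator overloading described for Query can be illustrated with a tiny class that renders lucene's textual query syntax instead of building real Query objects. Clause is a hypothetical name, and the string output is only a way to show which clauses end up required (+), optional, or prohibited (-).

```python
class Clause:
    """Hypothetical query node rendered in lucene's textual query syntax."""

    def __init__(self, text):
        self.text = text

    @classmethod
    def term(cls, name, value):
        return cls("%s:%s" % (name, value))

    @classmethod
    def range(cls, name, start, stop, lower=True, upper=False):
        # Square brackets are inclusive bounds, curly braces exclusive ones.
        left = "[" if lower else "{"
        right = "]" if upper else "}"
        return cls("%s:%s%s TO %s%s" % (name, left, start, stop, right))

    def _combine(self, other, left, right):
        return Clause("(%s%s %s%s)" % (left, self.text, right, other.text))

    def __and__(self, other):
        return self._combine(other, "+", "+")  # both clauses required

    def __or__(self, other):
        return self._combine(other, "", "")  # either clause may match

    def __sub__(self, other):
        return self._combine(other, "", "-")  # second clause prohibited


query = (Clause.term("text", "lucene") & Clause.term("lang", "en")) - Clause.term("status", "draft")
print(query.text)                               # ((+text:lucene +lang:en) -status:draft)
print(Clause.range("year", 2009, 2011).text)    # year:[2009 TO 2011}
```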
SpanQuery
-
class lupyne.engine.queries.SpanQuery(base, *args)[source]
Bases: lupyne.engine.queries.Query
Inherited lucene SpanQuery with additional span constructors.
-
__getitem__(slc)[source]
<SpanFirstQuery: spanFirst(self, other.stop)>
-
__sub__(other)[source]
<SpanNotQuery: spanNot(self, other)>
-
__or__(*spans)[source]
<SpanOrQuery: spanOr(spans)>
-
mask(name)[source]
Return lucene FieldMaskingSpanQuery, which allows combining span queries from different fields.
-
near(*spans, **kwargs)[source]
Return lucene SpanNearQuery from SpanQueries.
Parameters: |
- slop – default 0
- inOrder – default True
- collectPayloads – default True
|
-
payload(*values)[source]
Return lucene SpanPayloadCheckQuery from payload values.
SortField
-
class lupyne.engine.queries.SortField(name, type='string', parser=None, reverse=False)[source]
Bases: SortField
Inherited lucene SortField used for caching FieldCache parsers.
Parameters: |
- name – field name
- type – type object or name compatible with SortField constants
- parser – lucene FieldCache.Parser or callable applied to field values
- reverse – reverse flag used with sort
|
-
comparator(reader)[source]
Return indexed values from default FieldCache using the given reader.
Highlighter
-
class lupyne.engine.queries.Highlighter(searcher, query, field, terms=False, fields=False, tag='', formatter=None, encoder=None)[source]
Bases: Highlighter
Inherited lucene Highlighter with stored analysis options.
Parameters: |
- searcher – IndexSearcher used for analysis, scoring, and optionally text retrieval
- query – lucene Query
- field – field name of text
- terms – highlight any matching term in query regardless of position
- fields – highlight matching terms from any field
- tag – optional html tag name
- formatter – optional lucene Formatter
- encoder – optional lucene Encoder
|
-
fragments(doc, count=1)[source]
Return highlighted text fragments.
Parameters: |
- doc – text string or doc id to be highlighted
- count – maximum number of fragments
|
FastVectorHighlighter
-
class lupyne.engine.queries.FastVectorHighlighter(searcher, query, field, terms=False, fields=False, tag='', fragListBuilder=None, fragmentsBuilder=None)[source]
Inherited lucene FastVectorHighlighter with stored query.
Fields must be stored and have term vectors with offsets and positions.
Parameters: |
- searcher – IndexSearcher with stored term vectors
- query – lucene Query
- field – field name of text
- terms – highlight any matching term in query regardless of position
- fields – highlight matching terms from any field
- tag – optional html tag name
- fragListBuilder – optional lucene FragListBuilder
- fragmentsBuilder – optional lucene FragmentsBuilder
|
-
fragments(id, count=1, size=100)[source]
Return highlighted text fragments.
Parameters: |
- id – document id
- count – maximum number of fragments
- size – maximum number of characters in fragment
|
SpellChecker
-
class lupyne.engine.queries.SpellChecker(*args, **kwargs)[source]
Bases: dict
Correct spellings and suggest words for queries.
Supply a vocabulary mapping words to (reverse) sort keys, such as document frequencies.
-
correct(word)[source]
Generate ordered sets of words by increasing edit distance.
-
edits(word, length=0)[source]
Return set of potential words one edit distance away, mapped to valid prefix lengths.
-
suggest(prefix, count=None)[source]
Return ordered suggested words for prefix.
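A minimal sketch of the same idea, assuming a vocabulary that maps lowercase ASCII words to document frequencies. Vocabulary, neighbors, and ranked are hypothetical names; the edit model (delete, transpose, replace, insert) is a common choice for this kind of corrector and is not confirmed to be the one lupyne uses.

```python
import string


class Vocabulary(dict):
    """Hypothetical mapping of known words to frequencies."""

    @staticmethod
    def neighbors(word):
        """Return every lowercase ASCII string exactly one edit away from word."""
        results = set()
        for index in range(len(word) + 1):
            left, right = word[:index], word[index:]
            if right:
                results.add(left + right[1:])  # deletion
            if len(right) > 1:
                results.add(left + right[1] + right[0] + right[2:])  # transposition
            for char in string.ascii_lowercase:
                if right:
                    results.add(left + char + right[1:])  # replacement
                results.add(left + char + right)  # insertion
        results.discard(word)
        return results

    def ranked(self, words):
        """Order known words by decreasing frequency, then alphabetically."""
        return sorted((w for w in words if w in self), key=lambda w: (-self[w], w))

    def edits(self, word):
        """Return the set of known words one edit away."""
        return set(self.ranked(self.neighbors(word)))

    def correct(self, word, distance=2):
        """Generate ranked lists of known words, nearest edit distance first."""
        seen = frontier = {word}
        for _ in range(distance):
            frontier = {edit for item in frontier for edit in self.neighbors(item)} - seen
            seen = seen | frontier
            known = self.ranked(frontier)
            if known:
                yield known

    def suggest(self, prefix, count=None):
        """Return known words starting with prefix, most frequent first."""
        return self.ranked(word for word in self if word.startswith(prefix))[:count]


vocab = Vocabulary(search=10, sea=7, searcher=5, march=3)
print(vocab.suggest("sea", 2))        # ['search', 'sea']
print(next(vocab.correct("serch")))   # ['search']
```

Because correct() is a generator, a caller that only needs the nearest corrections never pays for the much larger distance-two enumeration.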
SpellParser
-
class lupyne.engine.queries.SpellParser[source]
Inherited lucene QueryParser which corrects spelling.
Assign a searcher attribute or override correct() implementation.
-
searcher
IndexSearcher
-
correct(term)[source]
Return term with text replaced as necessary.
-
rewrite(query)[source]
Return term or phrase query with corrected terms substituted.
spatial
Geospatial fields.
Latitude/longitude coordinates are encoded into the quadkeys of MS Virtual Earth,
which are also compatible with Google Maps and OSGEO Tile Map Service.
See http://www.maptiler.org/google-maps-coordinates-tile-bounds-projection/.
The quadkeys are then indexed using a prefix tree, creating a cartesian tier of tiles.
Tiler
-
class lupyne.engine.spatial.Tiler(tileSize=256)[source]
Utilities for transforming lat/lngs, projected coordinates, and tile coordinates.
-
coords(tile)[source]
Return TMS coordinates of tile.
-
decode(tile)[source]
Return lat/lng bounding box (bottom, left, top, right) of tile.
-
encode(lat, lng, precision)[source]
Return tile from latitude, longitude and precision level.
-
project(lat, lon)
Convert the given lat/lon in WGS84 datum to XY in Spherical Mercator EPSG:900913.
-
radiate(lat, lng, distance, precision, limit=inf)[source]
Generate tile keys within distance of given point, adjusting precision to limit the number considered.
-
walk(left, bottom, right, top, precision)[source]
Generate tile keys which span bounding box of meters.
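For readers unfamiliar with quadkeys, the sketch below shows the standard Spherical Mercator projection and the usual tile-to-quadkey interleaving (x bit plus twice the y bit per zoom level). These free functions are illustrative and written from the publicly documented tile scheme; they are assumptions about the general technique, not copies of the Tiler methods.

```python
import math

ORIGIN_SHIFT = math.pi * 6378137.0  # half the projected world width, in meters


def project(lat, lng):
    """Convert WGS84 lat/lng in degrees to Spherical Mercator meters."""
    x = lng * ORIGIN_SHIFT / 180.0
    y = math.log(math.tan((90.0 + lat) * math.pi / 360.0)) * ORIGIN_SHIFT / math.pi
    return x, y


def encode(lat, lng, precision):
    """Return the quadkey string of the tile containing the point at the given zoom level."""
    size = 1 << precision  # number of tiles along each axis
    x = int((lng + 180.0) / 360.0 * size)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * size)
    # Clamp points on the far edges back into the grid.
    x = min(max(x, 0), size - 1)
    y = min(max(y, 0), size - 1)
    digits = []
    for bit in range(precision - 1, -1, -1):
        # Each digit interleaves one bit of x with one bit of y, most significant first.
        digits.append(str(((x >> bit) & 1) + 2 * ((y >> bit) & 1)))
    return "".join(digits)


coarse = encode(47.6, -122.3, 4)
fine = encode(47.6, -122.3, 10)
print(fine.startswith(coarse))  # True
```

The prefix property in the last line is what makes a prefix tree over quadkeys useful: every finer tile key begins with the key of the coarser tile that contains it.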
PointField
-
class lupyne.engine.spatial.PointField(name, precision=30, **kwargs)[source]
Bases: lupyne.engine.documents.NumericField, lupyne.engine.spatial.Tiler
Geospatial points, which create a tiered index of tiles.
Points must still be stored if exact distances are required upon retrieval.
Parameters: | precision – zoom level, i.e., length of encoded value |
-
filter(lng, lat, distance, lngfield, latfield, limit=4)[source]
Return lucene LatLongDistanceFilter based on within() query.
Alternative distance comparator which requires the spatial contrib module.
Distance estimates may be more accurate, and performance may vary.
-
items(*points)[source]
Generate tiles from points (lng, lat).
-
near(lng, lat, precision=None)[source]
Return prefix query for point at given precision.
-
prefix(tile)[source]
Return range query which is equivalent to the prefix of the tile.
-
within(lng, lat, distance, limit=4)[source]
Return range queries for any tiles which could be within distance of given point.
Parameters: |
- lng,lat – point
- distance – search radius in meters
- limit – maximum number of tiles to consider
|
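The equivalence between a tile prefix and a numeric range can be worked out by hand if tiles are treated as fixed-length base-4 numbers. That encoding is an assumption made for this sketch (suggested by PointField deriving from NumericField and by precision being the length of the encoded value); tile_range is a hypothetical helper.

```python
def tile_range(tile, precision=30):
    """Return the half-open integer interval [start, stop) of full-length keys sharing a prefix."""
    shift = precision - len(tile)
    if shift < 0:
        raise ValueError("tile is longer than the field precision")
    # Padding the prefix with zeros gives the smallest key below it;
    # the block of keys under one prefix has 4 ** shift members.
    start = int(tile, 4) * 4 ** shift
    return start, start + 4 ** shift


start, stop = tile_range("213", precision=5)
print(start, stop)                       # 624 640
print(start <= int("21302", 4) < stop)   # True
print(start <= int("22000", 4) < stop)   # False
```

A radius search can then be expressed as a handful of such ranges, one per candidate tile, which is why within() accepts a limit on the number of tiles considered.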
PolygonField
-
class lupyne.engine.spatial.PolygonField(name, precision=30, **kwargs)[source]
Bases: lupyne.engine.spatial.PointField
PointField which implicitly supports polygons (technically linear rings of points).
Differs from points in that all necessary tiles are included to match the points’ boundary.
As with PointField, the tiered tiles are a search optimization, not a distance calculator.
-
items(*polygons)[source]
Generate all covered tiles from polygons.
DistanceFilter
-
class lupyne.engine.spatial.DistanceFilter(filter, lng, lat, distance, lngfield, latfield)[source]
Inherited lucene LatLongDistanceFilter which supports the comparator interface.
-
__getitem__(id)[source]
-
sorter(reverse=False)[source]
Return lucene SortField based on the filter’s cached distances.