Package org.apache.lucene.document
The logical representation of a Document for indexing and
searching.
The document package provides the user-level logical representation of content to be indexed
and searched. The package also provides utilities for working with Documents and IndexableFields.
Document and IndexableField
A Document is a collection of IndexableFields. An IndexableField is a
logical representation of a user's content that needs to be indexed or stored. IndexableFields have a number of properties that tell Lucene how to
treat the content (such as indexed, tokenized, or stored). See the Field implementation of IndexableField for specifics on these properties.
Note: it is common to refer to Documents having Fields, even though technically they have IndexableFields.
Working with Documents
First and foremost, a Document is something created by the
user application. It is your job to create Documents based on the content of the files you are
working with in your application (Word, txt, PDF, Excel, or any other format). How this is done is
completely up to you. That being said, there are many tools available in other projects that can
simplify the process of converting a file into a Lucene Document.
How to index ...
Strings
TextField allows indexing tokens from a String so that one
can perform full-text search on it. The way that the input is tokenized depends on the Analyzer that is configured on the IndexWriterConfig. TextField can also be optionally stored.
KeywordField indexes whole values as a single term so that
one can perform exact search on it. It also records doc values to enable sorting or faceting on
this field. Finally, it also supports optionally storing the value.
If neither faceting nor sorting is required, StringField is a
variant of KeywordField that does not index doc values.
Numbers
If a numeric field represents an identifier rather than a quantity and is more commonly
searched on single values than on ranges of values, it is generally recommended to index its
string representation via KeywordField (or StringField if doc values are not necessary).
LongField, IntField,
DoubleField and FloatField
index values in a points index for efficient range queries, and also create doc-values for these
fields for efficient sorting and faceting.
If the field is intended to tune the score, FeatureField stores numeric data internally as term frequencies
in a way that makes it efficient to influence scoring at search time.
Other types of structured data
It is recommended to index dates as a LongField that stores
the number of milliseconds since the Unix epoch (1970-01-01T00:00:00Z).
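A minimal sketch, using only java.time from the JDK, of deriving the millisecond value for a date. The class and method names are illustrative and not part of Lucene; the resulting long is the value that would be supplied to the LongField.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper (not part of Lucene): converts dates to epoch milliseconds.
final class DateFieldValues {

    // Milliseconds since 1970-01-01T00:00:00Z for an exact point in time.
    static long toEpochMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    // A calendar date has no time zone; this sketch assumes midnight UTC.
    static long startOfDayUtcMillis(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long published = toEpochMillis(Instant.parse("2024-01-01T00:00:00Z"));
        System.out.println(published);
        System.out.println(startOfDayUtcMillis(LocalDate.of(2024, 1, 1)));
    }
}
```

Choosing a single time zone convention (UTC here) for date-only values keeps range queries and sorting consistent across documents.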
IP addresses can be indexed via InetAddressPoint in addition
to a SortedDocValuesField (if the field is single-valued) or
SortedSetDocValuesField that stores the result of InetAddressPoint.encode(java.net.InetAddress).
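InetAddressPoint is described as an indexed 128-bit field, so an IPv4 address needs a 16-byte form before it can share a value space with IPv6. The sketch below uses only the JDK to illustrate the standard IPv4-mapped IPv6 layout (::ffff:a.b.c.d). It is an illustration of the idea, not Lucene's code, and the class and method names are hypothetical; in an application, call InetAddressPoint.encode rather than a helper like this.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper (not part of Lucene): widens addresses to 16 bytes.
final class IpEncodingSketch {

    // Returns a 16-byte form: IPv6 bytes are copied, IPv4 bytes are IPv4-mapped.
    static byte[] toSixteenBytes(byte[] raw) {
        if (raw.length == 16) {
            return raw.clone();
        }
        if (raw.length != 4) {
            throw new IllegalArgumentException("expected 4 or 16 bytes, got " + raw.length);
        }
        byte[] mapped = new byte[16];      // bytes 0..9 stay zero
        mapped[10] = (byte) 0xFF;          // bytes 10..11 are the ::ffff: marker
        mapped[11] = (byte) 0xFF;
        System.arraycopy(raw, 0, mapped, 12, 4); // bytes 12..15 hold the IPv4 address
        return mapped;
    }

    // Convenience overload for address literals; a host name would trigger a DNS lookup.
    static byte[] toSixteenBytes(String literal) {
        try {
            return toSixteenBytes(InetAddress.getByName(literal).getAddress());
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a valid address: " + literal, e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toSixteenBytes("192.0.2.1");
        System.out.println(bytes.length);
    }
}
```

Because every address ends up with the same length, byte-wise comparison gives one consistent ordering, which is what a sorted doc values field relies on.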
Dense numeric vectors
Dense numeric vectors can be indexed with KnnFloatVectorField if their components are floating-point numbers or
KnnByteVectorField if their components are bytes. This allows
searching for nearest neighbors at search time.
Sparse numeric vectors
To perform nearest-neighbor search on sparse vectors rather than dense vectors, each dimension
of the sparse vector should be indexed as a FeatureField.
Queries can then be constructed as a BooleanQuery with linear queries as
BooleanClause.Occur.SHOULD clauses.
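Conceptually, a BooleanQuery made of linear SHOULD clauses sums, over the dimensions present in both the query and the document, the query weight times the document's feature value, which is a sparse dot product. The JDK-only sketch below models that arithmetic so the scoring behavior is easier to reason about. It does not call Lucene, its names are hypothetical, and it ignores any precision loss from how feature values are encoded in the index.

```java
import java.util.Map;

// Hypothetical model (not part of Lucene) of scoring a sparse query vector
// against a sparse document vector as a sum of per-dimension linear terms.
final class SparseDotProduct {

    // Each query dimension acts like one SHOULD clause: it contributes
    // weight * featureValue when the document has that dimension, else nothing.
    static double score(Map<String, Float> queryWeights, Map<String, Float> docFeatures) {
        double sum = 0.0;
        for (Map.Entry<String, Float> clause : queryWeights.entrySet()) {
            Float featureValue = docFeatures.get(clause.getKey());
            if (featureValue != null) {
                sum += clause.getValue() * featureValue;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> query = Map.of("a", 2f, "b", 3f);
        Map<String, Float> doc = Map.of("a", 0.5f, "c", 1f);
        System.out.println(score(query, doc)); // only dimension "a" overlaps
    }
}
```

Dimensions that appear only in the query or only in the document contribute nothing, which is why SHOULD clauses (rather than MUST) fit this use case.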
Class descriptions
Field that stores a per-document BytesRef value.
An indexed binary field for fast range filters.
A binary representation of a range that wraps a BinaryDocValues field.
Provides support for converting dates to strings and vice-versa.
Specifies the time granularity.
Documents are the unit of indexing and search.
A StoredFieldVisitor that creates a Document from stored fields.
Syntactic sugar for encoding doubles as NumericDocValues via Double.doubleToRawLongBits(double).
Field that stores a per-document double value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed double field for fast range filters.
An indexed Double Range field.
DocValues field for DoubleRange.
Field that can be used to store static scoring factors into documents.
Expert: directly create a field for a document.
Specifies whether and how a field should be stored.
Describes the properties of a field.
Syntactic sugar for encoding floats as NumericDocValues via Float.floatToRawIntBits(float).
Field that stores a per-document float value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed float field for fast range filters.
An indexed Float Range field.
DocValues field for FloatRange.
An indexed 128-bit InetAddress field.
An indexed InetAddress Range Field.
Field that stores a per-document int value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed int field for fast range filters.
An indexed Integer Range field.
DocValues field for IntRange.
Describes how an IndexableField should be inverted for indexing terms and postings.
Field that indexes a per-document String or BytesRef into an inverted index for fast filtering, stores values in a columnar fashion using DocValuesType.SORTED_SET doc values for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
A field that contains a single byte numeric vector (or none) for each document.
A field that contains a single floating-point numeric vector (or none) for each document.
A per-document location field.
An indexed location field.
A geo shape utility class for indexing and searching GIS geometries whose vertices are latitude, longitude values (in decimal degrees).
A concrete implementation of ShapeDocValues for storing binary doc value representation of LatLonShape geometries in a LatLonShapeDocValuesField.
Concrete implementation of a ShapeDocValuesField for geographic geometries.
Field that stores a per-document long value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed long field for fast range filters.
An indexed Long Range field.
DocValues field for LongRange.
Field that stores a per-document long value for scoring, sorting or value retrieval.
Query class for searching RangeField types by a defined PointValues.Relation.
Used by RangeFieldQuery to check how each internal or leaf node relates to the query.
A doc values field for LatLonShape and XYShape that uses ShapeDocValues as the underlying binary doc value format.
A base shape utility class used for both LatLon (spherical) and XY (cartesian) shape fields.
Represents an encoded triangle using ShapeField.decodeTriangle(byte[], DecodedTriangle).
Type of triangle.
Query Relation Types.
Polygons are decomposed into tessellated triangles using Tessellator; these triangles are encoded and inserted as separate indexed POINT fields.
Field that stores a per-document BytesRef value, indexed for sorting.
Field that stores per-document long values for scoring, sorting or value retrieval.
Field that stores a set of per-document BytesRef values, indexed for faceting, grouping, joining.
A field whose value is stored so that IndexSearcher.storedFields() and IndexReader.storedFields() will return the field and its value.
Abstraction around a stored value.
Type of a StoredValue.
A field that is indexed but not tokenized: the entire String value is indexed as a single token.
A field that is indexed and tokenized, without term vectors.
A per-document location field.
XYGeometry query for XYDocValuesField.
An indexed XY position field.
A cartesian shape utility class for indexing and searching geometries whose vertices are unitless x, y values.
A concrete implementation of ShapeDocValues for storing binary doc value representation of XYShape geometries in a XYShapeDocValuesField.
Concrete implementation of a ShapeDocValuesField for cartesian geometries.