Package org.apache.lucene.classification
Class SimpleNaiveBayesClassifier
java.lang.Object
org.apache.lucene.classification.SimpleNaiveBayesClassifier
- All Implemented Interfaces:
Classifier<BytesRef>
- Direct Known Subclasses:
CachingNaiveBayesClassifier, SimpleNaiveBayesDocumentClassifier
A simplistic Lucene-based Naive Bayes classifier; see
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Field Summary
Fields (modifier and type, field name, description):
protected final Analyzer analyzer - Analyzer to be used for tokenizing unseen input text
protected final String classFieldName - name of the field to be used as a class / category output
protected final IndexReader indexReader - IndexReader used to access the Classifier's index
protected final IndexSearcher indexSearcher - IndexSearcher to run searches on the index for retrieving frequencies
protected final Query query - Query used to optionally filter the set of documents used for classification
protected final String[] textFieldNames - names of the fields to be used as input text
-
Constructor Summary
Constructors (constructor, description):
SimpleNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, String classFieldName, String... textFieldNames) - Creates a new NaiveBayes classifier.
-
Method Summary
Methods (modifier and type, method, description):
ClassificationResult<BytesRef> assignClass(String inputDocument) - Assign a class (with score) to the given text String.
protected List<ClassificationResult<BytesRef>> assignClassNormalizedList(String inputDocument) - Calculate probabilities for all classes for a given input text.
protected int countDocsWithClass() - Count the number of documents in the index having at least one value for the 'class' field.
List<ClassificationResult<BytesRef>> getClasses(String text) - Get all the classes (sorted by score, descending) assigned to the given text String.
List<ClassificationResult<BytesRef>> getClasses(String text, int max) - Get the first max classes (sorted by score, descending) assigned to the given text String.
protected ArrayList<ClassificationResult<BytesRef>> normClassificationResults(List<ClassificationResult<BytesRef>> assignedClasses) - Normalize the classification results based on the max score available.
protected String[] tokenize(String text) - Tokenize a String on this classifier's text fields and analyzer.
-
Field Details
-
indexReader
IndexReader used to access the Classifier's index
-
textFieldNames
names of the fields to be used as input text
-
classFieldName
name of the field to be used as a class / category output
-
analyzer
Analyzer to be used for tokenizing unseen input text
-
indexSearcher
IndexSearcher to run searches on the index for retrieving frequencies
-
query
Query used to optionally filter the set of documents used for classification
-
-
Constructor Details
-
SimpleNaiveBayesClassifier
public SimpleNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, String classFieldName, String... textFieldNames)
Creates a new NaiveBayes classifier.
- Parameters:
indexReader - the reader on the index to be used for classification
analyzer - an Analyzer used to analyze unseen text
query - a Query to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
classFieldName - the name of the field used as the output for the classifier. NOTE: must not be heavily analyzed, as the returned class will be a token indexed for this field
textFieldNames - the names of the fields used as the inputs for the classifier. NO boosting supported per field
-
-
Method Details
-
assignClass
public ClassificationResult<BytesRef> assignClass(String inputDocument) throws IOException
Description copied from interface: Classifier
Assign a class (with score) to the given text String.
- Specified by:
assignClass in interface Classifier<BytesRef>
- Parameters:
inputDocument - a String containing text to be classified
- Returns:
a ClassificationResult holding the assigned class of type T and its score
- Throws:
IOException - If there is a low-level I/O error.
-
getClasses
public List<ClassificationResult<BytesRef>> getClasses(String text) throws IOException
Description copied from interface: Classifier
Get all the classes (sorted by score, descending) assigned to the given text String.
- Specified by:
getClasses in interface Classifier<BytesRef>
- Parameters:
text - a String containing text to be classified
- Returns:
the whole list of ClassificationResult, the classes and scores. Returns null if the classifier can't make lists.
- Throws:
IOException - If there is a low-level I/O error.
-
getClasses
public List<ClassificationResult<BytesRef>> getClasses(String text, int max) throws IOException
Description copied from interface: Classifier
Get the first max classes (sorted by score, descending) assigned to the given text String.
- Specified by:
getClasses in interface Classifier<BytesRef>
- Parameters:
text - a String containing text to be classified
max - the number of return list elements
- Returns:
the list of ClassificationResult (classes and scores), cut to at most max elements. Returns null if the classifier can't make lists.
- Throws:
IOException - If there is a low-level I/O error.
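The "sorted by score, descending, cut to max elements" contract can be illustrated without Lucene. The following is only a sketch under stated assumptions: ScoredLabel is a hypothetical stand-in for ClassificationResult<BytesRef>, and TopClassesSketch.top is an illustrative helper, not a method of this class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for ClassificationResult<BytesRef>: a label plus a score.
final class ScoredLabel {
    final String label;
    final double score;

    ScoredLabel(String label, double score) {
        this.label = label;
        this.score = score;
    }
}

final class TopClassesSketch {
    // Returns at most `max` results, highest score first.
    // The input list is copied, so the caller's list is not modified.
    static List<ScoredLabel> top(List<ScoredLabel> results, int max) {
        List<ScoredLabel> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble((ScoredLabel r) -> r.score).reversed());
        // Clamp `max` to the range [0, sorted.size()] before cutting.
        int end = Math.min(Math.max(max, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, end));
    }

    public static void main(String[] args) {
        List<ScoredLabel> results = List.of(
            new ScoredLabel("sports", 0.2),
            new ScoredLabel("politics", 0.7),
            new ScoredLabel("tech", 0.1));
        for (ScoredLabel r : top(results, 2)) {
            System.out.println(r.label + " " + r.score);
        }
    }
}
```

Clamping max keeps the sketch safe when a caller asks for more classes than exist; whether the real method behaves the same way is not stated on this page.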
-
assignClassNormalizedList
protected List<ClassificationResult<BytesRef>> assignClassNormalizedList(String inputDocument) throws IOException
Calculate probabilities for all classes for a given input text.
- Parameters:
inputDocument - the input text as a String
- Returns:
a List of ClassificationResult, one for each existing class
- Throws:
IOException - if assigning probabilities fails
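To make "calculate probabilities for all classes" concrete, here is a minimal sketch of textbook multinomial Naive Bayes scoring with add-one smoothing, computed over in-memory counts. It is an assumption-laden illustration, not this class's actual computation: NaiveBayesSketch, train and logScores are hypothetical names, and the real classifier reads its frequencies from the Lucene index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class NaiveBayesSketch {
    // Hypothetical in-memory "index": document and token counts per class.
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> totalTokens = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Records one training document with its class label and tokens.
    void train(String label, String[] tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalTokens.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    // For each class c: log P(c) + sum over tokens t of log P(t | c),
    // where P(t | c) uses add-one (Laplace) smoothing.
    // Assumes at least one token was seen in training, so the denominator is positive.
    Map<String, Double> logScores(String[] tokens) {
        Map<String, Double> scores = new TreeMap<>();
        for (String label : docsPerClass.keySet()) {
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int denominator = totalTokens.getOrDefault(label, 0) + vocabulary.size();
            for (String t : tokens) {
                score += Math.log((counts.getOrDefault(t, 0) + 1.0) / denominator);
            }
            scores.put(label, score);
        }
        return scores;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("spam", new String[] {"buy", "cheap", "pills"});
        nb.train("ham", new String[] {"meeting", "at", "noon"});
        System.out.println(nb.logScores(new String[] {"cheap", "pills"}));
    }
}
```

Working in log space avoids underflow when many small token probabilities are multiplied; the resulting scores are unnormalized log values, not probabilities.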
-
countDocsWithClass
protected int countDocsWithClass() throws IOException
Count the number of documents in the index having at least one value for the 'class' field.
- Returns:
the number of documents having a value for the 'class' field
- Throws:
IOException - if accessing term vectors or searching fails
-
tokenize
protected String[] tokenize(String text) throws IOException
Tokenize a String on this classifier's text fields and analyzer.
- Parameters:
text - the String representing an input text (to be classified)
- Returns:
a String array of the resulting tokens
- Throws:
IOException - if tokenization fails
-
normClassificationResults
protected ArrayList<ClassificationResult<BytesRef>> normClassificationResults(List<ClassificationResult<BytesRef>> assignedClasses)
Normalize the classification results based on the max score available.
- Parameters:
assignedClasses - the list of assigned classes
- Returns:
the normalized results
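One common way to normalize log-domain scores "based on the max score" is to subtract the maximum before exponentiating, then divide by the sum (the log-sum-exp trick). The sketch below assumes the input scores are finite log values. This page does not confirm that assumption or the exact formula used, and NormalizeSketch.normalize is a hypothetical helper.

```java
final class NormalizeSketch {
    // Maps finite log scores to values in (0, 1] that sum to 1.
    // Subtracting the max first keeps Math.exp from underflowing to 0 for every entry.
    static double[] normalize(double[] logScores) {
        if (logScores.length == 0) {
            return new double[0];
        }
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        for (double s : logScores) {
            sum += Math.exp(s - max);
        }
        double[] normalized = new double[logScores.length];
        for (int i = 0; i < normalized.length; i++) {
            normalized[i] = Math.exp(logScores[i] - max) / sum;
        }
        return normalized;
    }

    public static void main(String[] args) {
        double[] p = normalize(new double[] {-1000.0, -1001.0});
        System.out.println(p[0] + " " + p[1]);
    }
}
```

Exponentiating scores such as -1000 directly would yield 0 for every class; anchoring on the maximum keeps the largest term at exp(0) = 1, so the ratios survive.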
-