cell - Retrieving extracted text with Apache Solr -

- March 15, 2014

I'm new to Apache Solr, and I want to use it for indexing PDF files. I've managed to get it up and running so far, and I can search for the added PDF files.

However, I need to be able to retrieve the searched text from the results.

I found this XML snippet in the default solrconfig.xml concerning that:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <!-- main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

From here (http://www.lucidimagination.com/community/hear-from-the-experts/articles/content-extraction-tika), I think I have to add a new field to schema.xml (e.g. "content") that has stored="true" and indexed="true". However, I'm not sure how to accomplish that exactly.

Any help is appreciated, thx

Add a schema.xml looking like this:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="whatever" version="1.2">
    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
        <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>
    </types>
    <fields>
        <field name="internal_id" type="string" indexed="true" stored="true"/>
        <field name="cat" type="int" indexed="true" stored="true"/>
        <field name="desc" type="text" indexed="true" stored="true"/>
    </fields>
    <uniqueKey>internal_id</uniqueKey>
    <defaultSearchField>desc</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
    <similarity class="org.apache.lucene.search.DefaultSimilarity"/>
</schema>

If a "field" is "stored", it will show up in the results by default.
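Applied to the setup in the question, this could look roughly like the sketch below. The field name "content" is only the example name suggested in the question, and the type "text" assumes a field type of that name exists in your schema, as it does in the schema above. The second fragment assumes the /update/extract defaults quoted in the question and simply points fmap.content at the new stored field.

```xml
<!-- schema.xml, inside <fields>: a stored field for the extracted body.
     "content" is a hypothetical name taken from the question. -->
<field name="content" type="text" indexed="true" stored="true"/>

<!-- solrconfig.xml, inside <lst name="defaults"> of the /update/extract handler:
     map the extracted body to the stored "content" field instead of "text". -->
<str name="fmap.content">content</str>
```

Stored values are written at index time, so PDF files that were indexed before such a change would have to be posted again before their text comes back in the results.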
