Mini Tutorial: Searching by SKU with Solr (BONUS: using Stop words)

Magento Enterprise Edition comes with the option to use Apache Solr as the main search engine. If you have had an opportunity to use it, you will know that it far exceeds the abilities of the built-in MySQL search. Setting up Apache Solr to work with Magento is surprisingly simple, as Magento provides the required configuration files; however, customizing how the two interact beyond the default configuration requires a much deeper understanding of how Solr actually works. That being said, I recently came across a shortcoming in the default Magento/Solr configuration that had me doing just that.

The shortcoming I’m referring to is searching by SKU, and depending on your product configuration it might go unnoticed. The issue arises when searching for a partial match on a SKU that contains a string of numbers. With the default configuration, the SKU field will not return results for a partial match on the numeric part; numbers are treated as all or nothing. For example, searching for a product with a SKU of “product123456” by the exact SKU returns exactly one match, and searching for just “product” returns all products with “product” in the SKU, yet searching for “product12345” returns no results (unless another product has that exact SKU), not even the original product. That is counter-intuitive and not what a user would expect.

The workaround I found is to alter the Solr configuration to add a new field called “sku_partial”, which is a clone of the SKU field but uses a new field type with analyzers that allow partial matches when searching. This can be accomplished by editing two files (solrconfig.xml and schema.xml, both located in “solr/conf/” in your Solr install directory) with just 4 changes.

In schema.xml, make the following changes:

1) Towards the bottom of the document inside the “schema” node add:

<copyField source="sku" dest="sku_partial" />

Note: this sets the source of our new “sku_partial” field.

2) Inside the “fields” node add:

<field name="sku_partial" type="sku_partial" indexed="true" stored="true"/>

Note: this is the declaration of our new “sku_partial” field, which uses our new type, also named “sku_partial” for consistency.

3) Inside the “types” node add:

<fieldType name="sku_partial" class="solr.TextField">
      <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
          <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory" />
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          <filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
      </analyzer>
      <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory" />
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
      </analyzer>
</fieldType>

Note: this is the new field type, which uses n-gram analyzers to break the SKU into smaller “tokens” so that partial matches are possible; 123456 will match on 123, 1234, 12345, etc.
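To make the n-gram idea concrete, here is a minimal Python sketch of what an n-gram filter with minGramSize="3" does to a single token. It is only an illustration, not Solr's implementation: the function name `ngrams` is made up for this example, and the other filters in the chain (stop words, stemming and so on) are ignored.

```python
def ngrams(token, min_size=3, max_size=1000):
    """Return every substring of `token` whose length lies between
    min_size and max_size (illustrative stand-in for an n-gram filter)."""
    token = token.lower()
    grams = []
    # No gram can be longer than the token itself.
    longest = min(max_size, len(token))
    for size in range(min_size, longest + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


indexed = set(ngrams("product123456"))

print("product12345" in indexed)  # True: the partial SKU is now an indexed token
print("12345" in indexed)         # True: so is the partial number
print("12" in indexed)            # False: shorter than the minimum gram size
```

Because every substring of three or more characters is stored at index time, a partial SKU can match without any wildcard in the query. The trade-off is a larger index, since a 13-character SKU alone produces 66 grams.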

4) Next, in solrconfig.xml locate the “requestHandler” node(s) used for your store's locale (search for “magento_en” for English and “magento_fr” for French) and change the following lines from:

<str name="qf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 </str>
<str name="pf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 </str>

To:

<str name="qf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 sku_partial^1.0</str>
<str name="pf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 sku_partial^1.0</str>

Note: here we're telling Solr/Magento that it can search the new field we added.

That’s it. Restart Solr to load the new configuration, re-index the catalog search index to populate the new field, and you’re ready for better search results.
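One way to check the new field after re-indexing is to query it directly, outside of Magento. The sketch below only builds the query URL. The host, port and handler path (`http://localhost:8983/solr/select`) are assumptions about a typical local install, and `build_sku_query` is a hypothetical helper name, so adjust both to match your setup.

```python
from urllib.parse import urlencode


def build_sku_query(partial_sku, base_url="http://localhost:8983/solr/select"):
    """Build a Solr select URL that searches only the sku_partial field.

    base_url is an assumed default; point it at your own Solr instance.
    """
    params = {
        "q": f"sku_partial:{partial_sku}",  # restrict the search to the new field
        "fl": "sku",                        # return only the SKU of each match
        "wt": "json",                       # ask for a JSON response
        "rows": 10,                         # cap the number of results
    }
    return f"{base_url}?{urlencode(params)}"


print(build_sku_query("product12345"))
```

Opening the printed URL in a browser (or fetching it with curl) should list the product whose SKU is “product123456” if the field was populated as expected.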

BONUS: Using Stop words to improve search relevancy:

To help improve your search relevancy, use what Solr refers to as “Stop words”, which are words that should be ignored and not matched on during a search. The configuration files that come with Magento include a stopwords.txt file with a list of the basics (“to”, “the”, “and”, “at”, etc.). However, while working on this project I discovered that this file isn’t actually being used; the locale-specific files are used instead. So, to ensure you’re getting accurate and relevant search results, be sure to add your desired “Stop words” to the “stopwords_YOURLOCALE.txt” file located in your “solr/conf” directory.
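As a rough illustration of what a stop word filter does, the Python sketch below reads a word list in the one-word-per-line format (with “#” comment lines) and drops those words from a query. The function names and the sample file contents are made up for this example; Solr does this inside its analyzer chain, not in Python.

```python
def parse_stop_words(text):
    """Parse a stop word list: one word per line, '#' lines are comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line.lower())
    return words


def remove_stop_words(query, stop_words):
    """Lowercase the query, split on whitespace and drop any stop words."""
    return [term for term in query.lower().split() if term not in stop_words]


# Hypothetical contents of a locale-specific stop word file.
sample_file = "# basic English stop words\nto\nthe\nand\nat\n"
stop_words = parse_stop_words(sample_file)

print(remove_stop_words("The lamp and the shade", stop_words))  # ['lamp', 'shade']
```

Only “lamp” and “shade” are left to be matched, which is why common filler words stop influencing the ranking once they are in the file Solr actually reads.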