Help:Search: Difference between revisions

<!--T:413-->
The quickest way to find information in the Traditional Tune Archive is to look it up directly. On every page there is a '''search''' box.
 
<!--T:1-->
'''CirrusSearch''' is a Traditional Tune Archive extension that uses Elasticsearch to provide enhanced search features over the default TTA search. The Wikimedia Foundation uses CirrusSearch for all Wikimedia projects.
<!--T:66-->
This page describes the features of CirrusSearch.
<!--T:3-->
<!--T:174-->
== How it works ==
<!--T:414-->
<!--T:415-->
Enter keywords and phrases and press ''Enter'' or ''Return'' on your keyboard, or click the magnifying glass icon or the Search or Go button.
<!--T:416-->
If a tune page has exactly the title you entered, you will be taken straight to that page.
<!--T:417-->
Otherwise, the search covers all pages on the wiki and presents a list of articles that match your search terms, or a message informing you that no page contains all the keywords and phrases.
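
The go-or-search decision described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name <code>go_or_search</code> and the in-memory <code>pages</code> dictionary are hypothetical, and the matching is a plain case-insensitive substring test, not what CirrusSearch actually does.

```python
def go_or_search(query, pages):
    """pages maps a page title to its text (a hypothetical stand-in for the wiki)."""
    # An exact title match sends the reader straight to that page.
    if query in pages:
        return ("go", query)
    # Otherwise list every page whose text contains all of the keywords.
    words = query.lower().split()
    hits = [title for title, text in pages.items()
            if all(word in text.lower() for word in words)]
    return ("search", hits)
```

An empty hit list corresponds to the message that no page has all the keywords and phrases.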
 
<!--T:418-->
If you click the search button without filling in anything, you will be taken to "Special:Search", which gives you extra searching options (also available from any search results list).
 
<!--T:419-->
You may find it useful to restrict a search to pages within a particular namespace, e.g., to search only within the Book pages.
<!--T:420-->
Check the namespaces you require for this search.
 
<!--T:421-->
By default only the namespaces specified in your preferences will be searched.
<!--T:422-->
Logged-in users can change their preferences to specify the namespaces they want to search by default.
<!--T:423-->
This can be done by selecting and deselecting boxes in the "search" section of user preferences.
 
<!--T:466-->
Adding further default namespaces has the side effect of changing the autocomplete algorithm from the default to a somewhat stricter one.
 
== What's improved? ==
 
<!--T:5-->
CirrusSearch features three main improvements over the default MediaWiki search, namely:
 
<!--T:6-->
* Better support for searching in different languages.
<!--T:7-->
* Faster updates to the search index, meaning changes to articles are reflected in search results much faster.
<!--T:8-->
* Expanding templates, meaning that all content from a template is now reflected in search results.
 
== How frequently is the search index updated? == <!--T:9-->
 
<!--T:10-->
Updates to the search index are done in near real time.
<!--T:67-->
Changes to pages should appear immediately in the search results.
<!--T:68-->
Changes to templates should take effect within a few minutes in articles that include the template.
<!--T:69-->
Template changes go through the job queue, so performance may vary.
<!--T:70-->
A null edit to the article will force the change through, but that shouldn't be required if everything is going well.
 
== Search suggestions == <!--T:11-->
 
<!--T:12-->
The search suggestions that drop down as you type into the search box are sorted by a rough measure of article quality.
<!--T:371-->
This takes into account the number of incoming wikilinks, the size of the page, the number of external links, the number of headings, and the number of redirects.
<!--T:71-->
To skip the search suggestions and send a query directly to the search results page, add a tilde <code>~</code> before the query, for example <code>~Jackie Layton</code>. The search suggestions will still appear, but pressing the Enter key at any time will take you to the search results page.
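
As a rough sketch, a link that always lands on the results page can be built by putting the tilde in front of the query before encoding it. The base URL below is a placeholder and the helper name <code>search_results_url</code> is hypothetical; the <code>title=Special:Search</code> and <code>search=</code> parameters are the usual MediaWiki ones, but check them against this wiki before relying on them.

```python
from urllib.parse import urlencode

# Placeholder address; substitute the wiki's real index.php URL.
BASE_URL = "https://wiki.example.org/index.php"

def search_results_url(query):
    """Build a search link that shows results instead of jumping to a page."""
    # The leading tilde asks for search results rather than navigation.
    if not query.startswith("~"):
        query = "~" + query
    return BASE_URL + "?" + urlencode({"title": "Special:Search", "search": query})
```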
 
<!--T:13-->
Folding of accents and diacritics to ASCII is turned on for English text, but there are some formatting problems with the result.
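
To give a feel for what folding means, the sketch below strips accents with Python's standard <code>unicodedata</code> module. It is an approximation for illustration only: the function name <code>fold</code> is hypothetical, and the folding Elasticsearch performs covers more characters than this (letters such as "ø" have no decomposition and pass through unchanged here).

```python
import unicodedata

def fold(text):
    """Lower-case the text and drop accents so that "Café" and "cafe" compare equal."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```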
 
== Full text search == <!--T:14-->
 
<!--T:15-->
A "full text search" is an "indexed search". All pages are stored in the wiki database, and all the words in the non-redirect pages are stored in the search database, which is an index to practically the full text of the wiki. Each visible word is indexed to the list of pages where it is found, so a search for a word is as fast as looking up a single record.
<!--T:427-->
Note that the tagline is not part of the actual content. To see the searchable content for a page, append <code>?action=cirrusdump</code> to the URL.
<!--T:428-->
Furthermore, for any changes in wording, the search index is updated within seconds.
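
Appending <code>?action=cirrusdump</code> by hand works for plain page URLs, but a URL that already carries a query string needs <code>&</code> instead of <code>?</code>. The helper below is a small sketch that handles both cases; the name <code>cirrusdump_url</code> and the example addresses are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def cirrusdump_url(page_url):
    """Return page_url with action=cirrusdump set in its query string."""
    parts = urlsplit(page_url)
    # Keep the existing parameters, replacing any earlier "action" value.
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key != "action"]
    params.append(("action", "cirrusdump"))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

For a placeholder address such as <code>https://wiki.example.org/wiki/Help:Search</code>, the result is the same address with <code>?action=cirrusdump</code> at the end.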
<!--T:175-->
There are many indexes of the "full text" of the wiki to facilitate the many types of searches needed. The full wikitext is indexed many times into many special-purpose indexes, each parsing the wikitext in whatever way optimizes its use. Example indexes include:
 
<!--T:176-->
* "Auxiliary" text includes hatnotes, captions, the table of contents, and any wikitext classed by an HTML attribute.
* "Lead-in" text is the wikitext between the top of the page and the first heading.
* "Category" text indexes the category listings at the bottom of the page.
* Templates are indexed. If the transcluded words of a template change, then all the pages that transclude it are updated. (This can take a long time depending on a job queue.) If the subtemplates used by a template change, the index is updated.
* Document contents that are stored in the File/Media namespace are now indexed. Thousands of formats are recognized.
 
<!--T:78-->
There is support for dozens of languages, but all languages are wanted.
<!--T:81-->
There is a list of currently supported languages at [http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html elasticsearch.org]; see their [http://www.elasticsearch.org/contributing-to-elasticsearch/ documentation on contributing] to submit requests or patches.
 
<!--T:177-->
CirrusSearch will optimize your query and run it. The resulting titles are weighted by relevance and heavily post-processed, 20 at a time, for the search results page. For example, snippets are taken from the article, and search terms are highlighted in bold text.
 
<!--T:178-->
Search results will often be accompanied by various preliminary reports. These include ''Did you mean'' (spelling correction) and, when no results would otherwise be found, ''Showing results for'' (query correction) together with ''search instead for'' (your query).
 
<!--T:179-->
Search features also include:
 
<!--T:180-->
* Sorting navigation suggestions by the number of incoming links.
* Starting with the tilde character ~ to disable navigation and suggestions in such a way that also preserves page ranking.
* Smart-matching characters by normalizing (or "folding") non-keyboard characters into keyboard characters.
<!--T:181-->
* Words and phrases that match are highlighted in bold on the search results page. The highlighter is a cosmetic analyzer, while the search-indexing analyzer actually finds the page, and these may not be 100% in sync, especially for regex. The highlighter can match more or less accurately than the indexer.
 
=== Words, phrases, and modifiers === <!--T:182-->
 
<!--T:183-->
The basic search term is a word or a "phrase in quotes". Search recognizes a "word" to be:
 
<!--T:184-->
* a string of digits
* a string of letters
* subwords between letters/digit transitions, such as in txt2regex
* subwords inside a compoundName using camelCase
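
The four rules above can be approximated with a single regular expression. The sketch below is an illustration, not the analyzer CirrusSearch uses: the function name <code>subwords</code> is hypothetical, and it handles ASCII letters and digits only.

```python
import re

# A subword is a run of digits, a run of capitals that stops before a
# capitalised word, or an optionally capitalised run of lower-case letters.
_SUBWORD = re.compile(r"\d+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

def subwords(term):
    """Split a term at letter/digit transitions and camelCase boundaries."""
    return [match.group(0).lower() for match in _SUBWORD.finditer(term)]
```

With this sketch, <code>txt2regex</code> splits into <code>txt</code>, <code>2</code> and <code>regex</code>, and <code>compoundName</code> splits into <code>compound</code> and <code>name</code>.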
 
<!--T:185-->
A "stop word" is a word that is ignored (because it is common, or for other reasons).<!--T:429--> Stop words are rarely called for in CirrusSearch, except for when they are in certain kinds of phrases, as explained below.
<!--T:186-->
A given search term matches against ''content'' (rendered on the page). To match against wikitext instead, use the insource search parameter (see [[#Insource|section]] below). Each search parameter has its own index, and interprets its given term in its own way.<!--T:430--> CirrusSearch parameters do not use a consistent way to handle these search terms.
 
<!--T:187-->
Spacing between words, phrases, parameters, and input to parameters can include generous instances of whitespace and ''greyspace characters''. "Greyspace characters" are all the non-alphanumeric characters <code>~!@#$%^&*()_+-={}|[]\:";'<>?,./</code>. A mixed string of ''greyspace characters'' and whitespace characters is "greyspace", and is treated as one big word boundary. Greyspace is how indexes are made and queries are interpreted.<!--T:431--> [https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html The same analyzer] used to index the wikitext is also used to interpret the query.
 
<!--T:188-->
Two exceptions are where 1) an <code>embedded:colon</code> is one word (the colon being treated as a letter), and 2) an embedded comma <code>,</code> such as in <code>1,2,3</code>, is treated as part of a number.
Greyspace characters are otherwise ignored unless, due to query syntax, they can be interpreted as modifier characters.
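
The greyspace rule and its two exceptions can be sketched as a tokenizer that treats every run of non-alphanumeric characters as one boundary, except for a colon between words and commas between digits. This is a simplified illustration with a hypothetical function name, limited to ASCII; it does not model the modifier characters.

```python
import re

# First alternative: digits joined by commas, such as 1,2,3.
# Second alternative: alphanumeric runs, optionally joined by colons.
_TOKEN = re.compile(r"\d+(?:,\d+)+|[A-Za-z0-9]+(?::[A-Za-z0-9]+)*")

def split_on_greyspace(text):
    """Return the words of text, treating any run of greyspace as one boundary."""
    return _TOKEN.findall(text)
```

Under these assumptions <code>words-joined_by (greyspace)</code> yields four words, while <code>embedded:colon</code> and <code>1,2,3</code> each stay in one piece.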
 
<!--T:189-->
The modifiers are <code>~ * \? - " ! </code>. Depending on their placement in the syntax they can apply to a term, a parameter, or to an entire query. Word and phrase modifiers are the wildcard, proximity, and fuzzy searches. Each parameter can have its own modifiers, but in general:
 
<!--T:190-->
* A fuzzy-word or fuzzy-phrase search can suffix a tilde <code>~</code> character (and a number telling the degree).
* A tilde <code>~</code> character prefixed to the first term of a query guarantees search results instead of any possible navigation.
* A wildcard character inside a word can be an (escaped) question mark <code>\?</code> for one character or an asterisk <code>*</code> character for more.
* Truth-logic can interpret <code>AND</code> and <code>OR</code>, but parameters cannot.
* Truth-logic understands <code>-</code> or <code>!</code> prefixed to a term to invert the usual meaning of the term from "match" to "exclude".
Words that begin with <code>-</code> or <code>!</code>, such as <code>-in-law</code> or <code>!Kung</code>, can exactly match titles and redirects, but will also match every document that does ''not'' contain the negated word, which is usually almost all documents. To search for such terms other than as exact matches for titles or redirects, use the insource search parameter.
* Quotes around words mark an "exact phrase" search. For parameters they are also needed to delimit multi-word input.
* Stemming is automatic but can be turned off using an "exact phrase".
</translate>
 
The two wildcard characters are the star and the (escaped) question mark, and both can come in the middle or end of a word. The escaped question mark <code>\?</code> stands for one character and the star <code>*</code> stands for any number of characters. Because many users, instead of writing a query, will ask a question, any question mark is ignored unless purposefully escaped <code>\?</code> into its wildcard meaning.
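
One way to picture the two wildcards is to translate a search term into a regular expression. The sketch below does that under the rules just stated, taking "any number of characters" to include none; the function name <code>wildcard_to_regex</code> is hypothetical and this is not how CirrusSearch evaluates wildcards internally.

```python
import re

def wildcard_to_regex(term):
    """Translate \\? and * wildcards into a case-insensitive regular expression."""
    pieces = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term) and term[i + 1] == "?":
            pieces.append(".")    # escaped question mark: exactly one character
            i += 2
        elif ch == "*":
            pieces.append(".*")   # star: any number of characters
            i += 1
        elif ch == "?":
            i += 1                # a bare question mark is ignored
        else:
            pieces.append(re.escape(ch))
            i += 1
    return re.compile("".join(pieces), re.IGNORECASE)
```

Use the compiled pattern with <code>fullmatch</code> to test a whole word, for example <code>wildcard_to_regex("horn*").fullmatch("hornpipe")</code>.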
 
<!--T:192-->
A phrase search can be initiated by various hints to the search engine. Each method of hinting has a side-effect of how tolerant the matching of the word sequence will be. For ''greyspace'', ''camelCase'', or ''txt2number'' hints:
</translate>
<translate>
<!--T:193-->
* Given <code>words-joined_by_greyspace(characters)</code> or <code>wordsJoinedByCamelCaseCharacters</code> it finds <code>words joined by</code> ... <code>characters</code>, in their bare forms or greyspace forms.
* <code>txt2number</code> will match <code>txt 2 number</code> or <code>txt-2.number</code>
* Stop words are enabled for the edge cases (in the periphery) of a grey_space or camelCase phrase. An example using <code>the</code>, <code>of</code>, and <code>a</code> is that <code>the_invisible_hand_of_a</code> matches <code>meetings invisible hand shake</code>.
 
<!--T:194-->
A "search instead" report is triggered when a universally unknown word is ignored in a phrase.
 
<!--T:195-->
Each one of the following types of phrase-matching contains and widens the match-tolerances of the previous one: 
 
<!--T:196-->
* An "exact phrase" "in quotes" will tolerate (match with) greyspace. Given <code>"exact_phrase"</code> or <code>"exact phrase"</code> it matches <code>"exact]phrase"</code>.
* A greyspace_phrase initiates stemming and ''stop word'' checks.
* Given <code>CamelCase</code> it will ''additionally'' match <code>camelcase</code>, in all lowercase, because CirrusSearch is not case sensitive in matching.
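
The first and narrowest tolerance, an exact phrase that still matches across greyspace and ignores case, can be sketched by normalising both the phrase and the page text before comparing them. The helper names are hypothetical and the sketch is ASCII-only; stemming and stop words are deliberately left out, since an exact phrase does not use them.

```python
import re

def normalise(text):
    """Lower-case the text and collapse every run of greyspace to one space."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def exact_phrase_matches(phrase, document):
    """True if the words of phrase occur in document in order, as whole words."""
    # Padding with spaces prevents a match that starts or ends inside a word.
    return f" {normalise(phrase)} " in f" {normalise(document)} "
```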
 
<!--T:424-->
Page ranking saves you from typing quotes for a two-word search. With no quotes a word-pair index is used in page-ranking, plus it finds the two words anywhere on the page. 
 
<!--T:198-->
Some parameters interpret greyspace phrases, but other parameters, like <code>insource</code>, only interpret the usual "phrase in quotes".
 
In search terminology, support for "stemming" means that a search for "swim" will also include "swimming" and "swimmed", but not "swam".
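
A toy suffix-stripping stemmer shows why regular forms group together while an irregular form such as "swam" does not. This is a deliberately crude sketch with a hypothetical function name, not the stemmer Elasticsearch applies, and it will mis-stem many real words.

```python
def toy_stem(word):
    """Strip a common English suffix and undo consonant doubling before it."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # "swimm" -> "swim": English doubles the consonant before -ing and -ed.
            doubled = stem[-1] == stem[-2] and stem[-1] not in "aeiou"
            if suffix != "s" and doubled:
                stem = stem[:-1]
            return stem
    return word
```

Here "swim", "swimming" and "swimmed" all reduce to the same stem, while "swam" is left unchanged and so would not be found by a search for "swim".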

Revision as of 16:32, 12 February 2020

</translate>
