Extension talk:CirrusSearch/2018
This page used the Structured Discussions extension to give structured discussions. It has since been converted to wikitext, so the content and history here are only an approximation of what was actually displayed at the time these comments were made.
Discussion related to the CirrusSearch MediaWiki extension.
See also the open tasks for CirrusSearch on phabricator.
Problem running updateSuggesterIndex.php
Using REL1_29
I already ran the forceSearchIndex.php script.
Look at the indices:
> curl -s localhost:9200/_cat/indices
green open wikidexwiki_content_first EgLUpqUuS66x1bUvIVtoKw 4 0 14106 2836 739.2mb 739.2mb
green open wikidexwiki_general_first L4Y2SfehQdCg2oaZiT2Ing 4 0 262131 62386 1.9gb 1.9gb
green open mw_cirrus_metastore_first 2cf0X6ZpQi6yr6ZE0-6jSA 1 0 3 2 8.5kb 8.5kb
Running updateSuggesterIndex.php fails:
> php extensions/CirrusSearch/maintenance/updateSuggesterIndex.php
Scanning available plugins... analysis-icu
Picking analyzer...spanish
Fetching Elasticsearch version...5.6.5...ok
Inferring index identifier...wikidexwiki_titlesuggest_first
Setting index identifier...wikidexwiki_titlesuggest_1515680487
2018-01-11 14:21:27 Waiting for the index to go green... Green!
2018-01-11 14:21:27 Setting max_docs to 14106
2018-01-11 14:21:27 Indexing 14106 documents from content with batchId: 1515680487 and scoring method: quality
10% done... 14% done... 24% done... 28% done... 38% done... 42% done... 46% done... 56% done... 60% done... 70% done... 74% done... 88% done... 92% done... 100% done...
2018-01-11 14:21:36 Indexing from content index done.
2018-01-11 14:21:36 Indexing 61 documents from general with batchId: 1515680487 and scoring method: quality
2018-01-11 14:21:36 Indexing from general index done.
2018-01-11 14:21:36 Enabling replicas...
2018-01-11 14:21:56 Waiting for the index to go green... Green!
2018-01-11 14:21:57 Updating tracking indexes...[cebd4be3bd4e9c30eefa478c] [no req] Exception from line 745 of ...extensions/CirrusSearch/maintenance/updateSuggesterIndex.php: meta store does not exist, you must index your data first
Backtrace:
#0 ...extensions/CirrusSearch/maintenance/updateSuggesterIndex.php(319): CirrusSearch\Maintenance\UpdateSuggesterIndex->updateVersions()
#1 ...extensions/CirrusSearch/maintenance/updateSuggesterIndex.php(240): CirrusSearch\Maintenance\UpdateSuggesterIndex->rebuild()
#2 ...maintenance/doMaintenance.php(111): CirrusSearch\Maintenance\UpdateSuggesterIndex->execute()
#3 ...extensions/CirrusSearch/maintenance/updateSuggesterIndex.php(793): require_once(string)
#4 {main}
The last HTTP request I saw it make before failing was:
HEAD /mw_cirrus_metastore/version HTTP/1.1
Host: 127.0.0.1:9200
Accept: */*
Accept-Encoding: deflate, gzip

HTTP/1.1 400 Bad Request
content-type: text/plain; charset=UTF-8
content-length: 73
I looked again at the indices and a bogus wikidexwiki_titlesuggest_1515680487 has been created:
green open wikidexwiki_titlesuggest_1515680487 LQlgfNreSsumHQRxL9hvcg 4 0 18997 0 7.1mb 7.1mb
green open mw_cirrus_metastore_first 2cf0X6ZpQi6yr6ZE0-6jSA 1 0 3 2 8.5kb 8.5kb
green open wikidexwiki_general_first L4Y2SfehQdCg2oaZiT2Ing 4 0 262131 62386 1.9gb 1.9gb
green open wikidexwiki_content_first EgLUpqUuS66x1bUvIVtoKw 4 0 14106 2836 739.2mb 739.2mb
I don't know why it tries to get the mw_cirrus_metastore index while all my indices seem to have a _first suffix... Ciencia Al Poder (talk) 14:26, 11 January 2018 (UTC)
- CirrusSearch relies on index aliases, and error messages may sometimes refer to these aliases instead of the actual indices. In this case it perhaps complains because the alias mw_cirrus_metastore does not point to mw_cirrus_metastore_first.
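For reference, a quick way to list every alias and the concrete index it resolves to (a sketch assuming Elasticsearch listens on localhost:9200):
curl -s localhost:9200/_cat/aliases?v   # one row per alias: alias name, target index, filter/routing columns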
- What you describe in your message sounds like a bug in CirrusSearch.
- Could you provide us more information by dumping the output of:
curl -s localhost:9200/mw_cirrus_metastore_first/_aliases?pretty
- If there are no aliases for this index you may try to fix it by running:
curl -XPOST localhost:9200/_aliases/ -d '{"actions": [{"add": { "alias": "mw_cirrus_metastore", "index": "mw_cirrus_metastore_first"}}]}'
- and rerun the updateSuggesterIndex.php script.
- Thanks for your feedback! DCausse (WMF) (talk) 17:13, 11 January 2018 (UTC)
- Another thing to check would be to verify that the Elastica version you are using is right, because I realized that the HTTP request you captured is wrong: it should be HEAD /mw_cirrus_metastore, not HEAD /mw_cirrus_metastore/version.
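For reference, the expected request can be reproduced by hand with curl, whose -I flag sends a HEAD request (a sketch assuming Elasticsearch listens on localhost:9200):
curl -sI localhost:9200/mw_cirrus_metastore   # 200 OK if the alias resolves, 404 if not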
- Did you get the Elastica extension with the REL1_29 tag as well?
Also could you update your message by adding the version of elasticsearch you use? (5.6.5)
- Thanks! DCausse (WMF) (talk) 17:40, 11 January 2018 (UTC)
- Both were downloaded for REL1_29
- Elastica is version 1.3.0.0. I still have the snapshots:
- Elastica-REL1_29-e2a9593.tar.gz
- CirrusSearch-REL1_29-5ca9036.tar.gz Ciencia Al Poder (talk) 17:50, 11 January 2018 (UTC)
> curl -s localhost:9200/mw_cirrus_metastore_first/_aliases?pretty
{
  "mw_cirrus_metastore_first" : {
    "aliases" : {
      "mw_cirrus_metastore" : { }
    }
  }
}
- The alias exists. I didn't know about aliases.
- So I've tried to do the same request but with GET instead of HEAD and... voila:
> curl localhost:9200/mw_cirrus_metastore/version?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No endpoint or operation is available at [version]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No endpoint or operation is available at [version]"
  },
  "status" : 400
}
- Maybe the /version endpoint is only available in specific ES versions? This is mine:
- Ciencia Al Poder (talk) 17:48, 11 January 2018 (UTC)
> curl localhost:9200/?pretty
{
  "name" : "wikidexsearch1-n1",
  "cluster_name" : "wikidexsearch1",
  "cluster_uuid" : "evILbJFIQKKMpzIcuII1Bw",
  "version" : {
    "number" : "5.6.5",
    "build_hash" : "6a37571",
    "build_date" : "2017-12-04T07:50:10.466Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
- I now remember that we had to bump elastica to a more recent version when we migrated from elasticsearch 5.3 to 5.5
- I'm afraid that if you want to run MW REL1_29 you'll have to try to downgrade elastic to the latest 5.3 version.
- Another hazardous solution would be to hack cirrus to work around this problem by changing the function in includes/Maintenance/MetaStoreIndex.php
- from:
public static function updateMetastoreVersions( Connection $connection, $indexBaseName, $indexTypeName ) {
	$index = self::getVersionType( $connection );
	if ( !$index->exists() ) { // <========== This line triggers the bug in elastica
		throw new \Exception( "meta store does not exist, you must index your data first" );
	}
	$index->addDocument( self::versionData( $connection, $indexBaseName, $indexTypeName ) );
}
- to:
public static function updateMetastoreVersions( Connection $connection, $indexBaseName, $indexTypeName ) {
	$index = self::getVersionType( $connection );
	if ( !$index->getIndex()->exists() ) { // hack to workaround incompatibility with elastic 5.5+
		throw new \Exception( "meta store does not exist, you must index your data first" );
	}
	$index->addDocument( self::versionData( $connection, $indexBaseName, $indexTypeName ) );
}
- NOTE: I don't suggest this solution unless you feel comfortable with PHP, and also because you may run into other issues in other parts of the code due to some incompatibilities between elastic 5.6 and the elastica version shipped with REL1_29. DCausse (WMF) (talk) 18:28, 11 January 2018 (UTC)
- Would it work if I use Elastica (extension) from master or 1.30? or CirrusSearch 1.30? (with MediaWiki 1.29)
- I was planning to upgrade MediaWiki soon, but wanted to prioritize the search before upgrading. Ciencia Al Poder (talk) 18:47, 11 January 2018 (UTC)
- Sadly after a quick check the elastica version used by the Elastica extension on REL1_30 is still 5.1.0 and 5.3.0 is needed. (https://github.com/wikimedia/mediawiki-extensions-Elastica/blob/REL1_30/composer.json#L21)
- The future 1.31 will have the proper version. DCausse (WMF) (talk) 19:19, 11 January 2018 (UTC)
- Ok, so I have to downgrade ES to 5.3. It would be good to clarify that on the Extension:CirrusSearch#Dependencies section, since it currently says 1.29 requires ES 5.3+. I even installed ES 6 before, but then the extension itself said it was not compatible. Ciencia Al Poder (talk) 20:20, 11 January 2018 (UTC)
- Sure, thanks again for your feedback. DCausse (WMF) (talk) 09:46, 12 January 2018 (UTC)
Suggestion: Add uploader / author to search results snippet
Issue:
As a user I'd like to see who uploaded a specific file so that I can see more of their content (without digging through file history).
Proposed solution
- Add an author to all search results (including regular pages); and / or
- For files, add only the most recent uploader. 197.218.91.135 (talk) 10:42, 14 February 2018 (UTC)
Suggestion: Make it possible to search by page author /contributor/ uploader
Problems
- As a user, I'd like to discover more files or content made / created by a specific user.
- As a user, I'd like to find specific content without paging through Special:listfiles.
Background
Currently there is no way to restrict search results to pages uploaded or created by a specific user. Paging through special:listfiles is not an activity any sane person would do for users with massive uploads, e.g. Special:ListFiles&dir=prev&user=Ruthven. Attempting to view massive new pages by a specific user (special:contribs) will also result in a timeout on a big enough wiki, especially if the namespace parameter is used.
Also, for regular pages, this provides a sensible and easy interface to see and count all (existing) pages created by a user as this would naturally include the matches.
Other use cases:
- Looking into discussions (Talk pages) participated
- Looking into pages they created with a specific keyword
- Readers looking into interesting pages or media initially created by a specific contributor
- Anti-vandalism - looking into pages created / edited by a specific user and containing a specific term.
Proposed solution
- Add a new search keyword "author:", e.g. "author:User1"; AND
- Add a new search keyword "contributor:" to list all pages a edited by particular user;
- Possibly make it possible to include more than one author, e.g. "author:User1|user2|..." or alternatively "author:User1 author:User2"
Note: A file page may be created before a file upload (by another user). So there may be a need to distinguish between an uploader and a file page creator. 197.218.91.135 (talk) 11:08, 14 February 2018 (UTC)
- Hm, this is an interesting proposal. Given that it's a more contributor-focused tool, I wonder if this might be more appropriate for the AdvancedSearch project Wikimedia Deutschland is working on. I don't think "contributor" is a current field, but it might be a welcome suggestion.
- It also sounds a little like maybe an updated Special:Contributions or Special:ListFiles would do a better job than CirrusSearch, given that Special:Search is so general and broad. If other folks from the Search team are reading this, please tell me if I'm wrong!
- IP, I'd be happy to create a phab task or two if you think that would be helpful.
- Humor: Or maybe we need a one-sided Interaction timeline! :) CKoerner (WMF) (talk) 20:30, 20 February 2018 (UTC)
- My guess is that perspective is based on a Wikipedia-centric view.
- The "contributor" keyword might be more related to editors, but readers are 100% interested in knowing the creator / uploader of a file or page in certain contexts. For instance, in wikibooks, one may be interested in stories (pages) created / published by a specific user. While for wikipedia itself, it often doesn't matter who created the page, knowing who uploaded a specific file is still useful, perhaps a particular user uploads images of new species of animals or some other interesting topic. That is entirely distinct from the photographers, who the uploader may or may not know, the reader may still be interested in seeing more of those rather than simply finding out who photographed one particular creature.
- In the "real" world, it is also very common for people to buy (read / view) books / movies from the same author / writers, exactly because they appreciate their expertise and / or writing style.
- > Special:Contributions or Special:ListFiles would do a better job than CirrusSearch, given that Special:Search is so general and broad.
- Not really. Remember that special:search gives the powerful ability to add extra filters that neither listfiles nor contributions will likely ever have, e.g. "keyword, title, geoip", etc. Also while people do enjoy deceiving themselves, the average person can't deal with vast amounts of data. Those pages have close to infinite paging as a poor man's alternative to the lack of a proper filtering capability.
- > IP, I'd be happy to create a phab task or two if you think that would be helpful.
- Feel free to create them. The feature suggestion is still valid, in my opinion. 197.218.84.219 (talk) 21:19, 22 February 2018 (UTC)
- Fair enough. :)
- I filed a task: https://phabricator.wikimedia.org/T188125
- > the average person can't deal with vast amounts of data.
- An author (to your exact example!) I enjoy once talked/wrote about this. He called it a problem of "filter failure".
- Oh, and since I can't Special:Thank you, let me just state it plainly. Thank you. CKoerner (WMF) (talk) 18:10, 23 February 2018 (UTC)
Version with MediaWiki 1.28
Hi,
Product | Version |
---|---|
MediaWiki | 1.28.2 |
PHP | 7.1.6 (apache2handler) |
MariaDB | 10.1.24-MariaDB |
ICU | 4.8.1.1 |
Elasticsearch | 2.4.5 |
In the download page (Special:ExtensionDistributor/CirrusSearch) I can only download CirrusSearch for versions 27, 29 and 30.
I have tried to install CirrusSearch with versions 27 and 30, but when I execute updateSearchIndexConfig.php it says that Elasticsearch version 2.4.5 is not supported.
On the CirrusSearch (Extension:CirrusSearch#Dependencies) page it says
- MediaWiki 1.28.x requires ElasticSearch 2.x.
Where can I download CirrusSearch for MediaWiki 1.28 and Elasticsearch 2.4.5?
Thanks 195.55.236.138 (talk) 09:55, 7 March 2018 (UTC)
- Use the version from the 1.28 branch. This should work.
- Still I believe it will be best for you to upgrade both MediaWiki and Elasticsearch to supported versions. [[kgh]] (talk) 15:09, 7 March 2018 (UTC)
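If fetching by branch is easier than the extension distributor snapshots, the standard Gerrit clone URLs work too (a sketch; adjust the branch to your MediaWiki version):
git clone -b REL1_28 https://gerrit.wikimedia.org/r/mediawiki/extensions/CirrusSearch
git clone -b REL1_28 https://gerrit.wikimedia.org/r/mediawiki/extensions/Elastica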
- Thanks for your response @Kghbln. Unluckily I am not allowed to upgrade MediaWiki.
- I have downloaded the version from the branch as you said, but now this error appears. I have searched for this error and seen other people with the same one, but no solutions :(
- content index...
- Fetching Elasticsearch version...2.4.5...ok
- Scanning available plugins...none
- Inferring index identifier...[9034f831a5ee88edf680c617] [no req] Error from line 34 of /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Exception/ResponseException.php: Wrong parameters for Elastica\Exception\ResponseException([string $message [, long $code [, Throwable $previous = NULL]]])
- Backtrace:
- #0 /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Exception/ResponseException.php(34): Exception->__construct(array)
- #1 /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php(159): Elastica\Exception\ResponseException->__construct(Elastica\Request, Elastica\Response)
- #2 /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Request.php(171): Elastica\Transport\Http->exec(Elastica\Request, array)
- #3 /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(621): Elastica\Request->send()
- #4 /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Status.php(163): Elastica\Client->request(string, string)
- #5 /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Status.php(45): Elastica\Status->refresh()
- #6 /opt/lampp/htdocs/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(454): Elastica\Status->__construct(Elastica\Client)
- #7 /opt/lampp/htdocs/wiki/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php(109): Elastica\Client->getStatus()
- #8 /opt/lampp/htdocs/wiki/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php(78): CirrusSearch\Maintenance\ConfigUtils->getAllIndicesByType(string)
- #9 /opt/lampp/htdocs/wiki/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php(260): CirrusSearch\Maintenance\ConfigUtils->pickIndexIdentifierFromOption(string, string)
- #10 /opt/lampp/htdocs/wiki/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(58): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->execute()
- #11 /opt/lampp/htdocs/wiki/maintenance/doMaintenance.php(111): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute()
- #12 /opt/lampp/htdocs/wiki/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(65): require_once(string)
- #13 {main} 195.55.236.138 (talk) 09:11, 8 March 2018 (UTC)
- I guess you are out of luck. I cannot tell what is wrong and I doubt that the developers will address issues for unsupported branches of MediaWiki. As a matter of fact I would already be happy if that was done for supported branches of MediaWiki. However, one never knows. [[kgh]] (talk) 12:28, 8 March 2018 (UTC)
- This looks like a bug in Elastica itself. Could you make sure that the Elastica extension is also on 1.28 and that composer update has been run properly? DCausse (WMF) (talk) 15:53, 9 March 2018 (UTC)
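A minimal sketch of that check, assuming the extensions live under extensions/ and were fetched with git:
cd extensions/Elastica
git checkout REL1_28        # match the MediaWiki branch
composer install --no-dev   # installs the ruflin/elastica version pinned for this branch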
Suggestion: Include creation date for files (including exif date) / pages in search result snippet (metadata)
Issue:
As a user, I expect the file upload date, and page creation date to be included in search results.
Background
As a user recently looking through files, I was confused about the date in the search results for files. I expected it to be the most recent upload date, but instead it is the last time the related file page was edited.
Other use cases:
- Verifying whether the information in the snippet is actually accurate. A page last edited 5 minutes ago is likely to contain a lot of inaccuracies and potential misinformation.
- An image upload date can easily be used to evaluate information, e.g. an image caption incorrectly claiming X happened on Y date.
Proposed solution
Include metadata in search results snippet (where appropriate / available):
- Exif date - for latest media upload
- Upload date - for latest media upload
- Page creation date 197.218.88.6 (talk) 11:24, 11 March 2018 (UTC)
Issue with MW 1.30 and SPARQL client class
Hi,
I have recently updated my wiki from 1.27.4 to 1.30. Since I was using CirrusSearch I had to update elasticsearch according to the new version. So now my installation is the following:
Product | Version |
---|---|
MediaWiki | 1.30.0 |
PHP | 7.0.25-0ubuntu0.16.04.1 (apache2handler) |
MySQL | 5.7.21-0ubuntu0.16.04.1 |
ICU | 55.1 |
Elasticsearch | 5.4.3 |
Lua | 5.1.5 |
I have installed the master version of both CirrusSearch and Elastica, updated composer and LocalSettings.php. Now, when I make a search on my wiki I get this error:
[aa787554751705ac2246e772] /mediawiki/index.php?title=Special%3ASearch&search=kircher&go=Go Error from line 14 of /var/lib/mediawiki/extensions/CirrusSearch/includes/ServiceWiring.php: Class 'MediaWiki\Sparql\SparqlClient' not found
Backtrace:
#0 [internal function]: MediaWiki\Services\ServiceContainer->{closure}(MediaWiki\MediaWikiServices)
#1 /var/lib/mediawiki/includes/services/ServiceContainer.php(360): call_user_func_array(Closure, array)
#2 /var/lib/mediawiki/includes/services/ServiceContainer.php(344): MediaWiki\Services\ServiceContainer->createService(string)
#3 /var/lib/mediawiki/extensions/CirrusSearch/includes/Parser/FullTextKeywordRegistry.php(77): MediaWiki\Services\ServiceContainer->getService(string)
#4 /var/lib/mediawiki/extensions/CirrusSearch/includes/Searcher.php(276): CirrusSearch\Parser\FullTextKeywordRegistry->__construct(CirrusSearch\SearchConfig)
#5 /var/lib/mediawiki/extensions/CirrusSearch/includes/Searcher.php(318): CirrusSearch\Searcher->buildFullTextSearch(string, boolean)
#6 /var/lib/mediawiki/extensions/CirrusSearch/includes/CirrusSearch.php(384): CirrusSearch\Searcher->searchText(string, boolean)
#7 /var/lib/mediawiki/extensions/CirrusSearch/includes/CirrusSearch.php(175): CirrusSearch->searchTextReal(string, CirrusSearch\SearchConfig)
#8 /var/lib/mediawiki/includes/specials/SpecialSearch.php(319): CirrusSearch->searchText(string)
#9 /var/lib/mediawiki/includes/specials/SpecialSearch.php(185): SpecialSearch->showResults(string)
#10 /var/lib/mediawiki/includes/specialpage/SpecialPage.php(522): SpecialSearch->execute(NULL)
#11 /var/lib/mediawiki/includes/specialpage/SpecialPageFactory.php(578): SpecialPage->run(NULL)
#12 /var/lib/mediawiki/includes/MediaWiki.php(287): SpecialPageFactory::executePath(Title, RequestContext)
#13 /var/lib/mediawiki/includes/MediaWiki.php(851): MediaWiki->performRequest()
#14 /var/lib/mediawiki/includes/MediaWiki.php(523): MediaWiki->main()
#15 /var/lib/mediawiki/index.php(43): MediaWiki->run()
#16 {main}
It seems there is an issue with the mediawiki SPARQL client; I have read something about this in a recent discussion, but it doesn't help me. Any ideas about how to solve this issue?
Thanks,
Lorenzo Loman87 (talk) 11:30, 14 March 2018 (UTC)
- Are you sure that you are using CirrusSearch on REL1_30? SparqlClient was added in 1.31 in both core and CirrusSearch. DCausse (WMF) (talk) 09:51, 15 March 2018 (UTC)
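One way to double-check which version a checkout actually is (a sketch assuming the extension was fetched with git rather than as a tarball):
cd extensions/CirrusSearch
git rev-parse --abbrev-ref HEAD   # should print REL1_30 rather than master
git log -1 --oneline              # newest commit on that checkout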
- Hi,
- thanks for your answer. I downloaded CirrusSearch using the extension distributor, so I guess it is the right version. Anyway I will do some other attempts and see what happens... Loman87 (talk) 13:03, 21 March 2018 (UTC)
number_format_exception: For input string: "0,7" (solved)
RESOLVED
Bug in cirrus: T189877, will be backported to 1.30 soon; see the thread for a workaround.
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Hi! I have a trouble after upgrade my MW to 1.30.
Search backend error during prefix search for 'search query here' after 3: number_format_exception: For input string: "0,7"
I deleted the indices in elasticsearch, then created them again following the instructions, but I still have this error :(
Need help :(
Product | Version |
---|---|
MediaWiki | 1.30.0 |
PHP | 7.1.14 (fpm-fcgi) |
MariaDB | 10.1.31-MariaDB-1~xenial |
Elasticsearch | 5.3.3 |
- Or we can receive
- Search backend error during full_text search for 'gdfg' after 2: number_format_exception: For input string: "0,5"
- Why does it happen? StRiANON (talk) 11:42, 24 March 2018 (UTC)
- Have you made any change to the CirrusSearch configuration? I wonder if there are some weights passed to elastic that use a comma instead of a period as the decimal separator.
- One way to help us determine if the problem is related to number format would be to paste your config (it can be dumped using api.php?action=cirrus-config-dump).
- If the error happens for fulltext search, could you also paste the output of the search result page, adding &cirrusDumpQuery to the search URL.
- Thanks! DCausse (WMF) (talk) 13:54, 24 March 2018 (UTC)
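For reference, both dumps can also be fetched from the command line; the wiki URL below is only a placeholder:
# dump the effective CirrusSearch configuration
curl -s 'https://wiki.example.org/w/api.php?action=cirrus-config-dump&format=json'
# dump the elasticsearch query built for a fulltext search
curl -s 'https://wiki.example.org/w/index.php?search=gdfg&fulltext=1&cirrusDumpQuery'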
- You can see it here
- I didn't change CirrusSearch params, and for elasticsearch added only a few general rules:
script.inline: true
script.stored: true
action.auto_create_index: false
- And the fulltext error's result: https://pastebin.com/sT3Vydx7 StRiANON (talk) 14:14, 24 March 2018 (UTC)
- While your config seems sane, I see a weight with a comma in the fulltext query:
"weight": "0,2"
- This might cause issues on the elastic side. I suspect a bug in cirrus or some underlying library that transforms this weight to a string using the system locale.
- Out of curiosity: is your system using a LOCALE set to something that uses a comma for the decimal separator?
- A quick workaround would be to set:
$wgCirrusSearchDefaultNamespaceWeight = 1;
$wgCirrusSearchTalkNamespaceWeight = 1;
- So that we stick to non-decimal numbers.
- I may have found the culprit in Cirrus code, I'll followup there with a fix.
- Thanks for your report. DCausse (WMF) (talk) 14:35, 24 March 2018 (UTC)
- Finally detected this trouble. Thanks for the idea about the locale. The problem was in $wgShellLocale, whose behavior changed in 1.30: it now affects LC_ALL instead of LC_CTYPE as previously, so the decimal separator now leaks into scripts. I just removed this param and now all is ok. StRiANON (talk) 17:40, 24 March 2018 (UTC)
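For anyone hitting this later: PHP's float-to-string conversion honored the process locale until PHP 8.0 changed this, so under a comma-decimal locale a weight of 0.2 is serialized as "0,2". A minimal demonstration from a shell, assuming a comma-decimal locale such as de_DE.UTF-8 is installed:
php -r 'setlocale(LC_ALL, "de_DE.UTF-8"); echo 0.2;'   # prints 0,2
php -r 'setlocale(LC_ALL, "C"); echo 0.2;'             # prints 0.2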
- No, my locale is en_US.UTF-8, checked with printf - it uses a dot.
- Unfortunately, the solution didn't help :( I added these two params, then deleted and created the indices again - still the same error.
- Then I added more rules for search weights:
- And again deleted and created the indices. And still have this trouble - look here o.O StRiANON (talk) 16:41, 24 March 2018 (UTC)
$wgCirrusSearchDefaultNamespaceWeight = 1;
$wgCirrusSearchTalkNamespaceWeight = 1;
$wgCirrusSearchWeights = [
	'title' => 20,
	'redirect' => 15,
	'category' => 8,
	'heading' => 5,
	'opening_text' => 3,
	'text' => 1,
	'auxiliary_text' => 1,
	'file_text' => 1,
];
$wgCirrusSearchPrefixWeights = [
	'title' => 10,
	'redirect' => 1,
	'title_asciifolding' => 7,
	'redirect_asciifolding' => 1,
];
Support for Elasticsearch 6.x.x?
Hi there,
I am running a modern instance of Elasticsearch, specifically, 6.2.3. I noticed that only 5.x.x versions of Elasticsearch are supported with this extension. Are there any plans to bring the extension up-to-date with the new generation of Elastic?
Cheers! TorontonianOnlines (talk) 19:41, 26 March 2018 (UTC)
- The docu says "MediaWiki 1.31.x requires ElasticSearch 5.5+." I read this in a way that 6.x.x will be supported via CirrusSearch for MW 1.31 which is due end of May. Keeping fingers crossed. [[kgh]] (talk) 20:14, 26 March 2018 (UTC)
- That would be wonderful news! As is, I have been informed I am not allowed to use Cirrus at my org. TorontonianOnlines (talk) 20:23, 26 March 2018 (UTC)
- Elastic does not guarantee compatibility between major versions. In fact it's nearly impossible for us to support multiple major versions of elastic (there are too many breaking changes).
- So I'm sorry to say that no, MW 1.31 won't support elastic 6.x :(
- Back to the original question: yes, we have plans to upgrade to elastic 6.x, but the timeline is not yet very precise. DCausse (WMF) (talk) 08:56, 27 March 2018 (UTC)
- Thanks for clarifying. Apparently I was in high hopes for MW 1.31+ because of 6.x. :| I just fixed the docu. [[kgh]] (talk) 09:04, 27 March 2018 (UTC)
updateSearchIndexConfig.php ( Elastic Search version 5.3 ) and MW 1.30
RESOLVED
Bug in cirrus, see T191493.
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
We have been trying to execute updateSearchIndexConfig.php and it fails with the following:
php updateSearchIndexConfig.php
content index...
Fetching Elasticsearch version...5.3.2...ok
Scanning available plugins...PHP Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
PHP Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
PHP Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
none
Inferring index identifier...bitnami_mediawiki_content_first
Picking analyzer...english
Validating number of shards...ok
Validating replica range...ok
Validating shard allocation settings...done
Validating max shards per node...ok
Validating analyzers...ok
Validating mappings...
Validating mapping...ok
Validating aliases...
Validating bitnami_mediawiki_content alias...ok
Validating bitnami_mediawiki alias...ok
Updating tracking indexes...done
general index...
Fetching Elasticsearch version...5.3.2...ok
Scanning available plugins...PHP Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
PHP Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
PHP Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
Warning: Invalid argument supplied for foreach() in /opt/bitnami/apps/mediawiki/htdocs/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php on line 130
none
Inferring index identifier...bitnami_mediawiki_general_first
Picking analyzer...english
Validating number of shards...ok
Validating replica range...ok
Validating shard allocation settings...done
Validating max shards per node...ok
Validating analyzers...ok
Validating mappings...
Validating mapping...ok
Validating aliases...
Validating bitnami_mediawiki_general alias...ok
Validating bitnami_mediawiki alias...ok
Updating tracking indexes...done
Deleting namespaces...done
Indexing namespaces...done 164.144.248.26 (talk) 15:45, 3 April 2018 (UTC)
- If this is possible for you, could you paste the output of the command:
curl -s localhost:9200/_nodes?pretty
- replace localhost with the hostname of one node of your elasticsearch cluster.
- Thanks. DCausse (WMF) (talk) 08:37, 4 April 2018 (UTC)
- $ curl -s vpc-np-es-psd.us-east-1.ps.amazonaws.com:80/_nodes?pretty
- {
- "_nodes" : {
- "total" : 3,
- "successful" : 3,
- "failed" : 0
- },
- "cluster_name" : "265365382492:haystack-np-es",
- "nodes" : {
- "mpy2CEk2TLWnBUGbSExtJA" : {
- "name" : "mpy2CEk",
- "version" : "5.3.2",
- "build_hash" : "Unknown",
- "total_indexing_buffer" : 427753472,
- "roles" : [ "master", "data", "ingest" ],
- "os" : {
- "refresh_interval_in_millis" : 1000,
- "available_processors" : 2,
- "allocated_processors" : 2
- },
- "process" : {
- "refresh_interval_in_millis" : 1000,
- "id" : 9734,
- "mlockall" : true
- },
- "jvm" : {
- "pid" : 9734,
- "start_time_in_millis" : 1522089530084,
- "mem" : {
- "heap_init_in_bytes" : 4294967296,
- "heap_max_in_bytes" : 4277534720,
- "non_heap_init_in_bytes" : 2555904,
- "non_heap_max_in_bytes" : 0,
- "direct_max_in_bytes" : 4277534720
- },
- "using_compressed_ordinary_object_pointers" : "true"
- },
- "thread_pool" : {
- "force_merge" : {
- "type" : "fixed",
- "min" : 1,
- "max" : 1,
- "queue_size" : -1
- },
- "fetch_shard_started" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 4,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "listener" : {
- "type" : "fixed",
- "min" : 1,
- "max" : 1,
- "queue_size" : -1
- },
- "index" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 200
- },
- "refresh" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "generic" : {
- "type" : "scaling",
- "min" : 4,
- "max" : 128,
- "keep_alive" : "30s",
- "queue_size" : -1
- },
- "warmer" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "search" : {
- "type" : "fixed",
- "min" : 4,
- "max" : 4,
- "queue_size" : 1000
- },
- "flush" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "fetch_shard_store" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 4,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "management" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 5,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "get" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 1000
- },
- "bulk" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 200
- },
- "snapshot" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- }
- },
- "modules" : [ {
- "name" : "aggs-matrix-stats",
- "version" : "5.3.2",
- "description" : "Adds aggregations whose input are a list of numeric fields and output includes a matrix.",
- "classname" : "org.elasticsearch.search.aggregations.matrix.MatrixAggregationPlugin"
- }, {
- "name" : "ingest-common",
- "version" : "5.3.2",
- "description" : "Module for ingest processors that do not require additional security permissions or have large dependencies and resources",
- "classname" : "org.elasticsearch.ingest.common.IngestCommonPlugin"
- }, {
- "name" : "lang-expression",
- "version" : "5.3.2",
- "description" : "Lucene expressions integration for Elasticsearch",
- "classname" : "org.elasticsearch.script.expression.ExpressionPlugin"
- }, {
- "name" : "lang-mustache",
- "version" : "5.3.2",
- "description" : "Mustache scripting integration for Elasticsearch",
- "classname" : "org.elasticsearch.script.mustache.MustachePlugin"
- }, {
- "name" : "lang-painless",
- "version" : "5.3.2",
- "description" : "An easy, safe and fast scripting language for Elasticsearch",
- "classname" : "org.elasticsearch.painless.PainlessPlugin"
- }, {
- "name" : "percolator",
- "version" : "5.3.2",
- "description" : "Percolator module adds capability to index queries and query these queries by specifying documents",
- "classname" : "org.elasticsearch.percolator.PercolatorPlugin"
- }, {
- "name" : "reindex",
- "version" : "5.3.2",
- "description" : "The Reindex module adds APIs to reindex from one index to another or update documents in place.",
- "classname" : "org.elasticsearch.index.reindex.ReindexPlugin"
- }, {
- "name" : "transport-netty3",
- "version" : "5.3.2",
- "description" : "Netty 3 based transport implementation",
- "classname" : "org.elasticsearch.transport.Netty3Plugin"
- }, {
- "name" : "transport-netty4",
- "version" : "5.3.2",
- "description" : "Netty 4 based transport implementation",
- "classname" : "org.elasticsearch.transport.Netty4Plugin"
- } ],
- "ingest" : {
- "processors" : [ {
- "type" : "append"
- }, {
- "type" : "attachment"
- }, {
- "type" : "convert"
- }, {
- "type" : "date"
- }, {
- "type" : "date_index_name"
- }, {
- "type" : "dot_expander"
- }, {
- "type" : "fail"
- }, {
- "type" : "foreach"
- }, {
- "type" : "grok"
- }, {
- "type" : "gsub"
- }, {
- "type" : "join"
- }, {
- "type" : "json"
- }, {
- "type" : "kv"
- }, {
- "type" : "lowercase"
- }, {
- "type" : "remove"
- }, {
- "type" : "rename"
- }, {
- "type" : "script"
- }, {
- "type" : "set"
- }, {
- "type" : "sort"
- }, {
- "type" : "split"
- }, {
- "type" : "trim"
- }, {
- "type" : "uppercase"
- }, {
- "type" : "user_agent"
- } ]
- }
- },
- "LLLQg4hgTtu2FEmnV_inTA" : {
- "name" : "LLLQg4h",
- "version" : "5.3.2",
- "build_hash" : "Unknown",
- "total_indexing_buffer" : 427753472,
- "roles" : [ "master", "data", "ingest" ],
- "os" : {
- "refresh_interval_in_millis" : 1000,
- "available_processors" : 2,
- "allocated_processors" : 2
- },
- "process" : {
- "refresh_interval_in_millis" : 1000,
- "id" : 9842,
- "mlockall" : true
- },
- "jvm" : {
- "pid" : 9842,
- "start_time_in_millis" : 1522089504901,
- "mem" : {
- "heap_init_in_bytes" : 4294967296,
- "heap_max_in_bytes" : 4277534720,
- "non_heap_init_in_bytes" : 2555904,
- "non_heap_max_in_bytes" : 0,
- "direct_max_in_bytes" : 4277534720
- },
- "using_compressed_ordinary_object_pointers" : "true"
- },
- "thread_pool" : {
- "force_merge" : {
- "type" : "fixed",
- "min" : 1,
- "max" : 1,
- "queue_size" : -1
- },
- "fetch_shard_started" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 4,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "listener" : {
- "type" : "fixed",
- "min" : 1,
- "max" : 1,
- "queue_size" : -1
- },
- "index" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 200
- },
- "refresh" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "generic" : {
- "type" : "scaling",
- "min" : 4,
- "max" : 128,
- "keep_alive" : "30s",
- "queue_size" : -1
- },
- "warmer" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "search" : {
- "type" : "fixed",
- "min" : 4,
- "max" : 4,
- "queue_size" : 1000
- },
- "flush" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "fetch_shard_store" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 4,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "management" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 5,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "get" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 1000
- },
- "bulk" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 200
- },
- "snapshot" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- }
- },
- "modules" : [ {
- "name" : "aggs-matrix-stats",
- "version" : "5.3.2",
- "description" : "Adds aggregations whose input are a list of numeric fields and output includes a matrix.",
- "classname" : "org.elasticsearch.search.aggregations.matrix.MatrixAggregationPlugin"
- }, {
- "name" : "ingest-common",
- "version" : "5.3.2",
- "description" : "Module for ingest processors that do not require additional security permissions or have large dependencies and resources",
- "classname" : "org.elasticsearch.ingest.common.IngestCommonPlugin"
- }, {
- "name" : "lang-expression",
- "version" : "5.3.2",
- "description" : "Lucene expressions integration for Elasticsearch",
- "classname" : "org.elasticsearch.script.expression.ExpressionPlugin"
- }, {
- "name" : "lang-mustache",
- "version" : "5.3.2",
- "description" : "Mustache scripting integration for Elasticsearch",
- "classname" : "org.elasticsearch.script.mustache.MustachePlugin"
- }, {
- "name" : "lang-painless",
- "version" : "5.3.2",
- "description" : "An easy, safe and fast scripting language for Elasticsearch",
- "classname" : "org.elasticsearch.painless.PainlessPlugin"
- }, {
- "name" : "percolator",
- "version" : "5.3.2",
- "description" : "Percolator module adds capability to index queries and query these queries by specifying documents",
- "classname" : "org.elasticsearch.percolator.PercolatorPlugin"
- }, {
- "name" : "reindex",
- "version" : "5.3.2",
- "description" : "The Reindex module adds APIs to reindex from one index to another or update documents in place.",
- "classname" : "org.elasticsearch.index.reindex.ReindexPlugin"
- }, {
- "name" : "transport-netty3",
- "version" : "5.3.2",
- "description" : "Netty 3 based transport implementation",
- "classname" : "org.elasticsearch.transport.Netty3Plugin"
- }, {
- "name" : "transport-netty4",
- "version" : "5.3.2",
- "description" : "Netty 4 based transport implementation",
- "classname" : "org.elasticsearch.transport.Netty4Plugin"
- } ],
- "ingest" : {
- "processors" : [ {
- "type" : "append"
- }, {
- "type" : "attachment"
- }, {
- "type" : "convert"
- }, {
- "type" : "date"
- }, {
- "type" : "date_index_name"
- }, {
- "type" : "dot_expander"
- }, {
- "type" : "fail"
- }, {
- "type" : "foreach"
- }, {
- "type" : "grok"
- }, {
- "type" : "gsub"
- }, {
- "type" : "join"
- }, {
- "type" : "json"
- }, {
- "type" : "kv"
- }, {
- "type" : "lowercase"
- }, {
- "type" : "remove"
- }, {
- "type" : "rename"
- }, {
- "type" : "script"
- }, {
- "type" : "set"
- }, {
- "type" : "sort"
- }, {
- "type" : "split"
- }, {
- "type" : "trim"
- }, {
- "type" : "uppercase"
- }, {
- "type" : "user_agent"
- } ]
- }
- },
- "Zjmu23EvSkmxx2Bp2D6Tpw" : {
- "name" : "Zjmu23E",
- "version" : "5.3.2",
- "build_hash" : "Unknown",
- "total_indexing_buffer" : 427753472,
- "roles" : [ "master", "data", "ingest" ],
- "os" : {
- "refresh_interval_in_millis" : 1000,
- "available_processors" : 2,
- "allocated_processors" : 2
- },
- "process" : {
- "refresh_interval_in_millis" : 1000,
- "id" : 9785,
- "mlockall" : true
- },
- "jvm" : {
- "pid" : 9785,
- "start_time_in_millis" : 1522089516908,
- "mem" : {
- "heap_init_in_bytes" : 4294967296,
- "heap_max_in_bytes" : 4277534720,
- "non_heap_init_in_bytes" : 2555904,
- "non_heap_max_in_bytes" : 0,
- "direct_max_in_bytes" : 4277534720
- },
- "using_compressed_ordinary_object_pointers" : "true"
- },
- "thread_pool" : {
- "force_merge" : {
- "type" : "fixed",
- "min" : 1,
- "max" : 1,
- "queue_size" : -1
- },
- "fetch_shard_started" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 4,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "listener" : {
- "type" : "fixed",
- "min" : 1,
- "max" : 1,
- "queue_size" : -1
- },
- "index" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 200
- },
- "refresh" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "generic" : {
- "type" : "scaling",
- "min" : 4,
- "max" : 128,
- "keep_alive" : "30s",
- "queue_size" : -1
- },
- "warmer" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "search" : {
- "type" : "fixed",
- "min" : 4,
- "max" : 4,
- "queue_size" : 1000
- },
- "flush" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "fetch_shard_store" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 4,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "management" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 5,
- "keep_alive" : "5m",
- "queue_size" : -1
- },
- "get" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 1000
- },
- "bulk" : {
- "type" : "fixed",
- "min" : 2,
- "max" : 2,
- "queue_size" : 200
- },
- "snapshot" : {
- "type" : "scaling",
- "min" : 1,
- "max" : 1,
- "keep_alive" : "5m",
- "queue_size" : -1
- }
- },
- "modules" : [ {
- "name" : "aggs-matrix-stats",
- "version" : "5.3.2",
- "description" : "Adds aggregations whose input are a list of numeric fields and output includes a matrix.",
- "classname" : "org.elasticsearch.search.aggregations.matrix.MatrixAggregationPlugin"
- }, {
- "name" : "ingest-common",
- "version" : "5.3.2",
- "description" : "Module for ingest processors that do not require additional security permissions or have large dependencies and resources",
- "classname" : "org.elasticsearch.ingest.common.IngestCommonPlugin"
- }, {
- "name" : "lang-expression",
- "version" : "5.3.2",
- "description" : "Lucene expressions integration for Elasticsearch",
- "classname" : "org.elasticsearch.script.expression.ExpressionPlugin"
- }, {
- "name" : "lang-mustache",
- "version" : "5.3.2",
- "description" : "Mustache scripting integration for Elasticsearch",
- "classname" : "org.elasticsearch.script.mustache.MustachePlugin"
- }, {
- "name" : "lang-painless",
- "version" : "5.3.2",
- "description" : "An easy, safe and fast scripting language for Elasticsearch",
- "classname" : "org.elasticsearch.painless.PainlessPlugin"
- }, {
- "name" : "percolator",
- "version" : "5.3.2",
- "description" : "Percolator module adds capability to index queries and query these queries by specifying documents",
- "classname" : "org.elasticsearch.percolator.PercolatorPlugin"
- }, {
- "name" : "reindex",
- "version" : "5.3.2",
- "description" : "The Reindex module adds APIs to reindex from one index to another or update documents in place.",
- "classname" : "org.elasticsearch.index.reindex.ReindexPlugin"
- }, {
- "name" : "transport-netty3",
- "version" : "5.3.2",
- "description" : "Netty 3 based transport implementation",
- "classname" : "org.elasticsearch.transport.Netty3Plugin"
- }, {
- "name" : "transport-netty4",
- "version" : "5.3.2",
- "description" : "Netty 4 based transport implementation",
- "classname" : "org.elasticsearch.transport.Netty4Plugin"
- } ],
- "ingest" : {
- "processors" : [ {
- "type" : "append"
- }, {
- "type" : "attachment"
- }, {
- "type" : "convert"
- }, {
- "type" : "date"
- }, {
- "type" : "date_index_name"
- }, {
- "type" : "dot_expander"
- }, {
- "type" : "fail"
- }, {
- "type" : "foreach"
- }, {
- "type" : "grok"
- }, {
- "type" : "gsub"
- }, {
- "type" : "join"
- }, {
- "type" : "json"
- }, {
- "type" : "kv"
- }, {
- "type" : "lowercase"
- }, {
- "type" : "remove"
- }, {
- "type" : "rename"
- }, {
- "type" : "script"
- }, {
- "type" : "set"
- }, {
- "type" : "sort"
- }, {
- "type" : "split"
- }, {
- "type" : "trim"
- }, {
- "type" : "uppercase"
- }, {
- "type" : "user_agent"
- } ]
- }
- }
- } 164.144.252.28 (talk) 17:35, 4 April 2018 (UTC)
- Thanks, the response does not include the plugins section and this confuses CirrusSearch. I'll create a task to fix this.
- Unless you discovered other problems this should not affect the behavior of Cirrus. DCausse (WMF) (talk) 08:02, 5 April 2018 (UTC)
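For reference, the plugins section can be queried on its own; against a stock cluster the sketch below (assuming localhost:9200) returns a plugins array for each node, which the AWS-hosted endpoint above omits:
curl -s localhost:9200/_nodes/plugins?pretty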
updateSearchIndexConfig.php ( Elastic Search version 5.3 ) and MW 1.30
/CirrusSearch/maintenance$ php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
content index...
Fetching Elasticsearch version...5.3.2...ok
Setting index identifier...bitnami_mediawiki_content_1522961068
Picking analyzer...english
Creating index...⧼Custom Analyzer [plain] failed to find filter under name [preserve_original_recorder]⧽ 164.144.248.27 (talk) 20:48, 5 April 2018 (UTC)
- This error is more problematic, out of curiosity did you just install the analysis-icu plugin?
- I'll file a task since it seems that cirrus wrongly assumes that if the analysis-icu is installed it can use some features provided by another plugin (wmf search-extra).
- We will try to backport the fix to 1.30 but in the meantime a possible workaround would be to install the search-extra plugin by running:
./bin/elasticsearch-plugin install org.wikimedia.search:extra:5.3.2
- On your elasticsearch nodes. DCausse (WMF) (talk) 22:17, 5 April 2018 (UTC)
- We are using an Amazon Elasticsearch domain for this work; we have not installed the plugin and do not have control over the ES domain. Nagaindukuri (talk) 14:20, 6 April 2018 (UTC)
- I see that ICU is supported by amazon and having ICU could explain the issue.
- Could you try to force disable ICU by setting
$wgCirrusSearchUseIcuFolding = 'no';
in your wiki configuration and see if it fixes the issue? DCausse (WMF) (talk) 16:54, 6 April 2018 (UTC)
- maintenance$ php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier=now
- content index...
- Fetching Elasticsearch version...5.3.2...ok
- Scanning available plugins...array(3) {
- Setting index identifier...bitnami_mediawiki_content_1523286431
- Picking analyzer...english
- Creating index...ok
- Validating number of shards...ok
- Validating replica range...ok
- Validating shard allocation settings...done
- Validating max shards per node...ok
- Validating analyzers...ok
- Validating mappings...
- Validating mapping...different...corrected
- Validating aliases...
- Validating bitnami_mediawiki_content alias...is taken...
- Reindexing...
- Unknown reindex failure: 403 164.144.248.29 (talk) 15:09, 9 April 2018 (UTC)
- You seem not to be allowed to use the /_reindex endpoint; could you double-check your cluster settings or check with Amazon support?
- The AWS environment seems to be very restrictive; one workaround for you would be not to use the in-place reindex (--reindexAndRemoveOk) and to always reindex your wiki using:
updateSearchIndexConfig.php --startOver
- then use the forceSearchIndex.php script you used during the initial setup. DCausse (WMF) (talk) 17:15, 9 April 2018 (UTC)
- content index...
- Fetching Elasticsearch version...5.3.2...ok
- Scanning available plugins...
- Inferring index identifier...error
- Looks like the index has more than one identifier. You should delete all
- but the one of them currently active. Here is the list: bitnami_mediawiki_content_1523286244,bitnami_mediawiki_content_1523286240,bitnami_mediawiki_content_first,bitnami_mediawiki_content_1523285364,bitnami_mediawiki_content_1523285326,bitnami_mediawiki_content_1523286431,bitnami_mediawiki_content_1523285727,bitnami_mediawiki_content_1523286151 164.144.248.29 (talk) 18:47, 9 April 2018 (UTC)
- After the above, we ran php forceSearchIndex.php --skipLinks --indexOnSkip and it is still in progress. 164.144.248.29 (talk) 18:58, 9 April 2018 (UTC)
- We are seeing that indexes are not being created after running updateSearchIndexConfig.php --startOver Nagaindukuri (talk) 20:30, 10 April 2018 (UTC)
- Inferring index identifier...bitnami_mediawiki_content_first
- Picking analyzer...english
- Blowing away index to start over...ok
- Validating number of shards...ok
- Validating replica range...ok
- Validating shard allocation settings...done
- Validating max shards per node...ok
- Validating analyzers...ok
- Validating mappings...
- Validating mapping...different...corrected
- Validating aliases...
- Validating bitnami_mediawiki_content alias...alias is free...corrected
- Validating bitnami_mediawiki alias...alias not already assigned to this index...corrected
- Updating tracking indexes...done
- general index...
- Fetching Elasticsearch version...5.3.2...ok
- Scanning available plugins...array(3) {
- Inferring index identifier...bitnami_mediawiki_general_first
- Picking analyzer...english
- Blowing away index to start over...ok
- Validating number of shards...ok
- Validating replica range...ok
- Validating shard allocation settings...done
- Validating max shards per node...ok
- Validating analyzers...ok
- Validating mappings...
- Validating mapping...different...corrected
- Validating aliases...
- Validating bitnami_mediawiki_general alias...alias is free...corrected
- Validating bitnami_mediawiki alias...alias not already assigned to this index...corrected
- Updating tracking indexes...done
- Deleting namespaces...done
- Indexing namespaces...done
- We are currently running forceSearchIndex.php and we will update you .... sorry about this. Nagaindukuri (talk) 20:45, 10 April 2018 (UTC)
Search fails with the return boolean ( Amazon Elastic Search version 5.3 ) MW 1.30
Is there a way to test the search when the backend is Elasticsearch and the search type is CirrusSearch?
[ff7caa56e7c311254efb8e81] /index.php?search=SAP Error from line 474 of /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php: Call to a member function searchContainedSyntax() on boolean
Backtrace:
#0 /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php(384): SpecialSearch->showCreateLink(Title, integer, NULL, boolean)
#1 /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php(185): SpecialSearch->showResults(string)
#2 /opt/bitnami/apps/mediawiki/htdocs/includes/specialpage/SpecialPage.php(522): SpecialSearch->execute(NULL)
#3 /opt/bitnami/apps/mediawiki/htdocs/includes/specialpage/SpecialPageFactory.php(578): SpecialPage->run(NULL)
#4 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(287): SpecialPageFactory::executePath(Title, RequestContext)
#5 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(851): MediaWiki->performRequest()
#6 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(523): MediaWiki->main()
#7 /opt/bitnami/apps/mediawiki/htdocs/index.php(43): MediaWiki->run()
#8 {main} Nagaindukuri (talk) 14:27, 10 April 2018 (UTC)
- https://www.myproject.com/Main_Page?search=biw&title=Special:Search&profile=default&fulltext=1&cirrusDumpResult
- false Nagaindukuri (talk) 16:41, 19 April 2018 (UTC)
CirrusSearch - Special search fails even after creating index
AWS ES - 5.3.2
Bitnami Media wiki 1.30 and Cirrus search extension.
Internal error
[04e462859a269b1b57b048c5] /index.php?search=%22Hary%22 Error from line 474 of /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php: Call to a member function searchContainedSyntax() on boolean
Backtrace:
#0 /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php(384): SpecialSearch->showCreateLink(Title, integer, NULL, boolean)
#1 /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php(185): SpecialSearch->showResults(string)
#2 /opt/bitnami/apps/mediawiki/htdocs/includes/specialpage/SpecialPage.php(522): SpecialSearch->execute(NULL)
#3 /opt/bitnami/apps/mediawiki/htdocs/includes/specialpage/SpecialPageFactory.php(578): SpecialPage->run(NULL)
#4 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(287): SpecialPageFactory::executePath(Title, RequestContext)
#5 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(851): MediaWiki->performRequest()
#6 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(523): MediaWiki->main()
#7 /opt/bitnami/apps/mediawiki/htdocs/index.php(43): MediaWiki->run()
#8 {main} Nagaindukuri (talk) 18:49, 11 April 2018 (UTC)
- It is hard to tell what is happening behind this error.
- Would it be possible for you to investigate further using the debug options provided by CirrusSearch? Appending the following URI params to the search request URI will:
- output the elasticsearch result: &cirrusDumpResult
- output the elasticsearch query sent: &cirrusDumpQuery
- I'd also suggest trying to read the various logs you have access to such as mediawiki logs and elasticsearch logs, they may provide clearer information that would help to debug your issue. DCausse (WMF) (talk) 08:36, 12 April 2018 (UTC)
- We did a query dump and a result dump for Cirrus; here is the output below.
- {
- "description": "full_text search for 'breeding'",
- "path": "bitnami_mediawiki\/page\/_search",
- "params": {
- "timeout": "20s",
- "search_type": "dfs_query_then_fetch"
- },
- "query": {
- "_source": [
- "namespace",
- "title",
- "namespace_text",
- "wiki",
- "redirect.*",
- "timestamp",
- "text_bytes"
- ],
- "stored_fields": [
- "text.word_count"
- ],
- "query": {
- "bool": {
- "minimum_should_match": 1,
- "should": [
- {
- "query_string": {
- "query": "breeding",
- "fields": [
- "all.plain^1",
- "all^0.5"
- ],
- "auto_generate_phrase_queries": true,
- "phrase_slop": 0,
- "default_operator": "AND",
- "allow_leading_wildcard": true,
- "fuzzy_prefix_length": 2,
- "rewrite": "top_terms_boost_1024"
- }
- },
- {
- "multi_match": {
- "fields": [
- "all_near_match^2"
- ],
- "query": "breeding"
- }
- }
- ],
- "filter": [
- {
- "terms": {
- "namespace": [
- 0,
- 1,
- 2,
- 3,
- 4,
- 5,
- 6,
- 7,
- 8,
- 9,
- 10,
- 11,
- 12,
- 13,
- 14,
- 15,
- 3000,
- 3001
- ]
- }
- }
- ]
- }
- },
- "highlight": {
- "pre_tags": [
- "<span class=\"searchmatch\">"
- ],
- "post_tags": [
- "<\/span>"
- ],
- "fields": {
- "title": {
- "number_of_fragments": 0,
- "type": "fvh",
- "order": "score",
- "matched_fields": [
- "title",
- "title.plain"
- ]
- },
- "redirect.title": {
- "number_of_fragments": 1,
- "fragment_size": 10000,
- "type": "fvh",
- "order": "score",
- "matched_fields": [
- "redirect.title",
- "redirect.title.plain"
- ]
- },
- "category": {
- "number_of_fragments": 1,
- "fragment_size": 10000,
- "type": "fvh",
- "order": "score",
- "matched_fields": [
- "category",
- "category.plain"
- ]
- },
- "heading": {
- "number_of_fragments": 1,
- "fragment_size": 10000,
- "type": "fvh",
- "order": "score",
- "matched_fields": [
- "heading",
- "heading.plain"
- ]
- },
- "text": {
- "number_of_fragments": 1,
- "fragment_size": 150,
- "type": "fvh",
- "order": "score",
- "no_match_size": 150,
- "matched_fields": [
- "text",
- "text.plain"
- ]
- },
- "auxiliary_text": {
- "number_of_fragments": 1,
- "fragment_size": 150,
- "type": "fvh",
- "order": "score",
- "matched_fields": [
- "auxiliary_text",
- "auxiliary_text.plain"
- ]
- },
- "file_text": {
- "number_of_fragments": 1,
- "fragment_size": 150,
- "type": "fvh",
- "order": "score",
- "matched_fields": [
- "file_text",
- "file_text.plain"
- ]
- }
- },
- "highlight_query": {
- "query_string": {
- "query": "breeding",
- "fields": [
- "title.plain^20",
- "redirect.title.plain^15",
- "category.plain^8",
- "heading.plain^5",
- "opening_text.plain^3",
- "text.plain^1",
- "auxiliary_text.plain^0.5",
- "file_text.plain^0.5",
- "title^10",
- "redirect.title^7.5",
- "category^4",
- "heading^2.5",
- "opening_text^1.5",
- "text^0.5",
- "auxiliary_text^0.25",
- "file_text^0.25"
- ],
- "auto_generate_phrase_queries": true,
- "phrase_slop": 1,
- "default_operator": "AND",
- "allow_leading_wildcard": true,
- "fuzzy_prefix_length": 2,
- "rewrite": "top_terms_boost_1024"
- }
- }
- },
- "suggest": {
- "text": "breeding",
- "suggest": {
- "phrase": {
- "field": "suggest",
- "size": 1,
- "max_errors": 2,
- "confidence": 2,
- "real_word_error_likelihood": 0.95,
- "direct_generator": [
- {
- "field": "suggest",
- "suggest_mode": "always",
- "max_term_freq": 0.5,
- "min_doc_freq": 0,
- "prefix_length": 2
- }
- ],
- "highlight": {
- "pre_tag": "<em>",
- "post_tag": "<\/em>"
- },
- "smoothing": {
- "stupid_backoff": {
- "discount": 0.4
- }
- }
- }
- }
- },
- "stats": [
- "suggest",
- "full_text",
- "full_text_querystring"
- ],
- "size": 20,
- "rescore": [
- {
- "window_size": 8192,
- "query": {
- "query_weight": 1,
- "rescore_query_weight": 1,
- "score_mode": "multiply",
- "rescore_query": {
- "function_score": {
- "functions": [
- {
- "field_value_factor": {
- "field": "incoming_links",
- "modifier": "log2p",
- "missing": 0
- }
- },
- {
- "weight": 0.25,
- "filter": {
- "terms": {
- "namespace": [
- 1
- ]
- }
- }
- },
- {
- "weight": 0.05,
- "filter": {
- "terms": {
- "namespace": [
- 2,
- 7,
- 8,
- 15,
- 3001
- ]
- }
- }
- },
- {
- "weight": 0.0125,
- "filter": {
- "terms": {
- "namespace": [
- 3,
- 9
- ]
- }
- }
- },
- {
- "weight": 0.1,
- "filter": {
- "terms": {
- "namespace": [
- 4,
- 12
- ]
- }
- }
- },
- {
- "weight": 0.025,
- "filter": {
- "terms": {
- "namespace": [
- 5,
- 13
- ]
- }
- }
- },
- {
- "weight": 0.2,
- "filter": {
- "terms": {
- "namespace": [
- 6,
- 14,
- 3000
- ]
- }
- }
- },
- {
- "weight": 0.005,
- "filter": {
- "terms": {
- "namespace": [
- 10
- ]
- }
- }
- },
- {
- "weight": 0.00125,
- "filter": {
- "terms": {
- "namespace": [
- 11
- ]
- }
- }
- }
- ]
- }
- }
- }
- }
- ]
- },
- "options": {
- "timeout": "20s",
- "search_type": "dfs_query_then_fetch"
- }
- } Nagaindukuri (talk) 15:00, 13 April 2018 (UTC)
- I don't see the output from elastic; were you able to get it by adding &cirrusDumpResult?
- Have you been able to dig into the various logs of MediaWiki and Elasticsearch? DCausse (WMF) (talk) 07:35, 16 April 2018 (UTC)
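- For instance (placeholder hostname), the param is appended to the wiki's Special:Search URL, not to the Elasticsearch endpoint:
https://mywiki.example.org/index.php?title=Special:Search&search=breeding&fulltext=1&cirrusDumpResult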
- We did try that and it was of no help.
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/_cat/indices?v
- > '
- curl: (3) Illegal characters found in URL
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/_cat/indices?v'
- health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
- green open mw_cirrus_metastore hJzCVBxfSYydggh-kMdviA 5 1 3 2 43.5kb 21.8kb
- green open .kibana fnMbVw4qRgmhONnEaeulMg 1 1 1 0 6.3kb 3.1kb
- green open bitnami_mediawiki_titlesuggest_1523285988 4Sx-yINMQROzxfZPzO97iQ 4 2 0 0 1.5kb 520b
- green open bitnami_mediawiki_content_first CpJo6tFERl6ahJZtAmCTHQ 4 2 0 0 1.5kb 520b
- green open bitnami_mediawiki_general_first SR2QSsH2R8-bHaQd7WS7cw 4 2 19 0 28.5kb 9.5kb
- green open exampleindex xa5hnV13SPuzFVBe__Tccg 5 1 1 0 12.8kb 6.4kb
- green open mw_cirrus_metastore_first yMPJqJ2fRf2ULU7GaFBiGw 1 2 1 0 10.7kb 3.5kb
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/exampleindex/_search?q=user:cooper&pretty'
- {
- "took" : 1,
- "timed_out" : false,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "failed" : 0
- },
- "hits" : {
- "total" : 0,
- "max_score" : null,
- "hits" : [ ]
- }
- }
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/bitnami_mediawiki_general_first/_search?q=user:cooper&pretty'
- {
- "took" : 2,
- "timed_out" : false,
- "_shards" : {
- "total" : 4,
- "successful" : 4,
- "failed" : 0
- },
- "hits" : {
- "total" : 0,
- "max_score" : null,
- "hits" : [ ]
- }
- }
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/bitnami_mediawiki_titlesuggest_1523285988/_search?q=user:cooper&pretty'
- {
- "took" : 9,
- "timed_out" : false,
- "_shards" : {
- "total" : 4,
- "successful" : 4,
- "failed" : 0
- },
- "hits" : {
- "total" : 0,
- "max_score" : null,
- "hits" : [ ]
- }
- }
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/bitnami_mediawiki_content_first/_search?q=user:cooper&pretty'
- {
- "took" : 6,
- "timed_out" : false,
- "_shards" : {
- "total" : 4,
- "successful" : 4,
- "failed" : 0
- },
- "hits" : {
- "total" : 0,
- "max_score" : null,
- "hits" : [ ]
- }
- }
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/mw_cirrus_metastore/_search?q=user:cooper&pretty'
- {
- "took" : 4,
- "timed_out" : false,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "failed" : 0
- },
- "hits" : {
- "total" : 0,
- "max_score" : null,
- "hits" : [ ]
- }
- }
- bitnami@haystack-np:~$ curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/bitnami_mediawiki_content_first/_search?q=user:cooper&cirrusDumpResult&pretty'
- {
- "error" : {
- "root_cause" : [
- {
- "type" : "illegal_argument_exception",
- "reason" : "request [/bitnami_mediawiki_content_first/_search] contains unrecognized parameter: [cirrusDumpResult]"
- }
- ],
- "type" : "illegal_argument_exception",
- "reason" : "request [/bitnami_mediawiki_content_first/_search] contains unrecognized parameter: [cirrusDumpResult]"
- },
- "status" : 400
- } Nagaindukuri (talk) 20:06, 16 April 2018 (UTC)
- Your indices seem to be empty, except for 19 docs in the general index for your wiki (bitnami_mediawiki). Have you tried searching on all namespaces to see if you can display one of these results from MediaWiki?
- Also note that cirrusDumpResult is a URI param for MediaWiki, not Elasticsearch.
curl -XGET 'https://aws-vpc-endpoint.es.amazonaws.com/bitnami_mediawiki_general/_search?pretty'
- should display a few of the results you seem to have indexed properly; use one of the words you see there to search on every namespace using the MediaWiki Special:Search.
- If you see some results, CirrusSearch is working. DCausse (WMF) (talk) 08:06, 17 April 2018 (UTC)
- Thank you, and yes, we did the testing; see the result below.
- https://www.myproject.com/Main_Page?search=biw&title=Special:Search&profile=default&fulltext=1
- [98eaa51221b461b76c67b7fa] /Main_Page?search=biw&title=Special:Search&profile=default&fulltext=1 Error from line 474 of /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php: Call to a member function searchContainedSyntax() on boolean
- Backtrace:
- #0 /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php(384): SpecialSearch->showCreateLink(Title, integer, NULL, boolean)
- #1 /opt/bitnami/apps/mediawiki/htdocs/includes/specials/SpecialSearch.php(185): SpecialSearch->showResults(string)
- #2 /opt/bitnami/apps/mediawiki/htdocs/includes/specialpage/SpecialPage.php(522): SpecialSearch->execute(NULL)
- #3 /opt/bitnami/apps/mediawiki/htdocs/includes/specialpage/SpecialPageFactory.php(578): SpecialPage->run(NULL)
- #4 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(287): SpecialPageFactory::executePath(Title, RequestContext)
- #5 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(851): MediaWiki->performRequest()
- #6 /opt/bitnami/apps/mediawiki/htdocs/includes/MediaWiki.php(523): MediaWiki->main()
- #7 /opt/bitnami/apps/mediawiki/htdocs/index.php(43): MediaWiki->run()
- #8 {main} 164.144.252.28 (talk) 16:38, 19 April 2018 (UTC)
- https://www.myproject.com/Main_Page?search=biw&title=Special:Search&profile=default&fulltext=1&cirrusDumpResult
- false Nagaindukuri (talk) 18:33, 19 April 2018 (UTC)
Suggestion: Expose and filter content by page views
Issue:
As a user, I'm interested in finding the most likely relevant pages based on page views.
Background
When searching for content such as media, there is no way to filter and choose images based on their "perceived" relevance. For example, if I want to find pictures of cats, chances are that commons has millions (https://www.mediawiki.org/w/index.php?title=Special:Search&profile=images&search=cat+filetype%3Aimage&fulltext=1&searchToken=ybda1dsf9dxkay60nn78qap2). It will also have a lot of irrelevant results that the user must sift through.
Other use cases
- This could be used to enhance the media insertion dialog using a better search parameter.
- Partly replace https://tools.wmflabs.org/massviews/
- Partly replace https://tools.wmflabs.org/mediaviews/?range=latest-20&files=
- Make it possible for editors to find popular topics to work on, e.g. with Special:LintErrors (and insource:) one could easily start working on the most visible pages
Proposed solution
- Add either a numerical count or a "bar" indicating page views / popularity.
- Add a keyword to filter and sort them, e.g.: "pageviews:>5000"
- 197.218.83.43 (talk) 09:59, 3 May 2018 (UTC)
Search finds files by filename but no content within PDFs
I've set up a MediaWiki 1.30.0 with CirrusSearch 0.2 and Elastica 1.3.0.0 as extensions, as well as PdfHandler.
The search itself in the wiki is working fine - finds text from wiki pages as well as filenames.
But my main goal is to search WITHIN the PDF files, which should be possible using CirrusSearch ... I installed Elasticsearch 5.4.3, which is running well as a service on my Windows 10 machine.
Running the maintenance with:
C:\wamp64\www\mediawiki\extensions\CirrusSearch\maintenance\updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier=now
it runs for about a minute and seems to parse my 5 PDF files, currently uploaded in my wiki.
As a result I get this, which doesn't look like an error:
content index... Fetching Elasticsearch version...5.4.3...ok Scanning available plugins...none Setting index identifier...my_wiki_content_1525866681 Picking analyzer...english Creating index...ok Validating number of shards...ok Validating replica range...ok Validating shard allocation settings...done Validating max shards per node...ok Validating analyzers...ok Validating mappings... Validating mapping...different...corrected Validating aliases... Validating my_wiki_content alias...is taken... Reindexing... Started reindex task: NzCI7echStm_vnG6pyK0_w:4310 Task: NzCI7echStm_vnG6pyK0_w:4310 Search Retries: 0 Bulk Retries: 0 Indexed: 0 / 3 Task: NzCI7echStm_vnG6pyK0_w:4310 Search Retries: 0 Bulk Retries: 0 Indexed: 3 / 3 Verifying counts...done Optimizing...Done Validating number of shards...ok Validating replica range...ok Validating shard allocation settings...done Validating max shards per node...is 12 but should be unlimited...corrected Waiting for all shards to start... active:4/4 relocating:0 initializing:0 unassigned:0 Swapping alias...done Removing old indices... my_wiki_content_first...done Validating my_wiki alias...alias not already assigned to this index...corrected Validating number of shards...ok Validating replica range...ok Validating shard allocation settings...done Validating max shards per node...ok Updating tracking indexes...done general index... Fetching Elasticsearch version...5.4.3...ok Scanning available plugins...none Setting index identifier...my_wiki_general_1525866712 Picking analyzer...english Creating index...ok Validating number of shards...ok Validating replica range...ok Validating shard allocation settings...done Validating max shards per node...ok Validating analyzers...ok Validating mappings... Validating mapping...different...corrected Validating aliases... Validating my_wiki_general alias...is taken... Reindexing... Started reindex task: NzCI7echStm_vnG6pyK0_w:4468 Task: NzCI7echStm_vnG6pyK0_w:4468 Search Retries: 0 Bulk Retries: 0 Indexed: 0 / 5 Task: NzCI7echStm_vnG6pyK0_w:4468 Search Retries: 0 Bulk Retries: 0 Indexed: 5 / 5 Verifying counts...done Optimizing...Done Validating number of shards...ok Validating replica range...ok Validating shard allocation settings...done Validating max shards per node...is 12 but should be unlimited...corrected Waiting for all shards to start... active:4/4 relocating:0 initializing:0 unassigned:0 Swapping alias...done Removing old indices... my_wiki_general_first...done Validating my_wiki alias...alias not already assigned to this index...corrected Validating number of shards...ok Validating replica range...ok Validating shard allocation settings...done Validating max shards per node...ok Updating tracking indexes...done Deleting namespaces...done Indexing namespaces...done
But in the end, the search in the wiki doesn't return any results from searching within the pdf files.
Is there something I missed? 213.211.236.242 (talk) 12:10, 9 May 2018 (UTC)
- I think that a maintenance script has to be run to index PDF files that were previously uploaded.
- If you try to upload a new file, will its content be properly searched?
- If yes, I think you need to run refreshImageMetadata.php -f and rebuildImages.php -f, as shown below.
- See the debugging section in PdfHandler. DCausse (WMF) (talk) 10:06, 11 May 2018 (UTC)
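- For instance, run from the MediaWiki installation directory (the paths are assumed; -f is the flag suggested above):
php maintenance/refreshImageMetadata.php -f
php maintenance/rebuildImages.php -f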
- Thanks for the reply.
- If I upload new PDFs, their content is not found either.
- Running refreshImageMetadata.php -f I get an error for every uploaded file like (translated from German): "Wrong syntax: file name, directory ...".
- Diving into the code of refreshImageMetadata.php, I saw that $row->img_name only contains the pure filename xyz.pdf without any folder information.
- The error occurs around line 170 at: $file->upgradeRow(); 213.211.236.242 (talk) 12:37, 16 May 2018 (UTC)
Elastic Search Can't Find Java
RESOLVED | |
Elasticsearch and Java are mandatory dependencies |
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
The following command (from https://www.elastic.co/guide/en/elasticsearch/reference/current/zip-targz.html)
./bin/elasticsearch
gives the following error:
/home/gunsywtx/public_html/extensions/elasticsearch-6.2.4$ ./bin/elasticsearch which: no java in (/home/gunsywtx/perl5/bin:/usr/local/cpanel/3rdparty/lib/path-bin:/usr/local/cpanel/3rdparty/lib/path-bin:/usr/local/jdk/bin:/usr/local/cpanel/3rdparty/lib/path-bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/cpanel/composer/bin:/usr/local/bin:/usr/X11R6/bin:/opt/puppetlabs/bin:/opt/dell/srvadmin/bin:/home/gunsywtx/bin) could not find java; set JAVA_HOME or ensure java is in PATH
Any fix? Johnywhy (talk) 12:22, 3 June 2018 (UTC)
- Just found out my shared web-host does not have Java.
- Any way to use CirrusSearch without Java?
- thx Johnywhy (talk) 12:32, 3 June 2018 (UTC)
- No. [[kgh]] (talk) 12:54, 3 June 2018 (UTC)
CirrusSearch Only Partially Indexing
I posted this on the discussion for Help:CirrusSearch but am doing it here as well to see if I might find a solution.
I have a wiki running on a dev server with the following:
Product | Version |
---|---|
MediaWiki | 1.27.4 |
PHP | 5.6.25 (apache2handler) |
MariaDB | 5.5.56-MariaDB |
Elasticsearch | 1.7.6 |
Recently installed CirrusSearch, and it works as expected except for one issue: it's only returning a partial number of pages in the search results. For example, there are about 200 pages (yeah, it's not big) in the main namespace, but only 20 are returned. Likewise, there are about 1800 images, but only 160 are returned. I increased the memory for elasticsearch, but that had no discernible effect. Elastica is up and running. Null edits force the changes through, but I'd rather not do this 1K+ times.
Any ideas/suggestion as to how to fix this? Thanks in advance. 199.16.64.3 (talk) 15:59, 11 June 2018 (UTC)
- I think the first step would be to know if the problem is at index time or search time.
- Could you tell us if the output of the forceSearchIndex.php maintenance script is sane compared to the number of docs you have? (It outputs: Indexed a total of XYZ pages at Y/s.)
- To troubleshoot the issue I'd suggest that you paste the output of these commands:
- To know how many docs have been indexed, you can ask Elasticsearch with:
curl localhost:9200/wiki_name/_count?pretty
- Having the list of indices in Elasticsearch might also help to troubleshoot the issue:
curl localhost:9200/_cat/indices
- An example search query sent by Cirrus to Elasticsearch: you can obtain it by appending &cirrusDumpQuery to the search results page URL.
- Thanks! DCausse (WMF) (talk) 07:39, 12 June 2018 (UTC)
- Thanks for the response! I'll get on this soon and respond in the next day or so. 199.16.64.3 (talk) 14:09, 12 June 2018 (UTC)
- Ok this is what I got running the commands.
- After running forceSearchIndex.php --skipLinks --indexOnSkip:
Skipping page with no content: 896 [wikidatabase] Indexed 9 pages ending at 900 at 18/second
- After running forceSearchIndex.php --skipParse:
Indexed a total of 3716 pages at 197/second
- After running curl localhost:9200/wiki_name/_count?pretty:
{ "error" : "IndexMissingException[[mediawiki] missing]", "status" : 404 }
- But when running the command as curl localhost:9200/_count?pretty:
- { "count" : 938, "_shards" : { "total" : 10, "successful" : 10, "failed" : 0 } }
- When running the command curl localhost:9200/_cat/indices:
green open mediawiki_cirrussearch_frozen_indexes 1 0 0 0 144b 144b green open mw_cirrus_versions 1 0 2 2 3.5kb 3.5kb green open wikidatabase_general_first 4 0 800 574 8.1mb 8.1mb green open wikidatabase_content_first 4 0 136 18 41.7mb 41.7mb
- And this is the object returned when appending &cirrusDumpQuery to search for example term "rock":
{"description":"full_text search for 'rock'","path":"wikidatabase\/page\/_search","params":{"search_type":"dfs_query_then_fetch","timeout":"20s"},"query":{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"filtered":{"query":{"bool":{"minimum_number_should_match":1,"should":[{"query_string":{"query":"rock","fields":["all.plain^1","all^0.5"],"auto_generate_phrase_queries":true,"phrase_slop":0,"default_operator":"AND","allow_leading_wildcard":true,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024"}},{"multi_match":{"fields":["all_near_match^2"],"query":"rock"}}]}},"filter":{"terms":{"namespace":[0,102,108]}}}},"highlight":{"pre_tags":["</nowiki><nowiki><span class=\"searchmatch\">"],"post_tags":["<\/span>"],"fields":{"title":{"number_of_fragments":0,"type":"fvh","order":"score","matched_fields":["title","title.plain"]},"redirect.title":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","matched_fields":["redirect.title","redirect.title.plain"]},"category":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","matched_fields":["category","category.plain"]},"heading":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","matched_fields":["heading","heading.plain"]},"text":{"number_of_fragments":1,"fragment_size":150,"type":"fvh","order":"score","no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"number_of_fragments":1,"fragment_size":150,"type":"fvh","order":"score","matched_fields":["auxiliary_text","auxiliary_text.plain"]}},"highlight_query":{"query_string":{"query":"rock","fields":["title.plain^20","redirect.title.plain^15","category.plain^8","heading.plain^5","opening_text.plain^3","text.plain^1","auxiliary_text.plain^0.5","title^10","redirect.title^7.5","category^4","heading^2.5","opening_text^1.5","text^0.5","auxiliary_text^0.25"],"auto_generate_phrase_queries":true,"phrase_slop":1,"default_operator":"AND","allow_leading_wildcard":true,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024"}}},"suggest":{"text":"rock","suggest":{"phrase":{"field":"suggest","size":1,"max_errors":2,"confidence":2,"real_word_error_likelihood":0.95,"direct_generator":[{"field":"suggest","suggest_mode":"always","max_term_freq":0.5,"min_doc_freq":0,"prefix_length":2}],"highlight":{"pre_tag":"<em>","post_tag":"<\/em>"},"smoothing":{"stupid_backoff":{"discount":0.4}}}}},"stats":["suggest","full_text"],"size":20,"rescore":[{"window_size":8192,"query":{"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply","rescore_query":{"function_score":{"functions":[{"field_value_factor":{"field":"incoming_links","modifier":"log2p","missing":0}},{"weight":"0.2","filter":{"terms":{"namespace":[102,108]}}}]}}}}]},"options":{"search_type":"dfs_query_then_fetch","timeout":"20s"}}
- <b>Notice</b>: Uncommitted DB writes (transaction from DatabaseBase::query (User::loadFromDatabase)). in <b>/opt/rh/httpd24/root/var/www/html/mediawiki/includes/db/Database.php</b> on line <b>3306</b><br />
- Thanks! 104.162.109.170 (talk) 22:51, 15 June 2018 (UTC)
- I don't see anything obviously wrong in the outputs you've pasted.
- You mentioned that your wiki has 200 pages and about 1800 images, but the _count reports 938 docs being indexed in total (including some non-page data such as namespace names and other metadata).
- I would suggest finding a page/image that you are unable to find via search and narrowing the investigation down to it to understand why it's not indexed. To do this, pick a random image/page and search for a few words from its title; if you cannot find it using Special:Search (be careful to select the proper namespaces), then you have found a bogus page.
- Then try to identify its page id by appending the ?action=info URI param to the page URL.
- Using this page id, try to run:
forceSearchIndex.php --fromId ID --toId ID+1
- to see if the maint script is able to repopulate this particular page.
- You may also want to run the sanitizer, which will try to identify and fix inconsistencies in the index:
saneitizer.php
- So in the end it's unclear to me what is causing this behavior; I don't see any errors except the Notice: Uncommitted DB writes that you pasted at the end of the message. Do you remember which command generated this error?
- Good luck! DCausse (WMF) (talk) 08:39, 22 June 2018 (UTC)
ElasticSearch 5.6.10 "missing authentication token"
RESOLVED | |
was using the xpack security plugin without a license |
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Hi, I'm trying to upgrade a MediaWiki deployment. I'm using MediaWiki 1.31.0, the corresponding CirrusSearch, and ElasticSearch 5.6.10. When starting with a fresh instance of ElasticSearch, I'm getting this error:
$ php updateSearchIndexConfig.php content index... Fetching Elasticsearch version... Unexpected Elasticsearch failure. Elasticsearch failed in an unexpected way. This is always a bug in CirrusSearch. Error type: Elastica\Exception\ResponseException Message: security_exception: missing authentication token for REST request [/] Trace: #0 /var/www/html/mediawiki-1.31.0/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Request.php(193): Elastica\Transport\Http->exec(Object(Elastica\Request), Array) #1 /var/www/html/mediawiki-1.31.0/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(674): Elastica\Request->send() #2 /var/www/html/mediawiki-1.31.0/extensions/CirrusSearch/includes/Maintenance/ConfigUtils.php(45): Elastica\Client->request('') #3 /var/www/html/mediawiki-1.31.0/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php(227): CirrusSearch\Maintenance\ConfigUtils->checkElasticsearchVersion() #4 /var/www/html/mediawiki-1.31.0/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(58): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->execute() #5 /var/www/html/mediawiki-1.31.0/maintenance/doMaintenance.php(94): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute() #6 /var/www/html/mediawiki-1.31.0/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(65): require_once('/var/www/html/m...') #7 {main}The docs don't seem to say anything about needing an authentication token. What am I doing wrong? 2620:11E:1000:120:1792:B56B:4655:6029 (talk) 19:55, 26 June 2018 (UTC)
- You seem to be using XPack security.
- Does it resolve your issue if you set:
xpack.security.enabled: false
- in your elasticsearch.yml config file? DCausse (WMF) (talk) 08:33, 27 June 2018 (UTC)
- Indeed, the root cause was that I was using the non-open-source elasticsearch docker images. Switched to the -oss ones and this problem went away. 2620:11E:1000:120:1792:B56B:4655:6029 (talk) 21:24, 11 July 2018 (UTC)
How to skip template metadata in search results?
Hi,
Most of the pages on my website use a template with metadata at the top of the page. The standard MW search engine didn't include the metadata in the search results, but CirrusSearch/Elasticsearch does.
The information in the metadata fields is important, both as an overview of the content and in searches, but it does not look pretty in the search results (e.g. "Published: 2001-08-03 Keywords: Some words Author: Some name Summary: Some sentences").
Is there a way to get the search result to look better when pages use a template?
Maybe it's possible to only show the metafield "Summary" as a first choice, and if it's empty, then show the start of the article that comes after the metadata field?
or
Just show the content of the article (skip the meta fields), like the standard MW search do?
Thanks for any advice on how to do this. Pretor~nowiki (talk) 04:50, 9 July 2018 (UTC)
- Have you looked into the possibility of excluding some content from the search index?
- Please see Help:CirrusSearch#Exclude_content_from_the_search_index DCausse (WMF) (talk) 20:14, 25 July 2018 (UTC)
How to config CirrusSearch with multiple wikis (same host)
I'm using this guide (with a few changes) and got the wikis and Parsoid + WikiEditor working, but not CirrusSearch. It always complains: An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later. Unfortunately, I didn't find any tutorials or solutions that work, or at least show in a log what's wrong. All the scripts run without errors (updateSearchIndexConfig.php, forceSearchIndex.php, ...), but it always points to my old setup :(
How to fix that? 14.231.223.128 (talk) 07:19, 21 July 2018 (UTC)
- Perhaps it's because you load CirrusSearch before setting $wgDBname?
- I think the proper order would be (see the sketch after this list):
- set $wgDBname
- load CirrusSearch
- set the CirrusSearch config DCausse (WMF) (talk) 09:07, 23 July 2018 (UTC)
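- A minimal LocalSettings.php sketch of that ordering; the database name and server address are placeholders, and the load style mirrors the one used elsewhere on this page:
$wgDBname = 'examplewiki'; // 1. set the wiki's database name first
wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php"; // 2. then load CirrusSearch
$wgCirrusSearchServers = [ 'localhost' ]; // 3. then set its configuration
$wgSearchType = 'CirrusSearch';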
wmf extra plugin
I read here regarding ICU folding:
- Requires the ICU plugin installed and a recent wmf extra plugin (>= 2.3.4)
What and where is the wmf extra plugin? Spiros71 (talk) 10:01, 22 July 2018 (UTC)
- This is an elasticsearch plugin, it is located here: https://gerrit.wikimedia.org/r/plugins/gitiles/search/extra/+/master (this page contains instructions on how to install it). DCausse (WMF) (talk) 07:10, 23 July 2018 (UTC)
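- Installation typically goes through Elasticsearch's standard plugin tool; the Maven coordinates below are an illustrative assumption, so check the plugin's own instructions for the build matching your Elasticsearch version:
./bin/elasticsearch-plugin install org.wikimedia.search:extra:5.6.5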
Suggestion: Show exact search string used
Issue
As a user I'd like to know exactly what was searched for.
Background
A common problem with search engines is that they unilaterally rewrite search strings for the user and sometimes show unexpected, incorrect search results.
Example: Search string "Weird-stuff"
Expected
Search results include a note "searching for 'weird' AND 'stuff'". Alternatively, show normal results but suggest that the user include quotes ("weird-stuff").
Actual
- en.Wikipedia: https://en.wikipedia.org/w/index.php?search=weird-stuff
- Google : www.google.com/search?q=weird-stuff
- Phabricator: https://secure.phabricator.com/search/query/QrVJhKn8_rnJ/#R
Note: 1 & 2 currently remove the "-" and doesn't even inform the user, secure.phabricator shows expected results.
Example 2: Advanced search string (using engine specific keywords)
- -calendar intitle:weird - https://en.wikipedia.org/w/index.php?search=weird
- -calendar title:weird - https://secure.phabricator.com/search/query/V06rHk5LER_U/#R
- -calendar allintitle:weird - //www.google.com/search?q=allintitle:weird
Note: secure.phabricator currently clearly notes strings that are excluded, and emphasizes that titles are matched rather than a string "title".
Proposed solution
For simple strings, show exactly what strings were searched for (e.g. if symbols were dropped); for keyword searches, include the matched keywords. 197.218.92.1 (talk) 11:28, 25 July 2018 (UTC)
- I'm afraid that there would be too many variations of the search string to display, as it is processed differently depending on the field it is looking in. So in the end we would have to display about a dozen variations of the search query for every identified word.
- I think the approach that is generally accepted is to have a large search scope by default and provide the necessary syntax to allow searchers to narrow the search results.
- Here are the features we usually use to make stricter searches:
- wrap words in double quotes: "weird stuff" will force weird stuff to appear close to each other and disable stemming
- use insource:"weird stuff" to search only the source text (excluding template transclusion)
- use insource:/weird-stuff/ (slow) to search exactly for weird-stuff (used to search specific queries where punctuation is important)
- Please see https://blog.wikimedia.org/2017/11/06/searching-techniques/.
- Concerning what Phabricator does: "explaining" the search query syntax is a good idea and we may try to display this in the future.
- But it's unlikely that we'll be able to display all the variations attempted when analyzing a word (case folding, diacritics removal, stemming,...). DCausse (WMF) (talk) 14:28, 25 July 2018 (UTC)
- Yes, it is not necessary to indicate that stemming and case folding are applied. The primary idea here is to give users a glimpse of what exactly is being searched; for example, with the allintitle: keyword, Google adds buttons to indicate when a word is not included at all in the search ("missing: weird").
- I'd say that even a simple explanation of the search would be immensely useful. Consider the case that someone copy-pastes a random phrase like "book -shakespeare" or "gadget:the+movie" and search results immediately exclude lots of data. In the latter case this is completely strange.
- > I think the approach that is generally accepted is to have a large search scope by default and provide the necessary syntax to allow searchers to narrow the search results.
- People can't fix or narrow down things if they aren't even aware why it isn't working. They'll either retry and give up, claim that the search engine is broken, or search for it elsewhere.
- The basic idea would be something like this:
- Indicate when certain symbols are dropped, e.g. "stuff !@#$%^&*" -> "stuff"
- Show search tokens separately, especially when they are originally one token, e.g. allez-vous -> "allez vous"
- Show a separate message when all search terms are discarded, e.g. compare "special:search/$$$" vs Special:Search/insource:/\$\$\$/.
- Maybe some indication when special search keywords are active - Phabricator oddly drops them from the "searched for".
- 197.218.80.177 (talk) 16:38, 25 July 2018 (UTC)
- I get your point and I agree that it may be frustrating.
- But I worry about the technical implications since nothing is really trivial in search.
- For instance, when you say that the dollar sign is dropped from the search query, this is only partially true: the dollar is kept to match the titles $O$ or $.
- It simply does not match anything in the content of the page.
- As for the suggestion for adding "missing: weird" like what google does, it's really powerful but as of today we require all the terms to match. What is probably misleading is that all the words that matched may not be obvious to find in the original doc for the searcher:
- the text snippet may not always highlight it (accuracy problem with limited space)
- it's perhaps part of something that is not directly visible on the page (content hidden behind a show/hide section, hidden category)
- poor ranking
- If someday we relax the query and allow some terms not to match we'll certainly have to do what google does to limit frustration in certain cases.
- As for the punctuation I don't really know how to make this more fluent and less surprising for the user, perhaps what you suggest is the right solution but I still don't know how to decide which analyzer to run to show the "sanitized version" of the search string (as discussed before many different analyzers are run on the search query).
- Thanks for your suggestions. DCausse (WMF) (talk) 19:49, 25 July 2018 (UTC)
- Thanks for the explanations.
- I'd say the problem here is like attempting to look left and right at the same time. Wikis try to cater to pure readers who will never edit a page and don't care even a bit about advanced syntax, and editors who love these things. The end result is a tool that is not a great fit for either of these. For example, if the goal was only to make this interface intuitive for readers, then a colon would never have special meaning, and the search engine would always offer to escape whatever string the editor uses.
- Personally, I'm a fan of simplicity and using the simplest approach that works +80% of the time:
- If the string contains any unusual token (Loo$%^&:) then simply fall back to suggesting that the user run a search without it (e.g. "Showing results for "Loo $%^&:", try searching for "Loo" for potentially better results").
- If the string is plain alphanumeric or equivalent for non-latin languages show tokens separately (if applicable).
- Lastly, show a small hint whenever advanced filters are active.
- It would be good if there was a way to validate that "Loo $%^&:" and "Loo" will yield the same internal search query. But even if not, it would still be a good idea to suggest it anyway. Then it won't really matter if certain characters are dropped or not. This also covers almost all scenarios by always providing feedback the user can use to improve search results.
- I do realize that this might never be added due to the search complications mentioned. 197.218.89.228 (talk) 09:58, 26 July 2018 (UTC)
Suggestion: Add a character at the end and start of title for intitle regex (or suffixsearch)
Issue:
It seems very hard or nearly impossible to simply match the last words of a title using regex.
Background:
Looking through tons of docs about lucene search this seems to always be a strangely missing implementation. It seems that the primary reason is that it is time consuming or inefficient to create. However, adding a marker to every string would make this possible.
Proposed solutions:
- Add a character e.g. \n to the end (and start) of every title that can be matched by "\\n"
- Hide these in search results so it doesn't confuse regular users
This would make it possible to write stuff like "/.*suffix\\n/" and always match the end of the string.
Alternatively, there seems to be an idea about a suffix search: https://discuss.elastic.co/t/ends-with-operator-in-elastic-search/139352/3:
using a custom analyzer that reverses tokens using the reverse token filter, uses the edge n-gram token filter to generate reversed prefixes, and reverses the prefixes again to get suffixes using the reverse token filter
Or, perhaps there is a way with the current regex that isn't obvious? 197.218.80.174 (talk) 21:49, 30 July 2018 (UTC)
- Actually, any illegal title character (Help:Bad title) would suffice to indicate it; maybe "|" would be good enough. 197.218.80.174 (talk) 23:08, 30 July 2018 (UTC)
- Due to how our regex search is implemented, this is essentially a request to support start/end (^/$) in the regex syntax. I don't think it would be too hard to adjust our existing regex plugin to do that, it's simply not something that is handled currently.
- The addition of a start/end marker will still be useful for the query acceleration phase which reduces the number of documents we need to run the regex on. EBernhardson (WMF) (talk) 22:47, 1 August 2018 (UTC)
- Ah, now it makes sense. It is no wonder that the regex works differently from the description in https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#_standard_operators. So CirrusSearch doesn't anchor it by default like Lucene does?
- Adding such anchoring would also solve tasks like https://phabricator.wikimedia.org/T90090, and reduce or eliminate the need for https://phabricator.wikimedia.org/T12808.
- As it would also improve performance it seems like a generally good idea, assuming that it is not time consuming to add the relevant code. 197.218.92.173 (talk) 10:30, 2 August 2018 (UTC)
Indexes not updating after editing/creating article
Hi there,
I have set up CirrusSearch by following the installation instructions here (the only difference is, as I'm on MW 1.28.3, I downloaded the corresponding REL versions for both Elastica and CirrusSearch from GitHub).
The initial index is created perfectly; however, any edits to articles/templates or new article creation do not spawn any "cirrusSearchLinksUpdate" jobs in the job queue.
This is my setup:
Product | Version |
---|---|
MediaWiki | 1.28.3 |
PHP | 7.0.30-0ubuntu0.16.04.1 (fpm-fcgi) |
MySQL | 5.7.22-0ubuntu0.16.04.1 |
Elasticsearch | 2.3.3 |
REL1_28 for both:
CirrusSearch | 0.2 | GPL-2.0+ | Elasticsearch-powered search for MediaWiki | Nik Everett, Chad Horohoe, Erik Bernhardson and others |
Elastica | 1.3.0.0 | GPL-2.0+ | Base Elasticsearch functionality for other extensions by providing the Elastica library | Nik Everett and Chad Horohoe |
LocalSettings.php:
wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
$wgSearchType = 'CirrusSearch';
Any idea what's causing this?
Kind regards,
Viktor 84.105.220.61 (talk) 10:09, 6 August 2018 (UTC)
- It seems I'm facing the same issue, with MediaWiki 1.31, PHP 7.0.30-0+deb9u1, SQLite 3.16.2 and Elasticsearch 5.6.10. Aretni (talk) 09:48, 15 August 2018 (UTC)
- Did you switch $wgDisableSearchUpdate back to false after running the first maint scripts to populate the indices? See the sketch below. DCausse (WMF) (talk) 07:49, 11 September 2018 (UTC)
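- A minimal sketch of that two-phase toggle in LocalSettings.php (the comments are illustrative):
// Phase 1, only while building the initial index with the maintenance scripts:
// $wgDisableSearchUpdate = true;
// Phase 2, once updateSearchIndexConfig.php and forceSearchIndex.php have run:
$wgDisableSearchUpdate = false;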
cirrusDumpQuery for geosearch
RESOLVED | |
no debug param for GeoData query can be seen from the source code |
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Does Nearby/geosearch use elasticsearch? I have been trying to see how geosearch queries to elastic search look and I keep getting an 'unrecognized parameter' error.
Here is a sample query I tried...
https://en.wikipedia.org/w/api.php?action=query&list=geosearch&gscoord=37.786952%7C-122.399523&gsradius=10000&gslimit=10&cirrusDumpQuery 2402:3A80:47A:8AF2:A577:374D:27A0:6729 (talk) 12:40, 29 August 2018 (UTC)
- Yes it uses elastic but the GeoData extension does not support this debug param.
- If you want to see what the query looks like, please take a look at the GeoData extension source code. DCausse (WMF) (talk) 13:16, 29 August 2018 (UTC)
- Thanks! Will check it out. 2402:3A80:47A:8AF2:A577:374D:27A0:6729 (talk) 13:23, 29 August 2018 (UTC)
Highlighting search term in end page
Is there a way to highlight the search term in the page chosen as a match by the end-user and jump the browser to that section? Given my (limited) knowledge of the MediaWiki architecture, that would be a tall order to implement, but in javascript, it could be relatively easy.
Thanks! Tinss (talk) 02:56, 13 September 2018 (UTC)
- Implementing this through the backend directly would indeed be a bit painful. Some JavaScript that parses the highlight and finds candidate(s) on the result page seems like a good hack-a-thon project for someone. EBernhardson (WMF) (talk) 17:54, 13 September 2018 (UTC)
- Ok. I've put that in my todo list. Once the widget is done, I'll share it with the MediaWiki community. Tinss (talk) 22:26, 13 September 2018 (UTC)
- Hi @Tinss, Did you actually create this? Nischayn22 (talk) 17:17, 22 April 2019 (UTC)
- Sorry no, I haven't had the time to do so. Tinss (talk) 21:56, 23 April 2019 (UTC)
Suggestion: Automatically suggest titles from other namespaces when search fails
Issue: Sometimes a user can't quite recall in what namespace a title exists.
Example: createaccountblock doesn't exist in the default searched namespaces, but currently there is a page on this wiki with that exact title.
Proposed solution
- Suggest exact title matches in other namespaces when no exact "title" match exists in default namespaces
- Add a prefix to all single-namespace searches, e.g. if someone searches for "dogsandcats" in the user namespace it should either show an existing page or a redlink prefixed by the namespace ("user:dogsandcats"). That would indirectly solve the complaint in Help talk:CirrusSearch/2018#h-This_is_the_most_worst_search_engine_on_the_internet-2018-06-09T18:49:00.000Z .
- Provide a hint to the user to search all namespaces instead. 197.235.89.142 (talk) 09:13, 15 September 2018 (UTC)
Cannot spawn child: CirrusSearch\Maintenance\IndexNamespaces
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
MediaWiki 1.31.0
PHP 7.1.8 (apache2handler) MySQL 5.6.10 elasticsearch 6.4.1
I'm trying to set up CirrusSearch and I'm running into issues.
$wgDisableSearchUpdate = true;
if ( !$wgDisableSearchUpdate ) {
	require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );
	$wgCirrusSearchServers = array( 'server' );
	$wgSearchType = 'CirrusSearch';
	$wgCirrusSearchUseExperimentalHighlighter = false;
	$wgCirrusSearchOptimizeIndexForExperimentalHighlighter = false;
	$wgCirrusSearchEnableRegex = false;
	$wgCirrusSearchUseCompletionSuggester = 'no';
}
Error:
php extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php indexing namespaces... Cannot spawn child: CirrusSearch\Maintenance\IndexNamespaces [f9201610b391868b0f987974] [no req] Error from line 675 of wiki/maintenance/Maintenance.php: Class 'CirrusSearch\Maintenance\IndexNamespaces' not found Backtrace: #0 extensions/CirrusSearch/includes/Maintenance/Maintenance.php(87): Maintenance->runChild(string, NULL) #1 extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(56): CirrusSearch\Maintenance\Maintenance->runChild(string) #2 wiki/maintenance/doMaintenance.php(94): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute() #3 extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(73): require_once(string) #4 {main} Legaulph (talk) 10:48, 27 September 2018 (UTC)
- I had to switch to elasticsearch 5.3.0.
- Now it is indexing Legaulph (talk) 17:35, 27 September 2018 (UTC)
- After the indexing I now get an error during wiki search:
- Legaulph (talk) 18:18, 27 September 2018 (UTC)
[W60ecicuWZeAtlHZ2ZlYYwAAAAM] /index.php?title=Special%3ASearch&search=PM&go=Go Error from line 57 of /app/mediawiki/extensions/CirrusSearch/includes/Search/SearchRequestBuilder.php: Call to undefined method Elastica\Query::setStoredFields() Backtrace: #0 /app/mediawiki/extensions/CirrusSearch/includes/Searcher.php(453): CirrusSearch\Search\SearchRequestBuilder->build() #1 /app/mediawiki/extensions/CirrusSearch/includes/Searcher.php(461): CirrusSearch\Searcher->buildSearch() #2 /app/mediawiki/extensions/CirrusSearch/includes/Searcher.php(199): CirrusSearch\Searcher->searchOne() #3 /app/mediawiki/extensions/CirrusSearch/includes/Hooks.php(542): CirrusSearch\Searcher->nearMatchTitleSearch(string) #4 /app/mediawiki/includes/Hooks.php(177): CirrusSearch\Hooks::onSearchGetNearMatch(string, NULL) #5 /app/mediawiki/includes/Hooks.php(205): Hooks::callHook(string, array, array, NULL) #6 /app/mediawiki/includes/search/SearchNearMatcher.php(123): Hooks::run(string, array) #7 /app/mediawiki/includes/search/SearchNearMatcher.php(32): SearchNearMatcher->getNearMatchInternal(string) #8 /app/mediawiki/includes/specials/SpecialSearch.php(253): SearchNearMatcher->getNearMatch(string) #9 /app/mediawiki/includes/specials/SpecialSearch.php(143): SpecialSearch->goResult(string) #10 /app/mediawiki/includes/specialpage/SpecialPage.php(522): SpecialSearch->execute(NULL) #11 /app/mediawiki/includes/specialpage/SpecialPageFactory.php(568): SpecialPage->run(NULL) #12 /app/mediawiki/includes/MediaWiki.php(288): SpecialPageFactory::executePath(Title, RequestContext) #13 /app/mediawiki/includes/MediaWiki.php(861): MediaWiki->performRequest() #14 /app/mediawiki/includes/MediaWiki.php(524): MediaWiki->main() #15 /app/mediawiki/index.php(42): MediaWiki->run() #16 {main}
- Now I have it: I installed Elasticsearch 5.6.2
- and ran php composer.phar update --no-dev in the Elastica folder. 148.177.1.215 (talk) 15:37, 28 September 2018 (UTC)
- After further analysis
Product | Version |
---|---|
MediaWiki | 1.31.0 |
PHP | 7.1.8 (apache2handler) |
MySQL | 5.6.10 |
ICU | 50.1.2 |
Elasticsearch | 5.6.12 |
Elastica | 1.3.0.0 (7019d96) 20:49, 13 April 2018 |
CirrusSearch | 0.2 |
- Starting an Elasticsearch install on Red Hat 7:
- I could not start with Elasticsearch 5.6.12 and run php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php as described in the readme.
- I had to install 2.3.3 and run php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php; this started the index and then failed, saying that 3.2.2 was not supported.
- I then re-installed Elasticsearch 5.6.12 and ran php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php, and this worked.
- Now I could finish the scripts described in the readme file. Legaulph (talk) 13:22, 1 October 2018 (UTC)
The discussion above is closed. Please do not modify it. No further edits should be made to this discussion.
action=cirrusDumpQuery is not working
Trying to troubleshoot why CirrusSearch is not returning any results in Special:Search. Pages are indexed properly. There is nothing in the error log. But during testing, we found out that certain actions from https://www.mediawiki.org/wiki/Extension:CirrusSearch#API are not working. '?action=cirrusdump' is working fine, but '?action=cirrusDumpQuery' doesn't return JSON content at all - it just redirects back to Special:Search. Any idea of where to look next? MediaWiki 1.31.0 Elasticsearch 5.6.12 CirrusSearch 0.2 Elastica 1.3.0.0 Lalquier (talk) 13:12, 1 October 2018 (UTC)
- cirrusDumpQuery is not a page action but a debug param; it must be used like URL&cirrusDumpQuery. DCausse (WMF) (talk) 15:42, 1 October 2018 (UTC)
- More on this issue. Using the right syntax, cirrusDumpQuery works fine on MW 1.30 with ES 2.4.2, but it is returning the search page instead of JSON results on MW 1.31.1 with ES 5.6.12. On that MW 1.31.1 instance, we checked that indexing is going well from wiki pages. I can run queries directly against the ES server. The issue seems to be in the 'last mile' between MW making the query to ES and rendering the search results. Lalquier (talk) 12:09, 18 October 2018 (UTC)
- @Lalquier same issue here - It simply doesn't work in 1.31.1 & ES 5.6.12 - Indexes are built no problem & I can query them from ES; however, setting Cirrus as the search type just doesn't work. 4.53.192.131 (talk) 14:30, 25 February 2019 (UTC)
- Okay, so if anyone runs into this: the solution (for me at least) was to make sure you're using CirrusSearch @ ad9a0d9 (REL1_31) and not master, or the version listed for MW 1.32. Anything after commit ad9a0d9 doesn't work properly with 1.31.
Product | Version |
---|---|
MediaWiki | 1.31.1 |
PHP | 7.2.15-0ubuntu0.18.04.1 (fpm-fcgi) |
MySQL | 5.7.25-0ubuntu0.18.04.2 |
ICU | 60.2 |
Elasticsearch | 5.6.15 |
4.53.192.131 (talk) 15:26, 25 February 2019 (UTC)
How to search content and page titles rather than just page titles with CirrusSearch
RESOLVED | |
php updateSearchIndexConfig.php --startOver && php forceSearchIndex.php |
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Hi, I replaced the MediaWiki search with the CirrusSearch extension. It works very well; at least now I can find a lot more pages based on their titles. But for some reason I am not able to get any match based on page content. For example, a search for "This is my page content" would get zero matches and would suggest creating a page with the title "This is my page content". I already went through the Extension:CirrusSearch page and its further links, but I can't find the right option. Would be awesome if someone could point me in the right direction. I am running MediaWiki version 1.29.1, PHP 7.0.32, MySQL 5.7.24. Thank you :-) Rivaldez (talk) 15:42, 1 October 2018 (UTC)
- That shouldn't generally happen, or at least there is no supported option for full-text search to only search the title. Some things I might check:
- On an article page add `&action=cirrusdump` to the url, such as: https://en.wikipedia.org/wiki/foobar?action=cirrusdump. The main question would be if the `text` field is correctly populated here.
- On the search page add `&cirrusDumpQuery` to the url, such as: https://en.wikipedia.org/wiki/Special:Search?search=kennedy&fulltext=1&cirrusDumpQuery. The main question this would answer is if the expected fields are being included in the query sent to elasticsearch. EBernhardson (WMF) (talk) 19:46, 1 October 2018 (UTC)
- Thank you very much for your time!
- I did as you suggested and added ?action=cirrusdump to a page named docker
- {
- "_index": "db_wiki_db_content_first",
- "_type": "page",
- "_id": "2528",
- "_version": [],
- "_source": {
- "version": 11697,
- "wiki": "db_wiki_db",
- "namespace": 0,
- "namespace_text": "",
- "title": "Docker",
- "timestamp": "2017-11-29T06:51:34Z"
- }
- }
- If I see it correctly there is no text? So maybe I did the indexing wrong?
- the &cirrusDumpQuery was a lot of output therefore I pushed it to an github repo.
- For me it seemed to be correct. Detail in the link below.
- Link to cirrusDumpQuery Rivaldez (talk) 07:45, 2 October 2018 (UTC)
- That's a rather surprising doc to see. A doc will be built with that shape only if `skip links` and `skip parse` are set, which should only happen from the forceSearchIndex.php maintenance script when explicitly set.
- If you edit a page, does anything in the doc returned by action=cirrusdump change? In particular the `timestamp` field should update to the revision timestamp, and `version` should update to the new revision id. If not, that suggests live updates may not be working, which is the code path that can't skip generating fields like the revision text.
- One other thing to check would be if rebuilding the index works. There is a maintenance script `forceSearchIndex.php` which can be run with no options which will iterate over all pages in the wiki and index them. EBernhardson (WMF) (talk) 21:19, 3 October 2018 (UTC)
- Thank you for your support! I finally managed to get the search working right. It's awesome now!
- What I did wrong was indeed running forceSearchIndex.php with --skipLinks and --skipParse.
- I did this because I blindly followed the instructions of the README referenced in the Extension:CirrusSearch article.
- To fix it I just reran the following two commands.
- php updateSearchIndexConfig.php --startOver
- php forceSearchIndex.php
- Thanks again :-) Rivaldez (talk) 14:54, 5 October 2018 (UTC)
The discussion above is closed. Please do not modify it. No further edits should be made to this discussion.
What to do if ElasticSearch isn't running on port 9200
I installed Percona and it took over port 9200. I found ElasticSearch on port 9201 and had to look up how to configure CirrusSearch to use an alternative port. After searching through the code I found that I could do the following:
$wgCirrusSearchServers = [ [ 'host' => "127.0.0.1", 'port' => 9201 ] ];
☠MarkAHershberger☢(talk)☣ 00:59, 20 November 2018 (UTC)
Issue: Commons "sister-search" disabled?
Issue: Search results no longer return any results from Commons
Steps to reproduce
Expected: An image from Commons on the sidebar.
Actual: Not even a single mention of Commons results.
Notes: This might be related to the new deployment of Extension:AdvancedSearch; apparently this was only disabled on English Wikipedia (and maybe a few other wikis), https://phabricator.wikimedia.org/T163463. 197.218.80.183 (talk) 13:00, 28 November 2018 (UTC)
- Just a quick note regarding AdvancedSearch. You can disable the AdvancedSearch interface completely via the user settings (this is a brand new feature). When I do that and visit the link above I also do not get an image. So this is probably unrelated to AdvancedSearch and might be something else :-/
- Best wishes,
- Christoph Christoph Jauera (WMDE) (talk) 14:17, 28 November 2018 (UTC)
- That's a good point. I suppose it can also easily be seen by disabling JavaScript as well. The odd thing is that only Commons was disabled; considering that the sidebar brings in results from multiple different wikis, one would have expected all of them to stop working at the same time. It might be just a coincidence.
- I guess the feature isn't that popular, considering that it probably stopped working quite a long time ago and nobody noticed it. Or maybe the fact that it randomly pops up only when there are images makes it harder to notice. It might be good to always suggest searching Commons for an image.
- Anyway, thanks for double-checking it ... 197.218.80.183 (talk) 14:30, 28 November 2018 (UTC)
Suggestion: Provide a plain search (no analysis)
Issue: It is currently impossible to search for an exact string that contains certain symbols.
Steps to reproduce
- Search for content that is added by a template or contains symbols , e.g. " 〃", https://en.wikipedia.org/w/index.php?search=%22%E3%80%83%22&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B0%5D%7D&ns0=1
- Go to the page and use the browser search to find it.
Expected: It should be possible to find basic symbols
Actual: Certain symbols are impossible to find
Proposed solution
- Add a plaintext keyword, e.g. "plaintext: 〃" .
This would do no analysis, no stemming, no normalization.
Notes: insource: doesn't always work for this because it can only detect content saved to the page (it can't extract transcluded content), and the default search can't address it because it is optimized for readers and tries its best to normalize searches. While looking for a way to escape Elasticsearch strings, I came across this possible solution: https://discuss.elastic.co/t/how-to-index-special-characters-and-search-those-special-characters-in-elasticsearch/42506. This could also improve the current limited "exact text" field in Extension:AdvancedSearch. 197.218.80.183 (talk) 16:02, 28 November 2018 (UTC)
- A more convincing real world example might be https://phabricator.wikimedia.org/T87136, and similar cases for other non-latin languages. 197.218.80.183 (talk) 16:12, 28 November 2018 (UTC)
- Hey IP, thanks for the suggestion. I've added a note to the phab task to mention this request. CKoerner (WMF) (talk) 15:36, 30 November 2018 (UTC)
- That task was only one use case. It will not solve the general problem; see https://www.mediawiki.org/w/index.php?title=Help%20talk%3AExtension%3AAdvancedSearch/2018#h-Issue%3A_Exact_this_search_does_not_match_exact_string-2018-11-28T14%3A41%3A00.000Z for a real world example. The problem is that while all these transformations do help in the general case, they don't always work properly for a multilingual platform like MediaWiki. So in that instance exact search will never be exact, because it will always be case insensitive, case folded, and have many tokens stripped.
- For instance, I randomly found a symbol (〆) while reading an article, and searched for it. Google finds many cases (Google: 〆 site:en.wikipedia.org), while English Wikipedia currently only finds a single one. The reason it even finds that character at all is because there is a redirect to it.
- The generic problem can probably only be solved by a different search keyword. 197.218.84.150 (talk) 10:44, 2 December 2018 (UTC)
- Yeah, the general case is different from the German daß/dass problem in that "non-word" symbols, like punctuation, are not going to be indexed even if we deal with ß/ss correctly.
- > This would do no analysis, no stemming, no normalization.
- I can see not doing stemming or normalization, but "analysis" includes tokenization, which is more or less breaking text up into words in English (and much more complex in Chinese and Japanese, for example). Would you want to skip tokenization, too?
- Without tokenization, would a search for "bot" return matches for "bot", "robot", "botulism", and "phlebotomy"? Would you want to be able to search on "ing te" and match "breaking text", but not "breaking  text" (with two spaces between the words)? Would you want searches for "text", "text,", "text.", and text" (with a trailing quotation mark) to all give different results? It sounds like the answer is yes, so I'll assume that's the case.
- The problem is that this kind of search is extremely expensive. For the current insource regex search, we index the text as trigrams (3-character sequences)—so "some text" is indexed as "som", "ome", "me " (with a final space), "e t" (with a space in the middle), " te" (with an initial space), "tex", and "ext". We try to find trigrams in a regex being searched to limit the number of documents we have to scan with the exact regex. That's why insource regex queries with only one character, or with really complex patterns with no plain text, almost always time out on English Wikipedia—they have to scan the entire document collection looking for the one character or the complex pattern. But insource queries for /ing text/ or /text\"/ have a chance—though apparently matching the trigram "ing" gives too many results in English and the query still times out! (A concrete sketch of this trigram indexing follows this reply.)
- Indexing every letter (or even every bigram) would lead to incredibly large indexes, with many index entries having millions of documents (most individual letters, all common short words like "in", "on", "an", "to", "of", and common grammatical inflections like "ed"). Right now you can search for "the" on English Wikipedia and get almost 5.7M hits. It works and doesn't time out because no post-processing of those documents is necessary to verify the hits—unlike a regex search, which still has to grep through the trigram results to make sure the pattern matches.
- An alternative might be to do tokenization such that no characters are lost, but the text is still divided into "words" and other tokens. In such a scenario, text." would probably be indexed as "text", ".", and the quote character ("), and a search for text." would not match, say, context.". There are still complications with whitespace, and a more efficient implementation that works on tokens (which is what the underlying search engine, Elasticsearch, is built to do) might still match both text . " and text.", because both have the three tokens "text", ".", and " in a row. A more exact implementation would find all documents with "text", ".", and " in them, and then scan for the exact string text." like the regex matching does—but that would have the same limitations and timeouts that the regex matching does.
- Unfortunately, your use cases are just not well supported by a full-text search engine, and that's what we have to work with. I don't think there's any way to justify the expense of supporting such an index. And even if we did build the indexes required, getting rid of timeouts and incomplete results would require significantly more servers dedicated to search.
- Even Google doesn't handle the 〃 case (Google: 〃 site:en.wikipedia.org). It drops the 〃 and gives roughly the same results as site:en.wikipedia.org alone (it actually gives a slightly lower results count—61.3M vs 61.5M—but the top 10 are identical and the top 1 doesn't contain 〃).
- Also, note that Google doesn't find every instance of 〆. The first result I get with an insource search on-wiki is Takeminakata, which has 〆 in the references. The Google results seem to be primarily instances of 〆 all by itself, though there are some others. (I'm not sure what the appropriate tokenization of 〆捕 is, for example, so it may get split up into 〆 and 捕; I just don't know.)
- I'm having some technical difficulties with my dev environment at the moment, so I can't check, but indexing 〆 by itself might be possible. It depends on whether it is eliminated by the tokenizer or by the normalization step. I think we could possibly prevent the normalization from normalizing tokens to nothing—which would probably apply to some other characters such as diacritics like ¨—but preventing the tokenizer from ignoring punctuation characters would be a different level of complexity. There are also questions of what such a hack would do to indexing speed and index sizes, so even if it is technically feasible, it might not be practically feasible. I'll try to look at it when my dev environment is back online. TJones (WMF) (talk) 17:48, 3 December 2018 (UTC)
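To make the trigram indexing described above concrete, here is a small standalone sketch that reproduces the listed trigrams for "some text". This only mirrors the idea; CirrusSearch does the real work inside Elasticsearch with an ngram tokenizer, not in PHP.

<?php
// Produce all overlapping 3-character sequences of a string (multibyte-safe).
function trigrams( string $text ): array {
	$chars = preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY );
	$grams = [];
	for ( $i = 0; $i + 3 <= count( $chars ); $i++ ) {
		$grams[] = implode( '', array_slice( $chars, $i, 3 ) );
	}
	return $grams;
}

// ["som", "ome", "me ", "e t", " te", "tex", "ext"]
print_r( trigrams( 'some text' ) );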
- >It sounds like the answer is yes, so I'll assume that's the case.
- In a perfect world, yes.
- > An alternative might be to do tokenization such that no characters are lost, but the text is still divided into "words" and other tokens. In such a scenario, text." would probably be indexed as "text", ".", and the quote character ("), and a search for text." […]
- Indeed, perfect is the enemy of good. It is acceptable to have a search that will always match full tokens separated by spaces. That's the suggested approach in the thread (https://discuss.elastic.co/t/how-to-index-special-characters-and-search-those-special-characters-in-elasticsearch/42506). It seems quite sensible to do so even for the general search. I mean, it is quite silly that the search engine is unable to search for something as simple as "c++". In such a case, one would expect it to match "c" AND "c++", and prioritize "c++".
- There are even more cases. For instance, many people (myself included) sometimes like to learn about Egyptian glyphs, and many of these convey meaning by themselves, yet searching for "☥" finds only one page, which is odd for something that can mean life. There are even weirder Egyptian symbols that I have no idea what they are called, and they tend to be hard to describe. Google finds millions across sites; for en.wikipedia it currently finds (google:"☥" site:en.wikipedia.org) about 500. It is a bit unfair to compare it to Google, because Google likely has sophisticated artificial intelligence algorithms that simply translate the "☥" to Ankh and also search using that. Interestingly, even Wikidata just drops the "☥".
- Anyway, there's no need to call it exact search; maybe it should just be called "tokensearch:" or something related to that, as long as it removes all the other unnecessary normalization. An alternative would be to enhance regex search to be able to work on the transcluded text (after the HTML is stripped). Unfortunately, the regex alternative is likely to be even more costly. 197.218.84.247 (talk) 21:29, 3 December 2018 (UTC)
- Sidenote:
- A pretty nifty side-effect of CirrusSearch's token stripping is that it even beats Google and Bing by showing some sensible results when someone searches for "〆okes". Google and Bing currently find nothing.
- Still, it would be more sensible to add a general note informing the user whenever a special character that may be silently dropped is searched for. 197.218.84.247 (talk) 22:11, 3 December 2018 (UTC)
- I'm hoping to think more about this and get to it tomorrow afternoon. I've got a few deadlines that need my attention, plus an opportunity to discuss it with others early tomorrow. Hope to be back in less than 24 hours!
- Edit: If you are free in about 18 hours, join us to discuss this. More info on this etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours TJones (WMF) (talk) 21:30, 4 December 2018 (UTC)
- Sorry for the delay getting back to you. This didn't come up in our discussion today, but I was able to get my dev environment working again (lesson learned: never install major OS updates if you want to be able to get any work done).
- I was able to test all three of ☥, 〃, and 〆 with the current English-language analysis chain. It's actually the tokenizer that removes them. Long ago this would have surprised me, but I've recently seen problems with other tokenizers, and I think a common tokenizer design pattern is to handle the characters you care about, ignore or break on everything else, and not really look too closely at the behavior on "foreign" characters—which causes problems in Wikipedias and Wiktionaries especially, since they are always full of "foreign" characters. Anyway, the standard Elasticsearch tokenizer doesn't seem to care about ☥, 〃, and 〆—it doesn't just drop them, it breaks on them (so x☥y is tokenized as "x" and "y").
- I set up a whitespace tokenizer–only analyzer, and it lets ☥, 〃, and 〆 pass through fine. However, it would not satisfy your C/C++ case: C++ would be tokenized as "C++" and would not match "C". And of course, our earlier examples of "text", "text,", "text.", and text" would all be indexed separately, as would C++. , "C++" , "C++ , and C++" , plus weird one-off tokens like "第31屆東京國際影展- (which does occur in English Wikipedia). (See the _analyze sketch after this reply.)
- So, while it is possible to use a whitespace tokenizer–only analyzer, I think the results would be counterintuitive to a lot of users, and I worry the required index for English Wikipedia would be huge. I'm not familiar with the super low-level implementation details of Elasticsearch, but adding extra occurrences of an existing token to an index generally uses less space than creating a new token, and there would be a lot of new tokens. We're already pushing the limits of our hardware (and are in the middle of re-architecting our search clusters to handle it better).
- To summarize: my best guess right now is that the results would disappoint lots of users (who wouldn't expect punctuation on a word to matter, or would want to find punctuation even when attached to a word)—though this is hard to test. I also think the index would be prohibitively large (especially for the number of users who would use such a feature)—the index size part is testable, but non-trivial, so I haven't done it; the number of users is unclear, but most special syntax and keywords are used quite infrequently overall, even if particular users use them very heavily.
- I'm sorry to disappoint—I'm always happy when on-wiki search does something better than Google!—but I don't think this is feasible given the likely cost/benefit ratio. Though if you want to open a Phabricator ticket, you can—and pointing back to this talk page would be helpful. I can't promise we'll be able to look at it in any more depth than I already have any time soon, though. TJones (WMF) (talk) 22:38, 5 December 2018 (UTC)
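The tokenizer behavior described in this reply can be reproduced against any local Elasticsearch 5.x instance with the _analyze API; a rough sketch (the analyze() helper is hypothetical, not part of CirrusSearch):

<?php
// Ask Elasticsearch to tokenize $text with the named tokenizer and return
// just the token strings.
function analyze( string $tokenizer, string $text ): array {
	$ctx = stream_context_create( [ 'http' => [
		'method' => 'POST',
		'header' => "Content-Type: application/json\r\n",
		'content' => json_encode( [ 'tokenizer' => $tokenizer, 'text' => $text ] ),
		'ignore_errors' => true,
	] ] );
	$res = file_get_contents( 'http://localhost:9200/_analyze', false, $ctx );
	return array_column( json_decode( $res, true )['tokens'] ?? [], 'token' );
}

// The standard tokenizer breaks on the rare characters: ["x", "y"]
print_r( analyze( 'standard', 'x☥y' ) );
// The whitespace tokenizer keeps them, punctuation and all:
// ["☥", "〃", "C++.", "\"C++\""]
print_r( analyze( 'whitespace', '☥ 〃 C++. "C++"' ) );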
- Hmm, too late. Hope it was a fruitful discussion...
- I do appreciate that it is a complicated problem that will likely not be addressed in the next 6 months, or might simply be deemed unfeasible. One could partially address it by doing what book authors do: create a glossary of 'important' tokens, and whenever search fails it could inform the user that "hey, the token you're searching for definitely exists, but search limits mean that it can't be displayed". 197.218.80.248 (talk) 22:40, 5 December 2018 (UTC)
- I replied shortly before your previous reply, so I missed the latest one. Anyway, your assessment seems pretty accurate, so there is probably little benefit to filing a task. Of course, other developers might have different ideas on how it could be implemented, or even the Elasticsearch developers might have some tricks up their sleeves to make it feasible. It is still something that would probably only benefit third parties who aren't bogged down by millions of documents.
- Personally, I'm a fan of simplicity, so if I were to code it, the emphasis would be on the differences rather than the similarities. While there are millions of documents with similar symbols, some tokens are just rare enough to make it useful. For instance, this discussion is currently probably one of the few places (if not the only one) in Wikimedia projects that actually has an "x☥y" string. It would also be enough to notify the user that X exists, rather than simply say "nothing was found", and that would in fact be quite trivial, even without Elasticsearch.
- To put it into perspective, English Wikipedia users (or bots) spend an extreme amount of time creating redirects for typos, for symbols, and for many other tokens. They probably learned to do this early on to address the limitations of the search engine. Other wikis aren't so lucky, so search there is probably considerably worse. My guess is that only places like Wiktionary, which by default contains so many synonyms, fare better. Considering that Wikidata sitelinks also contain a lot of aliases, they might also eventually be used to bridge the gap, if the issues of vandalism and potentially completely wrong information could be properly addressed.
- Anyway, thank you for your assessment, I certainly don't want to give you unnecessary work for something that is very likely to be unfeasible. The current regex search certainly addresses most use cases (except transcluded content).
197.218.84.1 (talk) 10:10, 6 December 2018 (UTC)
- Thanks for the discussion. It's an interesting problem, and some of the stuff we talked about here will definitely go into my future thoughts about evaluating and testing analyzers. TJones (WMF) (talk) 16:49, 6 December 2018 (UTC)
- I thought about this some more, and came up with the idea of a "rare character" index, which, in English, would ignore at least A-Z, a-z, 0-9, spaces, and most regular punctuation, but would index every instance of other characters. I talked it over with @DCausse (WMF), and he pointed out that it is not only possible, but would probably be much more manageable if the indexing were at the document level. (So you could search for documents containing both ☥ and 〆, but you could not specify a phrase like "☥ 〆" or "〆 ☥", or a single "word" like ☥☥ or our old friend x☥y.)
- I also think we could test this without a lot of development work by running offline simulations to calculate how big the index would be, and even build a test index on our search test servers, without writing any real code, by doing a poorly-implemented version with existing Elasticsearch features (a rough sketch follows below). More details are on the phab ticket I've opened to document all those ideas: T211824.
- If you have any ideas about specific use cases and how this would or would not help with them, reply here or on Phab!
- I can't promise we'll get to this any time soon, but at least it will be on our work board, mocking me, so I feel bad about not getting to it! 😁 TJones (WMF) (talk) 22:03, 12 December 2018 (UTC)
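A back-of-the-envelope version of the offline simulation mentioned above might look like the following. The "boring" character class here is an illustrative guess, not whatever T211824 ends up specifying.

<?php
// Document-level rare-character extraction: which rare characters does
// each page contain? Pages with only boring characters stay out entirely.
function rareChars( string $text ): array {
	// Treat ASCII letters, digits, whitespace, and common punctuation as boring.
	$stripped = preg_replace(
		'/[A-Za-z0-9\s.,;:!?\'"()\[\]{}<>\/\\\\|@#$%^&*_+=~`-]/u', '', $text );
	$chars = preg_split( '//u', $stripped, -1, PREG_SPLIT_NO_EMPTY );
	return array_values( array_unique( $chars ) );
}

$pages = [
	'Ankh' => 'The ankh ☥ is an ancient Egyptian symbol of life.',
	'Plain' => 'Nothing but boring characters here.',
];
foreach ( $pages as $title => $text ) {
	$rare = rareChars( $text );
	if ( $rare ) {
		echo "$title => " . implode( ' ', $rare ) . "\n"; // Ankh => ☥
	}
}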
- This seems like a reasonable outcome, and the idea is solid. For the unresolved questions:
- > Do we index the raw source of the document, or the version readers see?
- The raw source is already available using insource, so my suggestion is that this would only consider the reader's version.
- > Do we index just the text of the document, or also the auxiliary text and other transcluded text?
- Transcluded content seems like something that is definitely worthwhile. So perhaps all of the above if it is feasible.
- > It is possible (even desirable) that some documents would not be in this index because they have nothing but “boring” characters in them.
- Certainly desirable.
- >I can't promise we'll get to this any time soon, but at least it will be on our work board, mocking me, so I feel bad about not getting to it
- That's understandable. This would probably not be something used by the average user, but it would definitely make the search more complete, because it highlights the important difference between a generic search engine like Google and a specialized one that is used to identify encyclopedic/wiki content.
197.218.86.137 (talk) 19:53, 13 December 2018 (UTC)
- Another use-case might be counter-vandalism or small fixes. I seem to remember that when using VisualEditor on Linux, pasting something would often produce a "☁", e.g. like this article (https://en.wikipedia.org/w/index.php?title=Rabbit,_Run&oldid=863620024). Of course, emojis in articles are enough of a problem that there is an abuse filter blocking some of them (see Special:Tags, https://meta.wikimedia.org/wiki/Special:AbuseFilter/110).
- So it might be a good thing if it can act as a filter, e.g. "char:" would match all instances of special characters, and "-char:" would exclude them. This seems like a general feature that would help with a lot of things, for instance "-hascategory:" would be the equivalent of Special:UncategorizedPages , or "-linksto:" would be Special:DeadendPages, and so forth.
- Alternatively a separate keyword could be used if such a syntax seems odd, maybe "-matchkey:char", "matchkey:char", "-matchkey:category". 197.218.95.117 (talk) 11:10, 17 December 2018 (UTC)
- This might make the case for a generic emoji flag, maybe "char:emoji" that would match a smaller set of these things. A couple of funny related tasks:
- The abominable snowmen ☃ - https://phabricator.wikimedia.org/T59884 (https://fr.wikipedia.org/w/index.php?title=Lyc%C3%A9e_de_Saint-Just&curid=4133321&diff=154878524&oldid=154878318)
- Eerie clouds ☁ - https://phabricator.wikimedia.org/T126047
- Sunny days (☀) with umbrella(☂) - https://phabricator.wikimedia.org/T129310
- Real TV -> https://fr.wikipedia.org/w/index.php?title=Chace_Crawford&curid=2300505&diff=154721520&oldid=151044938
- 197.218.95.117 (talk) 11:43, 17 December 2018 (UTC)
- I think -char: would work. -insource: and the like already work, so that shouldn't be a problem. I'm not sure about category searches.
- I could see char:emoji being useful, but also really hard to implement. Here's an attempt at a general purpose emoji regex—that's pretty complicated! I can't find any widely defined Unicode regexes for emoji that are already built into Java or other programming languages. We could possibly look into it, though, if the time comes. I'll add it to the phab ticket. Thanks! TJones (WMF) (talk) 22:31, 17 December 2018 (UTC)
- > I think -char: would work. -insource: and the like already work, so that shouldn't be a problem. I'm not sure about category searches.
- You probably misunderstood. The negative operator does already work, but it doesn't work in instances where someone just wants to find all instances that exclude that keyword.
- For instance, if I want to find all articles that don't contain a link (e.g. like this; Monkey (slang) will be found as a false positive) or a category (e.g. https://en.wikipedia.org/w/index.php?search=monkey+-category%3A), it is downright impossible. Regex might get you close, but template transclusions can add extra links or categories or whatever. In fact, a completely empty page might still have links, as interwiki links can be added by Wikidata.
- Similarly, if one wants to search all articles that contain any "rare" character it will be impossible, just as it is right now.
- > I could see char:emoji being useful, but also really hard to implement. Here's an attempt at a general purpose emoji regex—that's pretty complicated! I can't find any widely defined Unicode regexes for emoji that are already built into Java or other programming languages.
char:😁|💃|💀
that would make it possible for users to define longer sequences without the awkward "char:x char:y" syntax. - For java there seems to be some ideas on how to deal with them:https://stackoverflow.com/a/32872406. 197.218.92.53 (talk) 11:16, 18 December 2018 (UTC)
- That's an interesting negative use of -insource. I'm not familiar with any syntax that allows you to search for a bare keyword or its negation, so I'm not really sure what you want it to mean. (In the search you linked to, it actually just omits articles with forms of the word insource: insourced, insourcing, etc.)
- I have foolishly started 4 more threads on this topic (on 3 village pumps and on Phabricator), but the idea of searching for multiple characters, character ranges, or Unicode blocks has come up elsewhere. There are issues of making the syntax consistent (a separate ongoing project is trying to revamp the search parser), determining whether a multi-character search is an implicit AND or OR, and being careful about search syntax that explodes into many individual searches on the back end. If we get far enough to actually implement a rare character index, we'll have to come back to the questions of specific syntax and the initial feature set supported.
TJones (WMF) (talk) 17:47, 18 December 2018 (UTC)
- Oh, it was copied incorrectly, the insource was meant to include a string: https://en.wikipedia.org/w/index.php?search=monkey+-insource%3A%2F%5C%5B%5C%5B%2F.
- Anyway, the point of the regex above was to find pages without links or categories containing the word monkey. In practice there are none; in theory that one match occurs because regex doesn't search transcluded content, and there are different ways to create a link. I'm not exactly sure of the correct terminology for those, but to put it into concrete words, or rather pseudo code (see the snippet below): in essence it discards all pages that contain any rare characters. Based on existing search keywords, the only way to find all pages with rare characters would be to spell them all out. Anyway, apparently there is one search keyword that works like that, "prefer-recent"; compare https://en.wikipedia.org/w/index.php?search=monkey+-prefer-recent%3A&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B0%5D%7D&ns0=1 vs https://en.wikipedia.org/w/index.php?search=monkey&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B0%5D%7D&ns0=1 .
var excludedfilter = "-char:";
var search_results = { "pageswithout_char": [1, 2], "pageswith_char": [3, 4] };
if excludedfilter == "-char:" then
    var pagesToSearch = search_results["pageswith_char"];
    return search("foo", pagesToSearch);
end
- While they look quite similar, the order is different, and the help page itself claims that prefer-recent can work without any specific parameters. Nonetheless, it is strange and error-prone syntax, so it seems more sensible to assign another keyword, or perhaps add a new URL parameter, maybe something like
?excludefilter=char|hascategory&includefilter=hastemplate.
- Generally, getting feedback from various places at least (in)validates the idea, and people are more comfy in their own wikis, so a single discussion here would probably not get much feedback even if links were posted. 197.218.92.53 (talk) 19:55, 18 December 2018 (UTC)
- Oops, the pseudo code should be more like this: 197.218.92.53 (talk) 19:58, 18 December 2018 (UTC)
var excludedfilter = "-char:";
var search_results = { "pageswithout_char": [1, 2], "pageswith_char": [3, 4] };
if excludedfilter == "-char:" then
    var pagesToSearch = search_results["pageswithout_char"];
    return search("foo", pagesToSearch);
end
- I get what you are saying now. Is this a theoretical exercise, or do you have a specific use case where finding all pages without any rare characters would be useful? I can't think of any. In the case of a page with no links, you could argue that almost every page should have some links, so those are pages that need improving. Same for categories. But what's the value of finding pages with no rare characters—other than maybe as a conjunct with a more expensive search to limit its scope? (Though, I'm not sure how limiting that would be, so it makes sense to check that out in initial investigation—I'll add it to the phab ticket.) TJones (WMF) (talk) 20:21, 18 December 2018 (UTC)
- > Is this a theoretical exercise, or do you have a specific use case where finding all pages without any rare characters would be useful?
- Well, excluding them is a theoretical exercise. However, including all pages with any rare character ("+char:") is a more useful query, especially if filtered by category. For the original use-case of this thread, if one wants to evaluate pages mentioning historical symbols, one way to find a subset of them would be to use something like that.
- One could also imagine that regular wiki editors would use such an index to add new symbols to their emoji abuse filter, or even track down (and clean up) vandalism that randomly uses multiple emojis. Cloudy or other unknown emojis could be identified this way.
- Right now the only way to find any of them is to deliberately search for them using regex or analyse the wiki dumps. 197.218.92.53 (talk) 21:49, 18 December 2018 (UTC)
- I've added the editing error and vandalism use cases for emoji search to the Phab ticket. TJones (WMF) (talk) 15:53, 19 December 2018 (UTC)
updateSearchIndexConfig.php unable to determine Elasticsearch version (MW 1.31 / ES 5.6)
I'm encountering the following error when running updateSearchIndexConfig.php on Elasticsearch 5.6:
C:\wwwroot\mediawiki-1.31.0\extensions\CirrusSearch\maintenance>php updateSearchIndexConfig.php
content index...
Fetching Elasticsearch version...unable to determine, aborting.
Phpinfo() shows cURL 7.59 as enabled. While I get the JSON response from Elasticsearch when I open http://localhost:9200/?pretty in the browser, I get a bad URL error when using curl, and I suspect this is contributing to the problem.
Browser response:
{
"name" : "xxxx",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "cdAqTAxVRJiEUCvQbWLX0g",
"version" : {
"number" : "5.6.0",
"build_hash" : "781a835",
"build_date" : "2017-09-07T03:09:58.087Z",
"build_snapshot" : false,
"lucene_version" : "6.6.0"
},
"tagline" : "You Know, for Search"
}
Curl response:
C:\Users\mfg_rmnguyen>curl 'http://localhost:9200'
curl: (3) URL using bad/illegal format or missing URL
Can anyone give me tips on how to overcome the above?
-Richard MadX (talk) 08:33, 3 December 2018 (UTC)
- I think that when you test using a Windows command prompt, you have to remove the single quotes around the URL, so that curl 'http://localhost:9200/' becomes curl http://localhost:9200.
. - As for the problem with the maintenance script, could you post the changes you've made to your LocalSettings.php? DCausse (WMF) (talk) 09:35, 3 December 2018 (UTC)
- Thanks for the response. When I removed the single quotes you suggested, it gave me an HTML response containing the message:
- "Network Error (dns_unresolved_hostname) Your requested host 'localhost' could not be resolved by DNS."
- I tried removing my proxy config and adding 127.0.0.1 / localhost to the Windows hosts file but no change. I'll continue looking at this piece.
- This is the LocalSettings.php CirrusSearch config:
wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
$wgDisableSearchUpdate = true;
$wgCirrusSearchServers = array( 'localhost' );
- I also tried setting $wgCirrusSearchServers to the hostname, without any improvement. MadX (talk) 17:41, 3 December 2018 (UTC)
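For anyone hitting this later, a minimal standalone check (hypothetical script, not part of CirrusSearch) that mimics the maintenance script's first step of fetching the Elasticsearch version, so the raw failure is visible outside MediaWiki:

<?php
// Use 127.0.0.1 rather than localhost; given the dns_unresolved_hostname
// error above, that may sidestep the proxy/DNS problem (just a guess).
$raw = file_get_contents( 'http://127.0.0.1:9200/' );
if ( $raw === false ) {
	die( "Could not reach Elasticsearch at all.\n" );
}
$info = json_decode( $raw, true );
echo 'Elasticsearch version: ' . ( $info['version']['number'] ?? 'unknown' ) . "\n";

If that prints a version, pointing $wgCirrusSearchServers at the same address (e.g. $wgCirrusSearchServers = [ '127.0.0.1' ];) might be worth a try.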
CirrusSearch for MW1.31 with ICU plugin support?
I had posted a question in the past for MW 1.23. I have installed Elastica, Elasticsearch, and CirrusSearch, and I am not sure what I need to do now in order to create an index which is diacritics-insensitive, even for polytonic Greek. The last info I had was to install the analysis-icu plugin and the Extra Queries and Filters plugin, but I am not sure which versions of these and how (and their compatibility with Elasticsearch 5.6.13, which I installed). Spiros71 (talk) 12:20, 6 December 2018 (UTC)
- MW 1.31 should support Elasticsearch 5.6.13 and ICU folding; you need to install the two plugins you mentioned.
- Elasticsearch plugin versions generally follow Elasticsearch versions. The analysis-icu plugin, being maintained by Elastic itself, is always up to date. The extra plugin, being maintained by the WMF, is not guaranteed to be available for every Elasticsearch version. I've just released version 5.6.13, which should be compatible with the version of Elasticsearch you plan to use.
- So, assuming that Manual:$wgLanguageCode is set to el on this wiki, installing the analysis-icu and extra plugins should enable ICU folding everywhere (completion search and fulltext search). Note that a reindex is required.
- If the language code is not set to el, you can force-enable ICU folding by setting $wgCirrusSearchUseIcuFolding = 'yes';. DCausse (WMF) (talk) 16:54, 6 December 2018 (UTC)
- Thank you so much for the prompt reply. Both installed successfully. Spiros71 (talk) 21:04, 6 December 2018 (UTC)
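Putting this reply into a LocalSettings.php sketch (assuming MW 1.31 with the Elastica extension, and Elasticsearch 5.6.x with the analysis-icu and extra plugins already installed; a reindex is still required afterwards):

wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
$wgSearchType = 'CirrusSearch';
$wgLanguageCode = 'el'; // enables ICU folding implicitly, or...
$wgCirrusSearchUseIcuFolding = 'yes'; // ...force it for other language codes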
- Just in case this proves helpful to someone else: indexing would stop half-way, urging me to use $wgShowExceptionDetails = true; for debug info. After doing that and indexing again, this came up:
MWUnknownContentModelException from line 306 of public_html/includes/content/ContentHandler.php: The content model 'Scribunto' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
Backtrace:
#0 public_html/includes/content/ContentHandler.php(243): ContentHandler::getForModelID(string)
#1 public_html/includes/Title.php(4984): ContentHandler::getForTitle(Title)
#2 public_html/includes/parser/Parser.php(892): Title->getPageLanguage()
#3 public_html/includes/parser/Parser.php(2126): Parser->getTargetLanguage()
#4 public_html/includes/parser/Parser.php(2091): Parser->replaceInternalLinks2(string)
#5 public_html/includes/parser/Parser.php(1318): Parser->replaceInternalLinks(string)
#6 public_html/includes/parser/Parser.php(443): Parser->internalParse(string)
#7 public_html/includes/content/WikitextContent.php(323): Parser->parse(string, Title, ParserOptions, boolean, boolean, integer)
#8 public_html/includes/content/AbstractContent.php(516): WikitextContent->fillParserOutput(Title, integer, ParserOptions, boolean, ParserOutput)
#9 public_html/includes/content/ContentHandler.php(1324): AbstractContent->getParserOutput(Title, integer, ParserOptions)
#10 public_html/extensions/CirrusSearch/includes/Updater.php(363): ContentHandler->getParserOutputForIndexing(WikiPage, ParserCache)
#11 public_html/extensions/CirrusSearch/includes/Updater.php(204): CirrusSearch\Updater->buildDocumentsForPages(array, integer)
#12 public_html/extensions/CirrusSearch/maintenance/forceSearchIndex.php(218): CirrusSearch\Updater->updatePages(array, integer)
#13 public_html/maintenance/doMaintenance.php(94): CirrusSearch\ForceSearchIndex->execute()
#14 public_html/extensions/CirrusSearch/maintenance/forceSearchI
- This was resolved with:
UPDATE page SET page_content_model = 'wikitext' WHERE page_content_model = 'Scribunto'
- And after that reindexing was successful. Spiros71 (talk) 11:10, 8 December 2018 (UTC)
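If the affected pages really are Scribunto modules rather than mislabeled wikitext, an alternative (untested here) is to register the missing content model instead of rewriting it:

// In LocalSettings.php: load the extension that provides the 'Scribunto'
// content model instead of rewriting page_content_model in the database.
wfLoadExtension( 'Scribunto' ); // or the legacy require_once for older releases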
Failed connect to localhost:9200; Connection refused
Whilst it was working perfectly, today I saw that error (when running curl localhost:9200). Checking the logs, it seems that there was an automatic update to ES 5.6.14 and the extra plugin is not compatible?
[2018-12-12T00:26:35,535][INFO ][o.e.n.Node ] [1eBj8M8] stopping ...
[2018-12-12T00:26:35,554][INFO ][o.e.n.Node ] [1eBj8M8] stopped
[2018-12-12T00:26:35,554][INFO ][o.e.n.Node ] [1eBj8M8] closing ...
[2018-12-12T00:26:35,558][INFO ][o.e.n.Node ] [1eBj8M8] closed
[2018-12-12T00:26:36,273][ERROR][o.e.b.Bootstrap ] Exception
java.lang.IllegalArgumentException: plugin [extra] is incompatible with version [5.6.14]; was designed for version [5.6.13]
at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:146) ~[elasticsearch-5.6.14.jar:5.6.14]
Spiros71 (talk) 07:09, 12 December 2018 (UTC)
- Elasticsearch plugins must be upgraded every time you upgrade Elasticsearch.
- I'll push a 5.6.14 release of the extra plugin so that you can upgrade your installation. DCausse (WMF) (talk) 11:24, 12 December 2018 (UTC)
- Right, thank you! So I stop Elasticsearch, remove the old plugin, run bin/elasticsearch-plugin install org.wikimedia.search:extra:5.6.14, and then restart Elasticsearch? I am guessing I have to wait until you tell me it has been released.
- By the way, do we know which Elasticsearch version will be compatible with MW 1.32/1.33? Spiros71 (talk) 12:24, 12 December 2018 (UTC)
- I got distracted yesterday and forgot to send the 5.6.14 release; it should be available in a few hours. Yes, just running bin/elasticsearch-plugin install org.wikimedia.search:extra:5.6.14 should upgrade it, if I recall correctly.
- As for the MW/Elasticsearch compatibility matrix, we try to maintain this information in Extension:CirrusSearch (I've just updated it with the MW 1.32 information).
- MW 1.33 is likely to require Elasticsearch 6.x. DCausse (WMF) (talk) 08:33, 13 December 2018 (UTC)
- Thank you, David :) Installed just fine. curl localhost:9200 is returning results, but search is not kicking in.
- In the log I see:
[2018-12-13T13:43:37,516][INFO ][o.e.p.PluginsService ] [1eBj8M8] loaded plugin [analysis-icu]
[2018-12-13T13:43:37,516][INFO ][o.e.p.PluginsService ] [1eBj8M8] loaded plugin [extra]
[2018-12-13T13:43:38,330][INFO ][o.e.d.DiscoveryModule ] [1eBj8M8] using discovery type [zen]
[2018-12-13T13:43:38,658][INFO ][o.e.n.Node ] initialized
[2018-12-13T13:43:38,658][INFO ][o.e.n.Node ] [1eBj8M8] starting ...
[2018-12-13T13:43:38,737][INFO ][o.e.t.TransportService ] [1eBj8M8] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-12-13T13:43:41,773][INFO ][o.e.c.s.ClusterService ] [1eBj8M8] new_master {1eBj8M8}{1eBj8M8-TjCtNBEGctBh3A}{JV2zKRWlR_ia-4czzUe4Tg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2018-12-13T13:43:41,784][INFO ][o.e.h.n.Netty4HttpServerTransport] [1eBj8M8] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-12-13T13:43:41,785][INFO ][o.e.n.Node ] [1eBj8M8] started
[2018-12-13T13:43:42,064][INFO ][o.e.g.GatewayService ] [1eBj8M8] recovered [3] indices into cluster_state
[2018-12-13T13:43:42,157][INFO ][o.e.c.r.a.AllocationService] [1eBj8M8] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[mw_cirrus_metastore_first][0]] ...]).
- In the deprecation log I see:
[2018-12-13T13:43:42,074][WARN ][o.e.d.i.m.TypeParsers ] field [include_in_all] is deprecated, as [_all] is deprecated, and will be disallowed in 6.0, use [copy_to] instead.
Spiros71 (talk) 11:56, 13 December 2018 (UTC)
- If your setup was returning results while running Elasticsearch 5.6.13, I see no reason it could not while running 5.6.14...
- Could you double check that the indices are populated correctly? (Using curl localhost:9200/_cat/indices?v should give you the number of docs: docs.count - docs.deleted.) DCausse (WMF) (talk) 13:53, 13 December 2018 (UTC)
- One never knows... even a (minor) new version might have introduced some sort of incompatibility...
- I reindexed (2 hours for 600,000 pages); still the same. When checking the data in /var/lib/elasticsearch I can see very small files of a few bytes each.
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open wiki_1_31_0_content_first KOLZXJrpTle_kDQOuS-sYQ 4 0 0 0 648b 648b
green open mw_cirrus_metastore_first KWF6MOHuSQOd-d4QCp9zbg 1 0 3 6 11.2kb 11.2kb
green open wiki_1_31_0_general_first NHAdshY6Sc-wa5s590gPHw 4 0 33 0 13.2kb 13.2kb
- As for LocalSettings.php, no change:
wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
$wgCirrusSearchUseIcuFolding = 'yes';
$wgSearchType = 'CirrusSearch';
Spiros71 (talk) 14:25, 13 December 2018 (UTC)
- The indices are empty.
- Anything in the MediaWiki logs?
- What did you run to populate the index, and do you still have the output of the command? DCausse (WMF) (talk) 15:53, 13 December 2018 (UTC)
- I ran these (as I had done in the past without any issues):
php updateSearchIndexConfig.php --startOver
php forceSearchIndex.php
- After running the second one there was the count of indexed pages, and then it completed without error.
- Not sure where to find the MW logs. Nothing recent in error.log in the installation directory. Spiros71 (talk) 16:02, 13 December 2018 (UTC)
- Finally, I downgraded to Elasticsearch 5.6.13 and the index was created. However, although I used $wgCirrusSearchUseIcuFolding = 'yes';, the diacritics-insensitive search has issues. For example, entering:
- ανθρωπος will not show ἄνθρωπος (nor άνθρωπος)
- anthropos will not show ánthrōpos (which is a redirect; I don't know if it gets penalized for being a redirect, or if diacritics stripping simply does not apply to extended Latin in the suggester, or if it simply has not been indexed successfully; I would guess the last, since even entering most of the word, i.e. ánthrōpo, will not bring it up as a suggestion, though it will bring up other words with the same letters but without diacritics, like anthropographos).
- On the contrary, entering ανθρωπος in en.wiktionary.org will show both ἄνθρωπος and άνθρωπος. Spiros71 (talk) 22:35, 13 December 2018 (UTC)
- One more reindexing from scratch resolved this. For some reason it created slightly bigger indexes. Spiros71 (talk) 19:59, 14 December 2018 (UTC)
Suggestion: When a query matches a redirect, search for both the page and its redirect
Issue: There are some searches which will return only a redirect, yet this redirect matches even more pages.
Proposed solution: Check whether the token is a redirect, and search for both:
- A user searches for a token, e.g. "☂"
- The search engine verifies if it matches (e.g exact match) a redirect (https://en.wikipedia.org/w/api.php?action=query&format=json&titles=%E2%98%82&redirects=1&converttitles=1), e.g. "umbrella"
- Then search for both, e.g. "☂" OR "umbrella"
Example:
- Compare https://en.wikipedia.org/w/index.php?search="☂" vs https://en.wikipedia.org/w/index.php?search=%22%E2%98%82%22+OR+%22umbrella%22&title=Special%3ASearch&go=Go
- Compare https://en.wikipedia.org/w/index.php?search=automobile vs https://en.wikipedia.org/w/index.php?search=automobile+OR+Car
- https://en.wikipedia.org/w/index.php?search=☥ vs https://en.wikipedia.org/w/index.php?search=☥+OR+Ankh
Note: It might be useful to give a user the option to only search for the token in case they deliberately want less results. Alternatively, this may trigger only when the search term doesn't contain a quote, e.g. "☂" vs ☂ . This will undoubtedly improve search results in many cases, including the ones discussed in Extension talk:CirrusSearch/2018#h-Suggestion:_Provide_a_plain_search_(no_analysis)-2018-11-28T16:02:00.000Z . 197.218.95.117 (talk) 14:45, 17 December 2018 (UTC)
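A client-side sketch of the three steps above, using the api.php endpoint from the example link (the helper name is invented, and a real implementation would live inside CirrusSearch rather than call the public API):

<?php
// Step 2: ask the API whether the search term is a redirect.
function resolveRedirect( string $term ): ?string {
	$url = 'https://en.wikipedia.org/w/api.php?action=query&format=json'
		. '&redirects=1&converttitles=1&titles=' . urlencode( $term );
	$data = json_decode( file_get_contents( $url ), true );
	return $data['query']['redirects'][0]['to'] ?? null;
}

$term = '☂';
$target = resolveRedirect( $term ); // "Umbrella" on English Wikipedia
// Step 3: search for both, e.g. "☂" OR "umbrella".
$query = $target === null ? "\"$term\"" : "\"$term\" OR \"$target\"";
echo $query . "\n";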
- My understanding is that this is a form of pseudo-relevance feedback. Pseudo-relevance feedback, more generally, uses the top-n results of an initial query to perform query expansion. This is something that might be investigated at some point, but it is a significant undertaking to do well. I've created https://phabricator.wikimedia.org/T215371 to potentially investigate. EBernhardson (WMF) (talk) 00:58, 6 February 2019 (UTC)