Extension talk:CirrusSearch
- Discussion related to the CirrusSearch MediaWiki extension.
- See also the open tasks for CirrusSearch on phabricator.
How to search for ASCII translated Umlaut handling in URLs in source code using quotes?
I have template source code using external URLs. Some contain ASCII-translated Umlauts like fl%C3%BCgel for flügel.
When searching via API using insource and quotes it doesn't find it.
insource:"/path/fl%C3%BCgel"
It only finds it without quotes insource:/path/fl%C3%BCgel
92.50.65.235 16:59, 28 January 2025 (UTC)
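For reference, a minimal Python sketch of how the encoded form in the template relates to the original word, and of what the quoted search looks like once it is itself URL-encoded for an API request. The parameters `action=query`, `list=search` and `srsearch` belong to the standard MediaWiki search API; the helper name `build_search_query` is made up for illustration. Whether the handling of the literal percent sign is what makes the quoted search fail is only an assumption and is not confirmed in this thread.

```python
from urllib.parse import quote, unquote, urlencode

# "flügel" percent-encoded as UTF-8 bytes, as it appears in the template source.
encoded = quote("flügel")
decoded = unquote("fl%C3%BCgel")


def build_search_query(term: str) -> str:
    """Build the query string for a search API request (hypothetical helper).

    urlencode() escapes the literal '%' in the term as '%25', so the server
    receives the characters '%C3%BC' rather than a decoded 'ü'.
    """
    params = {"action": "query", "list": "search", "srsearch": term, "format": "json"}
    return urlencode(params)


quoted_query = build_search_query('insource:"/path/fl%C3%BCgel"')
print(encoded, decoded)
print(quoted_query)
```

If the `%` is not escaped as `%25` when the request URL is built, the server may receive `flügel` instead of the literal `fl%C3%BCgel` stored in the wikitext, which is one thing worth ruling out.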
Elasticsearch
Looking on the Elasticsearch website and having trouble. Elastic Cloud costs a fortune and Self-managed Elasticsearch options are "not suitable for production use". What gives? 81.151.8.175 15:48, 6 April 2025 (UTC)
Getting CirrusSearch to work for Chinese
Hi all, I have a site that runs on 1.43, and most of the content is in Simplified and Traditional Chinese. I am trying to get CirrusSearch + Elasticsearch to work, but am still struggling.
Basically I want to make it work just like https://zh.wikipedia.org/ (I am not sure whether it uses the same CirrusSearch + Elasticsearch approach). The primary issue I currently have is this: if I search for "方济各", I want the results to show only pages that contain the phrase "方济各" (as when you search for the same on Wikipedia: https://zh.wikipedia.org/w/index.php?search=%E6%96%B9%E6%B5%8E%E5%90%84&title=Special%3A%E6%90%9C%E7%B4%A2&profile=advanced&fulltext=1&ns0=1&searchToken=68qlm8r96e8225klwzw6fkjyd), not pages that contain "方" or "济" or "各" (which is what it currently does). Here is my current test instance if you want to try it: http://44.199.64.14/w/index.php?search=%E6%96%B9%E6%B5%8E%E5%90%84&title=Special%3A%E6%90%9C%E7%B4%A2&wprov=acrw1_-1&ns0=1&ns1=1&ns14=1&ns4100=1&ns4200=1
Here are what I have done so far:
Installed Elasticsearch 7.10.2
Installed the extensions CirrusSearch and Elastica
Installed Elasticsearch plugin: analysis-ik(IK Analyzer), and analysis-stconvert
I have also modified the /extensions/CirrusSearch/includes/Maintenance/AnalysisConfigBuilder.php file to include the reference to IK analyzer.
I also just read the page User:TJones (WMF)/Notes/Chinese Analyzer Analysis, which says "The short version is that SmartCN+STConvert did the best on all the corpora". Does this mean that the out-of-the-box setup (SmartCN+STConvert) should work fine for Chinese, and that I don't need the IK analyzer plugin? If SmartCN+STConvert is enough, what am I missing to make it work like https://zh.wikipedia.org/w/index.php?search=%E6%96%B9%E6%B5%8E%E5%90%84&title=Special%3A%E6%90%9C%E7%B4%A2&profile=advanced&fulltext=1&ns0=1&searchToken=68qlm8r96e8225klwzw6fkjyd?
Thank you everyone! Paulxu20 (talk) 03:12, 16 May 2025 (UTC)
- Just some updates after discussion with @DCausse (WMF) in IRC:
- It looks like, just to get it working like zh.wikipedia.org, I don't need the IK analyzer, so I am now reverting my code changes to AnalysisConfigBuilder.php back to the OOB version.
- I do have a question now: do I need to install the "analysis-icu" plugin to make Chinese search work well? It looks like it, but I want to double-check. Paulxu20 (talk) 20:21, 16 May 2025 (UTC)
- Hi, analysis-icu is a useful plugin and it's generally a good idea to install it but whether or not it is useful for Chinese? I would defer this to @TJones (WMF) to answer. DCausse (WMF) (talk) 09:28, 19 May 2025 (UTC)
- Thank you @DCausse (WMF) @TJones (WMF), I do have analysis-icu installed along with a few other plugins, here is the list of all plugins currently installed:
name            component           version
ip-172-26-3-55  analysis-icu        7.10.2
ip-172-26-3-55  analysis-ik         7.10.2
ip-172-26-3-55  analysis-smartcn    7.10.2
ip-172-26-3-55  analysis-stconvert  7.10.2
ip-172-26-3-55  extra               7.10.2-wmf12
- I have put everything I did on this page: http://44.199.64.14/wiki/CirrusSearch_Test, including the plugins I installed, the LocalSettings.php, the commands I ran, etc. After a week of trying and digging I think I am making some progress. For example, when searching for "方济各" it now finds the page (http://44.199.64.14/wiki/TestPage2) which contains the whole phrase and shows it at the top of the search results. However, the search results still show other pages that have only "方", or "济", or "各". Not sure what I am missing.
- Really appreciate all your help! Paulxu20 (talk) 13:03, 19 May 2025 (UTC)
- Hello @DCausse (WMF) @TJones (WMF)
- Just to follow up after more digging today: I realized that zh.wikipedia.org's search is also not working well. It is just that there are many pages that contain "方济各" and they are ranked high, so they are listed at the top of the search results. When I check further down, e.g. 1500 results later (https://zh.wikipedia.org/w/index.php?limit=500&offset=1500&profile=default&search=%E6%96%B9%E6%B5%8E%E5%90%84&title=Special:%E6%90%9C%E7%B4%A2&ns0=1), it also shows pages that have nothing to do with the phrase "方济各". For example, the page "https://zh.wikipedia.org/zh-cn/%E7%BE%8E%E6%B5%8E%E7%A4%81" is in the search results; it is an island name and has absolutely nothing to do with "方济各" (the name of the Pope who just passed away).
- What should happen, is that when searching for "方济各", it should be the combined results of below two queries (with duplicates removed):
- 1- https://zh.wikipedia.org/w/index.php?search=%22%E6%96%B9%E6%B5%8E%E5%90%84%22&title=Special%3A%E6%90%9C%E7%B4%A2&profile=advanced&fulltext=1&advancedSearch-current=%7B%22fields%22%3A%7B%22phrase%22%3A%22%5C%22%E6%96%B9%E6%B5%8E%E5%90%84%5C%22%22%7D%7D&ns0=1
- 2- https://zh.wikipedia.org/w/index.php?search=%22%E6%96%B9%E6%BF%9F%E5%90%84%22&title=Special%3A%E6%90%9C%E7%B4%A2&profile=advanced&fulltext=1&advancedSearch-current=%7B%22fields%22%3A%7B%22phrase%22%3A%22%5C%22%E6%96%B9%E6%BF%9F%E5%90%84%5C%22%22%7D%7D&ns0=1
- The first one is the query when doing exact match for simplified Chinese "方济各", and the second one is the query to exact match for traditional Chinese "方濟各", which is the same phrase "方济各" but just traditional Chinese.
- So there are two problems here:
- 1) it should only show pages that contain the exact phrase "方济各"
- 2) it should show results for both simplified Chinese and traditional Chinese, irrespective of which form the original keyword is in (simplified or traditional)
- Let me know if this makes sense. Paulxu20 (talk) 20:28, 19 May 2025 (UTC)
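The two requirements above can be illustrated with a small Python sketch. It is only a toy model and not how CirrusSearch or Elasticsearch work internally: the one-entry conversion table, the page names and the page texts are all made up for illustration, whereas a real setup relies on an analyzer chain such as SmartCN plus STConvert.

```python
# Hypothetical one-entry conversion table (traditional -> simplified).
TRADITIONAL_TO_SIMPLIFIED = {"濟": "济"}


def normalize(text: str) -> str:
    """Fold traditional characters to simplified ones, character by character."""
    return "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in text)


def matches_any_character(query: str, page_text: str) -> bool:
    """The unwanted behaviour: a page matches if it has ANY character of the query."""
    return any(ch in page_text for ch in query)


def matches_phrase(query: str, page_text: str) -> bool:
    """The wanted behaviour: the whole phrase must occur, in either script."""
    return normalize(query) in normalize(page_text)


# Made-up sample pages.
pages = {
    "PageA": "教宗方济各",  # simplified phrase
    "PageB": "教宗方濟各",  # traditional phrase
    "PageC": "美济礁",      # shares only the character 济
}

loose = [title for title, text in pages.items() if matches_any_character("方济各", text)]
strict = [title for title, text in pages.items() if matches_phrase("方济各", text)]
strict_traditional = [title for title, text in pages.items() if matches_phrase("方濟各", text)]
print(loose)
print(strict)
print(strict_traditional)
```

Normalizing both the query and the page text to one script means a single phrase check covers both requirements, so there is no need to merge two result lists and remove duplicates afterwards.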
- Hi,
- With @DCausse (WMF)'s help (Big thank you!), I was able to get it working for Chinese, below are the details just in case anyone needs it:
- === Install related Mediawiki extensions: Elastica and CirrusSearch ===
- LocalSettings.php:
wfLoadExtension( 'Elastica' );
wfLoadExtension( 'CirrusSearch' );
$wgSearchType = 'CirrusSearch';
$wgCirrusSearchServers = [ 'localhost' ];
$wgCirrusSearchIndexBaseName = 'mediawiki';
// Note: this value is overwritten by the array assignment at the end of this block
$wgCirrusSearchWikimediaExtraPlugin = true;
$wgCirrusSearchLanguage = 'zh';
$wgCirrusSearchUseIcuFolding = true;
$wgCirrusSearchUseIcuTokenizer = true;
$wgCirrusSearchEnableRegex = true;
// Strongly boost exact phrase
$wgCirrusSearchPhraseRescoreBoost = 500.0;
// Rescore more top candidates
$wgCirrusSearchPhraseRescoreWindowSize = 20;
// Only match strict phrases
$wgCirrusSearchPhraseSlop = [ 'precise' => 1, 'default' => 0, 'boost' => 0 ];
$wgCirrusSearchWikimediaExtraPlugin = [ 'token_count_router' => true ];
- ==== Install analysis-icu ====
- sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
- ==== Install analysis-smartcn ====
- sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-smartcn
- ==== Install analysis-stconvert ====
- sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install https://release.infinilabs.com/analysis-stconvert/stable/elasticsearch-analysis-stconvert-7.10.2.zip
- ==== Install extra ====
- sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install https://repo1.maven.org/maven2/org/wikimedia/search/extra/7.10.2-wmf12/extra-7.10.2-wmf12.zip
- ==== Then restart elasticsearch ====
- sudo systemctl restart elasticsearch
- ==== Verify plugins are installed successfully ====
- curl -X GET "localhost:9200/_cat/plugins?v"
name            component           version
ip-172-26-3-55  analysis-icu        7.10.2
ip-172-26-3-55  analysis-ik         7.10.2
ip-172-26-3-55  analysis-smartcn    7.10.2
ip-172-26-3-55  analysis-stconvert  7.10.2
ip-172-26-3-55  extra               7.10.2-wmf12
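The verification step can also be scripted. Below is a minimal Python sketch that parses output in the `_cat/plugins?v` format shown above and reports which of the plugins installed in this walkthrough are missing. The function names and the `REQUIRED_PLUGINS` set are assumptions for illustration; the sample text is the output quoted above.

```python
# Plugins installed in the steps above (assumed to be the required set).
REQUIRED_PLUGINS = {"analysis-icu", "analysis-smartcn", "analysis-stconvert", "extra"}

SAMPLE_OUTPUT = """\
name            component           version
ip-172-26-3-55  analysis-icu        7.10.2
ip-172-26-3-55  analysis-ik         7.10.2
ip-172-26-3-55  analysis-smartcn    7.10.2
ip-172-26-3-55  analysis-stconvert  7.10.2
ip-172-26-3-55  extra               7.10.2-wmf12
"""


def installed_plugins(cat_output: str) -> dict:
    """Map plugin name -> version from whitespace-separated _cat/plugins output."""
    plugins = {}
    for line in cat_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 3:
            _node, component, version = parts[:3]
            plugins[component] = version
    return plugins


def missing_plugins(cat_output: str) -> list:
    """Return the required plugins that do not appear in the output, sorted."""
    return sorted(REQUIRED_PLUGINS - set(installed_plugins(cat_output)))


print(installed_plugins(SAMPLE_OUTPUT))
print(missing_plugins(SAMPLE_OUTPUT))
```

On a cluster with several nodes each plugin appears once per node, so this sketch only tells you that at least one node has it.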
- ==== List all current indexes ====
- curl -s "localhost:9200/_cat/indices?v" | less -S
- ==== Recreate index ====
- Updates the configuration of the Elasticsearch index used by MediaWiki search, and rebuilds (reindexes) the index with a new configuration. Note that this does NOT populate (reindex) wiki content into the index. It just creates the “container” and defines how data should be indexed. It is ForceSearchIndex.php that populates (or repopulates) the Elasticsearch index with content from your wiki, essentially "reindexing" the content pages.
- php UpdateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
- or:
- php UpdateSearchIndexConfig.php --startOver[1]
You should run this after:
- Changing your analyzer settings in the CirrusSearch config (like adding a custom IK analyzer).
- Changing the way search behaves (e.g., adding language support or synonyms).
- Installing plugins like stconvert and updating the CirrusSearch config to use them.
- After the command is done, check indexes again, should see the indexes are created successfully:
- curl -s "localhost:9200/_cat/indices?v" | less -S
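The distinction between creating the index "container" and populating it can be sketched as an ordered list of commands. This is a minimal Python sketch: the function names and the MediaWiki root path are hypothetical, and the ForceSearchIndex.php flags are the ones used in another thread on this page rather than something confirmed for every setup. By default it only prints the commands.

```python
import subprocess
from pathlib import PurePosixPath


def reindex_commands(mediawiki_root: str) -> list:
    """Return the maintenance commands in the order they need to run."""
    maintenance = PurePosixPath(mediawiki_root) / "extensions" / "CirrusSearch" / "maintenance"
    return [
        # 1. Create the index "container" (settings and mappings), no content yet.
        ["php", str(maintenance / "UpdateSearchIndexConfig.php"), "--startOver"],
        # 2. Populate the index with page content.
        ["php", str(maintenance / "ForceSearchIndex.php"), "--skipLinks", "--indexOnSkip"],
        # 3. Second pass for link information.
        ["php", str(maintenance / "ForceSearchIndex.php"), "--skipParse"],
    ]


def run_reindex(mediawiki_root: str, dry_run: bool = True) -> list:
    """Print the commands, or run them one after another when dry_run is False."""
    commands = reindex_commands(mediawiki_root)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True, cwd=mediawiki_root)
    return commands


# "/var/www/mediawiki" is a placeholder path.
planned = run_reindex("/var/www/mediawiki")
```

Using `check=True` stops the sequence at the first failing script, which avoids populating an index whose configuration step did not finish.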
- === Other Resources ===
- Elasticsearch plugins can be installed: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/b39cf71d8c9d8c0c0a9326eedeabbc5003f4ee60/debian/plugin_urls.lst
- Wikipedia.org cirrus-settings-dump: https://zh.wikipedia.org/w/api.php?action=cirrus-settings-dump
- Wikipedia.org cirrus-mapping-dump: https://zh.wikipedia.org/w/api.php?action=cirrus-mapping-dump
- Wikipedia.org cirrus-config-dump: https://zh.wikipedia.org/w/api.php?action=cirrus-config-dump
- Mediawiki all settings: https://noc.wikimedia.org/wiki.php?wiki=zhwiki&format=json
- Paulxu20 (talk) 11:27, 25 May 2025 (UTC)
Autocomplete not updating with new page titles
When I create a new page, I do not see it when I start typing in the search bar (even after hours). The job queue is empty (cron job set up every 3 mins).
Using
wfLoadExtension( 'Elastica' );
wfLoadExtension( 'CirrusSearch' );
$wgCirrusSearchUseIcuFolding = 'yes';
$wgSearchType = 'CirrusSearch';
MediaWiki | 1.39.12 |
PHP | 8.3.20 (fpm-fcgi) |
MariaDB | 10.6.21-MariaDB |
ICU | 67.1 |
Lua | 5.1.5 |
Elasticsearch | 7.10.2 |
Spiros71 (talk) 12:17, 18 May 2025 (UTC)
- @Spiros71 A few causes to explore:
- You are using the completion suggester ($wgCirrusSearchUseCompletionSuggester). This index, which is specialized for autocompletion, is not updated in real time but via the maintenance script UpdateSuggesterIndex.php.
- You may have some issues writing to Elasticsearch. Is this page findable using Special:Search? If yes, I'm not sure what could have happened and it might require more investigation. If no, you might get some insights from the MediaWiki logs indicating why the page failed to get indexed.
- A few techniques to help debugging:
- See how the page is indexed: append ?action=cirrusDump to the page URL
- See the query sent to Elasticsearch: append &cirrusDumpQuery=yes
- DCausse (WMF) (talk) 09:41, 19 May 2025 (UTC)
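The two debugging URLs mentioned above can be built with a small helper. This is a minimal Python sketch; the host `wiki.example.org`, the page name and the helper name `add_query_param` are placeholders, while the parameters `action=cirrusDump` and `cirrusDumpQuery=yes` are the ones named in the reply.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_param(url: str, key: str, value: str) -> str:
    """Append key=value to a URL, using '?' or '&' as appropriate."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# How is the page indexed?
dump_url = add_query_param("https://wiki.example.org/wiki/TestPage", "action", "cirrusDump")

# Which query is sent to Elasticsearch for a given search?
query_url = add_query_param(
    "https://wiki.example.org/w/index.php?search=test&fulltext=1",
    "cirrusDumpQuery",
    "yes",
)
print(dump_url)
print(query_url)
```

The helper re-encodes the existing query string, so unusual characters in the original URL may come out encoded slightly differently even though they decode to the same values.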
- Thank you David,
- There was no $wgCirrusSearchUseCompletionSuggester = 'yes'; in LocalSettings.php (despite that, it did work without issues for quite some time). I could not find any instruction in the extension readme about the necessity of the above (nor that it would be a good idea to run it on a cron job as well). I ended up doing a full reindexing:
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --startOver
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse
php extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php --recreate
php maintenance/runJobs.php --memory-limit=max
- I also ran the below (as I had yellow status and 1 unassigned shard):
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
# …and make it the default for any new index CirrusSearch creates
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{ "persistent": { "index.number_of_replicas": 0 } }'
- Spiros71 (talk) 10:45, 19 May 2025 (UTC)
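As background for the yellow status mentioned above: on a single-node cluster a replica shard can never be allocated, because Elasticsearch does not place a replica on the same node as its primary. A minimal Python sketch of that check against a `_cluster/health` response follows. The sample response is made up for illustration (only the field names are real), and the function name is hypothetical.

```python
import json

# Made-up example of a GET _cluster/health response, trimmed to the relevant fields.
SAMPLE_HEALTH = json.loads("""
{
  "status": "yellow",
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 3,
  "unassigned_shards": 1
}
""")


def yellow_due_to_replicas(health: dict) -> bool:
    """True when a single data node has unassigned shards and status yellow.

    In that situation the unassigned shards are most likely replicas, and
    setting number_of_replicas to 0 is a common way to get back to green.
    """
    return (
        health.get("status") == "yellow"
        and health.get("number_of_data_nodes") == 1
        and health.get("unassigned_shards", 0) > 0
    )


# Request body used with PUT <index>/_settings to drop replicas.
settings_body = json.dumps({"index": {"number_of_replicas": 0}})
print(yellow_due_to_replicas(SAMPLE_HEALTH), settings_body)
```

This is a heuristic only; the cluster allocation explain API gives the actual reason a shard is unassigned.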
- The completion suggester is optional and completion should have worked even without having it enabled, sorry if my comment made it sound like it was required.
- Documentation about the completion is a bit sparse I agree, you might find some in docs/settings.txt and Extension:CirrusSearch/CompletionSuggester. DCausse (WMF) (talk) 21:10, 19 May 2025 (UTC)