Extension talk:CirrusSearch/2020
This page used the Structured Discussions extension to give structured discussions. It has since been converted to wikitext, so the content and history here are only an approximation of what was actually displayed at the time these comments were made.
Discussion related to the CirrusSearch MediaWiki extension.
See also the open tasks for CirrusSearch on phabricator.
Understanding the difference in behavior between suggester and the actual search page
I'm a bit puzzled about this behavior.
When typing a search term, you get the expected results as suggestions in the suggester.
But if you search for the very same term in Special:Search, you don't get any results at all.
What's causing this behavior? My expected outcome would be something like the suggester's results.
My configuration:
$wgCirrusSearchUseCompletionSuggester = 'yes';
$wgCirrusSearchUseExperimentalHighlighter = true;
$wgCirrusSearchOptimizeIndexForExperimentalHighlighter = true;
$wgCirrusSearchAllowLeadingWildcard = false;
$wgCirrusSearchUseIcuFolding = true;
$wgCirrusSearchWikimediaExtraPlugin[ 'id_hash_mod_filter' ] = true;
$wgCirrusSearchWikimediaExtraPlugin[ 'super_detect_noop' ] = true;
I tried to insert links to external images but the AbuseFilter prevented me; just imagine the suggester returning results close to what I entered.
Example:
Search bar input: shale
Search bar suggester results: shalem, a redirect to another article page
Search page input: shale
Search page result: There were no results matching the query. Gyarujk (talk) 16:18, 4 January 2020 (UTC)
- At a glance I don't see why such behaviors could happen.
- Here are a few debug options that we often use to troubleshoot issues:
- add &cirrusDumpQuery to the Special:Search URL after hitting the search button to see the query sent to elastic
- add &cirrusDumpResult to see the response received from elastic
- I'd also look at the number of docs in your elasticsearch indices by simply running
curl elastic_host:9200/_cat/indices
You should see 3 or 4 indices related to your wiki:
- wikiname_content: fulltext search index for content namespaces
- wikiname_general: fulltext search index for other namespaces
- wikiname_archive: for title search in archive (may not appear if running an old version of CirrusSearch)
- wikiname_titlesuggest: created by
updateSuggesterIndex.php
and enabled thanks to $wgCirrusSearchUseCompletionSuggester
- If wikiname_content is empty while wikiname_titlesuggest is populated, that could explain the behavior you see; in that case you need to rebuild the fulltext indices by running
forceSearchIndex.php
. DCausse (WMF) (talk) 08:02, 8 January 2020 (UTC)
- Hey!
- Thanks a lot for the response.
- I'll add that the stack is currently MediaWiki 1.33.2, CirrusSearch 0.2 (2daa9b8), and Elasticsearch 6.3.1.
- I think my indices look alright from a quick glance.
- curl http://localhost:9200/_cat/indices
- green open mediawiki_content_first TOY5V3J7TE2hzwCJei3QYg 4 0 7031 786 231.9mb 231.9mb
- green open mediawiki_general_first xsaWbMNJQxyFt5uVdCfeoQ 4 0 77128 11821 453.5mb 453.5mb
- green open mw_cirrus_metastore_first h2sbnA1NTj-W9nIAIGGi1w 1 0 45 6 19.7kb 19.7kb
- green open mediawiki_archive_first w8hB-GMTSxq0XA7WTh78rA 4 0 0 0 1kb 1kb
- green open mediawiki_titlesuggest_1578042049 wYk0aNL2RuCf-1GI-WUMBw 4 0 7773 0 2mb 2mb
- Sorry for the huge dump but I'm not able to pastebin this; this is what I get with &cirrusDumpResult:
{
"description": "full_text search for 'shale'",
"path": "mediawiki_content\/page\/_search",
"result": {
"took": 5,
"timed_out": false,
"_shards": {
"total": 4,
"successful": 4,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
},
"suggest": {
"suggest": [
{
"text": "shale",
"offset": 0,
"length": 5,
"options": [
{
"text": "shard",
"highlighted": "\ue000shard\ue001",
"score": 0.04328454
}
]
}
]
},
"status": 200
}
}
- and this is what I get with &cirrusDumpQuery
{
"description": "full_text search for 'shale'",
"path": "mediawiki_content\/page\/_search",
"params": {
"timeout": "20s",
"search_type": "dfs_query_then_fetch"
},
"query": {
"_source": [
"namespace",
"title",
"namespace_text",
"wiki",
"redirect.*",
"timestamp",
"text_bytes"
],
"stored_fields": [
"text.word_count"
],
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{
"query_string": {
"query": "shale",
"fields": [
"all.plain^1",
"all^0.5"
],
"auto_generate_phrase_queries": true,
"phrase_slop": 0,
"default_operator": "AND",
"allow_leading_wildcard": false,
"fuzzy_prefix_length": 2,
"rewrite": "top_terms_boost_1024"
}
},
{
"multi_match": {
"fields": [
"all_near_match^2"
],
"query": "shale"
}
}
],
"filter": [
{
"terms": {
"namespace": [
0
]
}
}
]
}
},
"highlight": {
"pre_tags": [
"\ue000"
],
"post_tags": [
"\ue001"
],
"fields": {
"title": {
"type": "experimental",
"fragmenter": "none",
"number_of_fragments": 1,
"matched_fields": [
"title",
"title.plain"
]
},
"redirect.title": {
"type": "experimental",
"fragmenter": "none",
"order": "score",
"number_of_fragments": 1,
"options": {
"skip_if_last_matched": true
},
"matched_fields": [
"redirect.title",
"redirect.title.plain"
]
},
"category": {
"type": "experimental",
"fragmenter": "none",
"order": "score",
"number_of_fragments": 1,
"options": {
"skip_if_last_matched": true
},
"matched_fields": [
"category",
"category.plain"
]
},
"heading": {
"type": "experimental",
"fragmenter": "none",
"order": "score",
"number_of_fragments": 1,
"options": {
"skip_if_last_matched": true
},
"matched_fields": [
"heading",
"heading.plain"
]
},
"text": {
"type": "experimental",
"number_of_fragments": 1,
"fragmenter": "scan",
"fragment_size": 150,
"options": {
"top_scoring": true,
"boost_before": {
"20": 2,
"50": 1.8,
"200": 1.5,
"1000": 1.2
},
"max_fragments_scored": 5000
},
"no_match_size": 150,
"matched_fields": [
"text",
"text.plain"
]
},
"auxiliary_text": {
"type": "experimental",
"number_of_fragments": 1,
"fragmenter": "scan",
"fragment_size": 150,
"options": {
"top_scoring": true,
"boost_before": {
"20": 2,
"50": 1.8,
"200": 1.5,
"1000": 1.2
},
"max_fragments_scored": 5000,
"skip_if_last_matched": true
},
"matched_fields": [
"auxiliary_text",
"auxiliary_text.plain"
]
},
"file_text": {
"type": "experimental",
"number_of_fragments": 1,
"fragmenter": "scan",
"fragment_size": 150,
"options": {
"top_scoring": true,
"boost_before": {
"20": 2,
"50": 1.8,
"200": 1.5,
"1000": 1.2
},
"max_fragments_scored": 5000,
"skip_if_last_matched": true
},
"matched_fields": [
"file_text",
"file_text.plain"
]
}
},
"highlight_query": {
"query_string": {
"query": "shale",
"fields": [
"title.plain^20",
"redirect.title.plain^15",
"category.plain^8",
"heading.plain^5",
"opening_text.plain^3",
"text.plain^1",
"auxiliary_text.plain^0.5",
"title^10",
"redirect.title^7.5",
"category^4",
"heading^2.5",
"opening_text^1.5",
"text^0.5",
"auxiliary_text^0.25"
],
"auto_generate_phrase_queries": true,
"phrase_slop": 1,
"default_operator": "AND",
"allow_leading_wildcard": false,
"fuzzy_prefix_length": 2,
"rewrite": "top_terms_boost_1024"
}
}
},
"suggest": {
"text": "shale",
"suggest": {
"phrase": {
"field": "suggest",
"size": 1,
"max_errors": 2,
"confidence": 2,
"real_word_error_likelihood": 0.95,
"direct_generator": [
{
"field": "suggest",
"suggest_mode": "always",
"max_term_freq": 0.5,
"min_doc_freq": 0,
"prefix_length": 2
}
],
"highlight": {
"pre_tag": "\ue000",
"post_tag": "\ue001"
},
"smoothing": {
"stupid_backoff": {
"discount": 0.4
}
}
}
}
},
"stats": [
"suggest",
"full_text",
"full_text_querystring"
],
"size": 21,
"rescore": [
{
"window_size": 8192,
"query": {
"query_weight": 1,
"rescore_query_weight": 1,
"score_mode": "multiply",
"rescore_query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field": "incoming_links",
"modifier": "log2p",
"missing": 0
}
}
]
}
}
}
}
]
},
"options": {
"timeout": "20s",
"search_type": "dfs_query_then_fetch"
}
}
- api.php?action=opensearch&format=json&formatversion=2&search=shale&namespace=0&limit=10&suggest=true
- returns
["shale",["Shalem","Share Chest","Scales of Dominion","Shareeravadi","Shade Knife","Sealed Claustrum","Scalewing"],["","","","","","",""],["https://gbf.wiki/Shalem","https://gbf.wiki/Share_Chest","https://gbf.wiki/Scales_of_Dominion","https://gbf.wiki/Shareeravadi","https://gbf.wiki/Shade_Knife","https://gbf.wiki/Sealed_Claustrum","https://gbf.wiki/Scalewing"]]
Gyarujk (talk) 10:37, 8 January 2020 (UTC)
- I think I now understand.
- There are no pages containing the word shale. What you see in the suggestion box (top right for left-to-right languages) is a search on title prefixes, so that pages can be suggested while you type their titles; that is probably why shalem is suggested when you type shale.
- But once you hit enter without selecting one of the suggested pages, you enter the fulltext search mode, which searches for full words. Since shale is not a full word present in your corpus, the fulltext search is not able to find anything.
- This is often the case in commercial search engines: if you hit enter with a partially typed query, you search for those partially typed words even if you've been offered ways to complete your query. DCausse (WMF) (talk) 16:06, 8 January 2020 (UTC)
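For what it's worth, CirrusSearch's fulltext mode can be asked for prefix matches explicitly with a trailing wildcard (leading wildcards are controlled separately by $wgCirrusSearchAllowLeadingWildcard), so an input like the following should approximate the suggester's behavior:
Search page input: shale*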
no compatibility with DisplayTitle
RESOLVED
Tracked in T143396
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
CirrusSearch doesn't index the display titles of Extension:Display Title but would it be possible to extend its functionality in the future? S0ring (talk) 12:10, 24 January 2020 (UTC)
- We started working on this but haven't had a chance to finish it; the display_title is now indexed and there is a patch in progress to make use of it.
- See T143396. DCausse (WMF) (talk) 14:09, 24 January 2020 (UTC)
What would fix "Sort order of create_timestamp_desc is unrecognized"?
My base system is:
MediaWiki | 1.34.0
PHP | 7.2.24-0ubuntu0.18.04.2 (apache2handler)
MariaDB | 10.1.43-MariaDB-0ubuntu0.18.04.1
ICU | 60.2
Elasticsearch | 6.5.4
And my plugins include:
AdvancedSearch | 0.1.0 (4affc9c) 03:23, 1 October 2019
CirrusSearch | 6.5.4 (a86e0a5) 23:14, 10 October 2019
Elastica | 6.0.2 (b33d985) 21:44, 10 October 2019
And maintenance/checkIndexes.php reports all OK. Where do I go to get the timestamp-based sort orders functioning? Thanks. WhitWye (talk) 20:20, 30 January 2020 (UTC)
- Okay, found what I'd missed: The CirrusSearch README includes the step of adding "$wgSearchType = 'CirrusSearch'". In looking back and forth I missed that. Since that step apparently does not change between versions of CirrusSearch, a more obvious place to put it might be in the main page here, rather than in the README, or to at least duplicate it here, as is done for the "wfLoadExtension( )" settings. There's no such thing as being overly obvious in documentation. WhitWye (talk) 20:51, 31 January 2020 (UTC)
Adding crawled external sites' SERPs to my wiki's search box
Hi,
I have a rather complicated question...
I run a wiki (tunearch.org) that uses elasticsearch/Elastica/CirrusSearch as search engine.
I also have another site (not wiki) from which I would like to collect data with a crawler and feed it to elasticsearch.
I searched on the net for some solutions and discovered that with some crawlers (Scrapy, Nutch, Norconex HTTP Collector, ...) you can build a spider and then inject the collected data into my elasticsearch.
Now my question is this: is it possible to use the Cirrus/Elastica/elasticsearch extension architecture to allow all those who search for articles in my wiki to also find the relevant pages of the second (third, fourth, ...) site(s) crawled with the spider, using the standard search box?
This is my MW configuration:
Product | Version
MediaWiki | 1.34.0 (ae6e0c0) 08:53, December 27, 2019
PHP | 7.2.17-0ubuntu0.18.04.1 (apache2handler)
MariaDB | 10.1.38-MariaDB-0ubuntu0.18.04.1
ICU | 60.2
Lua | 5.1.5
LilyPond | 2.18.2
Elasticsearch | 5.6.16
Lucene | 6.6.1
CirrusSearch | 6.5.4 (a86e0a5)
Elastica | 6.0.2 (b33d985)
Any little help is really appreciated. Silkwood (talk) 12:44, 3 February 2020 (UTC)
"Content pages" is the default profile tab
The search bar has multiple profile tabs like "Content pages" (mainspace), "Multimedia" (File), "Everything" (all plus File), but the default search is done with "Content pages". Is it possible to configure it to start searching in "Everything" by default?
Example: "Content pages" (default) http://<hostname>/wiki/index.php?search=
"Multimedia" http://<hostname>/wiki/index.php?title=Spezial:Suche&profile=images&search=&fulltext=1
"Everything" http://<hostname>/wiki/index.php?title=Spezial:Suche&profile=all&search=&fulltext=1 S0ring (talk) 16:55, 17 February 2020 (UTC)
- The feature exactly as stated doesn't exist, but the default namespaces to search can be configured via Manual:$wgNamespacesToBeSearchedDefault. You could enumerate all of them there. EBernhardson (WMF) (talk) 20:03, 18 February 2020 (UTC)
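For example, in LocalSettings.php (the namespace IDs below are illustrative; a custom namespace would use the numeric ID it was registered with):
$wgNamespacesToBeSearchedDefault = [
	NS_MAIN => true,
	NS_FILE => true,
	// hypothetical custom namespace registered with ID 3000
	3000 => true,
];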
Error creating thumbnail: convert: unable to extend cache File too large
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
A PDF file with large resolution (~3,500 × 2,500 or larger) in the "Search results" won't show a thumbnail, but the following error instead:
Error creating thumbnail: convert: unable to extend cache `/tmp/magick-67rYBVGl1cqvlF': File too large @ error/cache.c/OpenPixelCache/4104. convert: no images defined `images/tmp/transform_bc3ca503600a.jpg' @ error/convert.c/ConvertImageCommand/3258.
Note: other PDF files with smaller resolution, i.e. ~1,300 × 1,750, will show a thumbnail as expected.
The logs show the following error:
File::transform: Doing stat for mwstore://local-backend/local-thumb/2/21/<filename>.pdf/page1-120px-<filename>.pdf.jpg
[FileOperation] FileBackendStore::getFileStat: File mwstore://local-backend/local-thumb/2/21/<filename>.pdf/page1-120px-<filename>.pdf.jpg does not exist.
PdfHandler::doTransform: called wfMkdirParents(images/tmp)
PdfHandler::doTransform: ('/usr/bin/gs' '-sDEVICE=jpeg' '-sOutputFile=-' '-dFirstPage=1' '-dLastPage=1' '-dSAFER' '-r150' '-dBATCH' '-dNOPAUSE' '-q' 'images/2/21/<filename>.pdf' | '/usr/bin/convert' '-depth' '8' '-quality' '95' '-resize' '120' '-' 'images/tmp/transform_bc3ca503600a.jpg')
[exec] MediaWiki\Shell\Command::execute: /bin/bash '/var/www/html/includes/shell/limit.sh' '('\''/usr/bin/gs'\'' '\''-sDEVICE=jpeg'\'' '\''-sOutputFile=-'\'' '\''-dFirstPage=1'\'' '\''-dLastPage=1'\'' '\''-dSAFER'\'' '\''-r150'\'' '\''-dBATCH'\'' '\''-dNOPAUSE'\'' '\''-q'\'' '\''images/2/21/<filename>.pdf'\'' | '\''/usr/bin/convert'\'' '\''-depth'\'' '\''8'\'' '\''-quality'\'' '\''95'\'' '\''-resize'\'' '\''120'\'' '\''-'\'' '\''images/tmp/transform_bc3ca503600a.jpg'\'')' 'MW_INCLUDE_STDERR=1;MW_CPU_LIMIT=180; MW_CGROUP='\'''\''; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes'
[thumbnail] Removing bad 0-byte thumbnail "images/tmp/transform_bc3ca503600a.jpg". unlink() succeeded
[thumbnail] thumbnail failed on 2b475d55050d: error 1 "convert: unable to extend cache `/tmp/magick-67rYBVGl1cqvlF': File too large @ error/cache.c/OpenPixelCache/4104.
convert: no images defined `images/tmp/transform_bc3ca503600a.jpg' @ error/convert.c/ConvertImageCommand/3258." from "('/usr/bin/gs' '-sDEVICE=jpeg' '-sOutputFile=-' '-dFirstPage=1' '-dLastPage=1' '-dSAFER' '-r150' '-dBATCH' '-dNOPAUSE' '-q' 'images/2/21/<filename>.pdf' | '/usr/bin/convert' '-depth' '8' '-quality' '95' '-resize' '120' '-' 'images/tmp/transform_bc3ca503600a.jpg')"
The /tmp partition has enough free space (14G out of 30G):
# df -h /tmp
Filesystem Size Used Avail Use% Mounted on
overlay 30G 16G 14G 54% /
The problem seems to be on the browser side, since on the command line no error occurs:
# /usr/bin/gs -sDEVICE=jpeg -sOutputFile=- -dFirstPage=1 -dLastPage=1 -dSAFER -r150 -dBATCH -dNOPAUSE -q images/2/21/<filename>.pdf | /usr/bin/convert -depth 8 -quality 95 -resize 120 - images/tmp/transform_ea069aa2269a.jpg
# ls -lrt images/tmp/transform_ea069aa2269a.jpg
-rw-r--r-- 1 root root 3370 Feb 18 09:11 images/tmp/transform_ea069aa2269a.jpg
S0ring (talk) 08:54, 18 February 2020 (UTC)
- You may need to increase Manual:$wgMaxShellMemory or Manual:$wgMaxShellFileSize Ciencia Al Poder (talk) 10:36, 18 February 2020 (UTC)
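For example, in LocalSettings.php (both values are in kilobytes; the numbers below are illustrative, roughly doubling the defaults visible in the log above):
$wgMaxShellMemory = 614400;   // ~600 MB, default 307200
$wgMaxShellFileSize = 204800; // ~200 MB, default 102400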
- Indeed it worked! Thank you! S0ring (talk) 11:36, 18 February 2020 (UTC)
No extension.json for mw 1.31
RESOLVED
Extension registration was implemented for MW 1.34 and later.
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Where is the extension.json file for mw 1.31? Legaulph (talk) 13:15, 10 March 2020 (UTC)
- There isn't any, thus you have to invoke it the classic way. Extension registration was only implemented for MW 1.34 and later. [[kgh]] (talk) 15:38, 10 March 2020 (UTC)
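The classic way means loading the extension's PHP entry point in LocalSettings.php, i.e. something like this (assuming the entry-point file exists in your checkout of that branch):
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";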
Easy install script?
Seems there are quite a few dependencies for this extension. Might you be able to share a way to automate this installation? Thank you! Paradox of Thrift (talk) 18:04, 10 March 2020 (UTC)
Completion Suggester: Missing the "containing..." option
RESOLVED
The cause was some JS code lines which were saved by mistake in MediaWiki:Common.js
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Could anyone tell me the cause of the "containing..." option going missing? Surprisingly, it shows in Special:Preferences. S0ring (talk) 11:46, 11 March 2020 (UTC)
Http error communicating with Elasticsearch
RESOLVED
Had the wrong endpoint
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
MediaWiki 1.31.6
PHP 7.3.15 (cgi-fcgi)
MySQL 5.6.41-log
LinkedWiki 3.3.7
CirrusSearch 0.2 (ad9a0d9) 16:24, 17 April 2018
Elastica 1.3.0.0 (7019d96) 20:49, 13 April 2018
CirrusSearch was working, and I can connect to the elasticsearch server:
D:\xampp\htdocs\mediawiki>curl server.com:9200 --verbose
* Trying fe80::a8b7:e5c8:323:6b1d:9200...
* TCP_NODELAY set
* Connected to server.com (fe80::a8b7:e5c8:323:6b1d) port 9200 (#0)
> GET / HTTP/1.1
> Host: server.com:9200
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 328
<
{
  "name" : "nqKWfGO",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "nfxRkxclRwat03NRtCZuhA",
  "version" : {
    "number" : "5.6.16",
    "build_hash" : "3a740d1",
    "build_date" : "2019-03-13T15:33:36.565Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
* Connection #0 to host AWSACRNVA1046.jnj.com left intact
Trying to update I get:
Fetching Elasticsearch version... Unexpected Elasticsearch failure. Http error communicating with Elasticsearch: Operation timed out. Legaulph (talk) 16:09, 20 March 2020 (UTC)
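Given the resolution note above ("Had the wrong endpoint"), a typical thing to double-check is the configured server list in LocalSettings.php, e.g. (host is illustrative):
$wgCirrusSearchServers = [ 'server.com' ];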
How to list more than 1 result from a wiki page
Hi, I'm using
MediaWiki 1.27.1
PHP 5.5.9-1ubuntu4.22 (apache2handler)
MySQL 5.5.53-0ubuntu0.14.04.1
ICU 52.1
Elasticsearch 1.7.5
and I'm kinda pleased with the search results. However, I have a problem. My wiki has a page "Windows tip" with 2 headings named "windows can't sleep" and "Windows wake from sleep". A search for "windows sleep" only brings up "Windows wake from sleep", and then the next result comes from another page. How can I list more than 1 result from a wiki page?
PS: I can code a bit, so if this feature is not available I can contribute. Chachacha2020 (talk) 04:01, 3 April 2020 (UTC)
[8a4e47bbf50dc37d2271edc5] 2020-04-05 18:03:43: Fatal exception of type "Error"
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Hello everyone!
I just installed MediaWiki with the extensions CirrusSearch and Elasticsearch following the README file. I uploaded some pages through the enwikipedia XML dump and wanted to search for a page I knew was certainly in these pages. But then I got this error:
[8a4e47bbf50dc37d2271edc5] 2020-04-05 18:03:43: Fatal exception of type "Error"
Has anybody a clue how to fix this ?
Thank you in advance 2003:EE:AF2B:6326:D826:7BD0:EDD2:3D0C (talk) 20:15, 5 April 2020 (UTC)
- Temporarily set $wgShowExceptionDetails = true; in LocalSettings.php to view a more detailed error message. Ciencia Al Poder (talk) 18:34, 6 April 2020 (UTC)
- Obviously no longer an issue. [[kgh]] (talk) 08:37, 19 June 2020 (UTC)
Expose top searches
Is there a way to see the top searches that have been performed over a period of time? PhotographerTom (talk) 19:45, 22 April 2020 (UTC)
- CirrusSearch doesn't have any similar functionality. It does have low-level logging which could be batch processed to aggregate the top searches, but from CirrusSearch's perspective it logs the request and never thinks about it again. EBernhardson (WMF) (talk) 16:55, 28 April 2020 (UTC)
- I guess this is a feature request that could be added as a task to Phabricator? [[kgh]] (talk) 08:38, 19 June 2020 (UTC)
- In my opinion it wouldn't really be a feature request for CirrusSearch. This seems more like asking for an analytics platform to be built into MediaWiki, which CirrusSearch could then piggy-back off of to provide analytics over search requests such as top queries. EBernhardson (WMF) (talk) 17:58, 1 July 2020 (UTC)
redis job
RESOLVED
Let's face reality: one needs to set up jobs with Redis.
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Hi,
- If no cache system like redis is available, then what can one do in order to use CirrusSearch without job problems and avoid the 'Notice'?
- As for the JDK installation, where should I install its Linux binaries on the server?
Farvardyn (talk) 19:33, 27 April 2020 (UTC)
- CirrusSearch and elasticsearch are generally complex software to install and maintain, I would only suggest using it in a fairly advanced scenario. CirrusSearch makes heavy use of the job queue, it will likely only work with a full job queue implementation (like the redis one) installed. Essentially add a job queue to the list of requirements, it's just as essential as elasticsearch. EBernhardson (WMF) (talk) 16:57, 28 April 2020 (UTC)
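For reference, a minimal Redis-backed job queue setup in LocalSettings.php might look like this (host and values are illustrative; see Manual:$wgJobTypeConf):
$wgJobTypeConf['default'] = [
	'class' => 'JobQueueRedis',
	'redisServer' => '127.0.0.1:6379',
	'redisConfig' => [],
	'claimTTL' => 3600,
	'daemonized' => true,
];
Note that 'daemonized' => true expects a separate job runner service to be processing the queue.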
- Thanks for your assessment! [[kgh]] (talk) 08:35, 19 June 2020 (UTC)
[166e71ff89c6a092549ca318] [no req] MWException from line 310 of mediawiki\includes\parser\ParserOutput.php: Bad parser output text. 5.
RESOLVED
Converted database collation to latin1_bin
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
MediaWiki 1.31.7
PHP 7.3.17 (apache2handler)
MySQL 8.0.18
Elasticsearch 5.6.16
I updated from MediaWiki 1.31.1 to MediaWiki 1.31.7 and now I'm seeing this error with elastic search. I added code to ParserOutput.php to display the page name; however, I can't find the page with "GTS" or "Standard Kanban Board in Leankit", that makes sense. I also searched for https://server/api.php?action=query&prop=info&pageids=5716 and did not see anything like what is displayed.
[ mykaidevdb] Indexed 10 pages ending at 5716 at 13/second
[166e71ff89c6a092549ca318] [no req] MWException from line 310 of D:\Bitnami\wampstack\apps\mediawiki\includes\parser\ParserOutput.php: Bad parser output text. 5. GTS’ Standard Kanban Board in Leankit
Backtrace:
- 0 [internal function]: ParserOutput->{closure}(array)
- 1 D:\Bitnami\wampstack\apps\mediawiki\includes\parser\ParserOutput.php(320): preg_replace_callback(string, Closure, string)
- 2 D:\Bitnami\wampstack\apps\mediawiki\includes\content\WikiTextStructure.php(152): ParserOutput->getText(array)
- 3 D:\Bitnami\wampstack\apps\mediawiki\includes\content\WikiTextStructure.php(225): WikiTextStructure->extractWikitextParts()
- 4 D:\Bitnami\wampstack\apps\mediawiki\includes\content\WikitextContentHandler.php(150): WikiTextStructure->getOpeningText()
- 5 D:\Bitnami\wampstack\apps\mediawiki\extensions\CirrusSearch\includes\Updater.php(366): WikitextContentHandler->getDataForSearchIndex(WikiPage, ParserOutput, CirrusSearch)
- 6 D:\Bitnami\wampstack\apps\mediawiki\extensions\CirrusSearch\includes\Updater.php(204): CirrusSearch\Updater->buildDocumentsForPages(array, integer)
- 7 D:\Bitnami\wampstack\apps\mediawiki\extensions\CirrusSearch\maintenance\forceSearchIndex.php(218): CirrusSearch\Updater->updatePages(array, integer)
- 8 D:\Bitnami\wampstack\apps\mediawiki\maintenance\doMaintenance.php(94): CirrusSearch\ForceSearchIndex->execute()
- 9 D:\Bitnami\wampstack\apps\mediawiki\extensions\CirrusSearch\maintenance\forceSearchIndex.php(679): require_once(string)
- 10 {main} Legaulph (talk) 12:27, 1 June 2020 (UTC)
- Figured out the issue!
- I was using utf8 for the database and converted it to latin_bin and everything started working. Legaulph (talk) 13:49, 2 June 2020 (UTC)
Java version compatibility
I've moved our ElasticSearch server to another machine, which is running Ubuntu 20.04 and has the package default-jre (which contains Java 11) and the package elasticsearch-6.5.4.deb installed. The MediaWiki version is 1.34 and port 9200 is forwarded via SSH.
Then I rebuilt the search index according to the instructions provided in the README file. Everything went well.
Unfortunately, when I tried to use the search feature of the wiki via the web interface, I received the message: We could not complete your search due to a temporary problem. Please try again later.
After a while, I found the ElasticSearch service was dead, with the following reason:
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely...
So in order to get ElasticSearch operational I've switched to Java 8 (reference) by using the following commands:
sudo apt install openjdk-8-jre-headless
sudo apt install openjdk-8-jdk-headless
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo systemctl restart elasticsearch.service
curl 'http://127.0.0.1:9200' # do a test
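(A quick way to confirm which Java is active after switching alternatives:)
java -version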
Now everything works great!
I do not know which is the troublemaker, Extension:CirrusSearch or the ElasticSearch service, but I think it would be meaningful to include some additional information about compatibility with the different Java versions.
Regards! Spas Spasov Spas.Z.Spasov (talk) 07:04, 12 June 2020 (UTC)
- Thanks a lot for sharing this information. Indeed, having a Java compat overview will be great. Perhaps it is already there in the abyss of the extension's (code) docu.
- Anyhow the bottom line appears to be that Java 8 is required for recent versions of Cirrus. [[kgh]] (talk) 10:12, 12 June 2020 (UTC)
- @Kghbln, I'm happy to do that :) Here is small update:
- With my setup
elasticsearch-6.5.4.deb
constantly crashes after a few hours of work. So I switched back toelasticsearch-5.6.16.deb
and it works without problems and need of restart for about a week yet. Despite of within the extension's documentations is written MediaWiki 1.33.x and 1.34.x require Elasticsearch 6.5.x. - Another thing that I remembered, when I started to use Extension:CirrusSearch I wasn't able to made the initial search index, unless changing the MySQL's database name from
myWiki
to my_wiki
(without capital letters). Spas.Z.Spasov (talk) 07:13, 19 June 2020 (UTC) - Thanks again for keeping us updated about your experience. I find it strange that ES 6.5.4 works with JDK 8 on Ubuntu 18.04. without issues whereas it appears that you need to use ES 5.6.16 with JDK 8 on Ubuntu 20.04. However you cannot beat reality. Did you track why ES was failing? Probably good to know and report.
- The compatibility table in the documentation is based on what the developers of CirrusSearch think it should work with. If there is unofficial compatibility this is even better.
- About the database name: I also ran into this earlier and found a then undocumented configuration parameter. Thus you could have done
$wgCirrusSearchIndexBaseName = 'mywiki';
to avoid renaming the database name. I added an info about it directly to the extension's page rather than linking to the many spots that explore the whole lot. :) [[kgh]] (talk) 08:08, 19 June 2020 (UTC)
Search result: totalhits count is mismatched from total result rows
Result mismatch.
Need some help here: the totalhits count is mismatched with the total result rows. Could you please help me here? Why is this different? Is there any command I need to run or setting I need to set?
{
"batchcomplete": "",
"warnings": {
"search": {
"*": "Unrecognized value for parameter \"srnamespace\": LL:."
}
},
"query": {
"searchinfo": {
"totalhits": 6
},
"search": [
{
"ns": 3024,
"title": "Debris Gas System",
"pageid": 6048,
"size": 2243,
"wordcount": 291,
"snippet": "Debris was found in the fuel gas system",
"timestamp": "2018-10-24T17:46:48Z"
},
]
}
} 147.1.18.25 (talk) 08:00, 16 July 2020 (UTC)
totalhits
is the total number of hits found in elasticsearch. It might vary from what you see for multiple reasons:
- the number of pages returned is controlled by a limit param (e.g.
srlimit
), in your example if you've set srlimit=1 this result looks perfectly normal - the index is not up to date, if for some reasons the process responsible for keeping the index up to date did not function properly then some pages that have been deleted might be in the elasticsearch index but are filtered-out when displaying the results back to the user leading to such inconsistencies in totalmatch and what you could see in the results.
- the number of pages returned is controlled by a limit param (e.g.
- If you believe you are affected by the second point, try to re-sync your index using the
maintenance/saneitize.php
(or Saneitize.php if using a recent version of CirrusSearch). If this fixes the issue you should try to understand what happened so that it does not happen again. DCausse (WMF) (talk) 09:12, 16 July 2020 (UTC)
- @DCausse (WMF) - Thanks for your response.
- I couldn't find the saneitize.php script. I would like to inform you that I am using MW 1.31. Rajeshrajesh.35 (talk) 07:17, 20 July 2020 (UTC)
- The script can be found in the CirrusSearch folder, under the maintenance directory.
- Are you using CirrusSearch with the REL1_31 branch as well? DCausse (WMF) (talk) 07:39, 21 July 2020 (UTC)
Elasticsearch version for upcoming MediaWiki 1.35
Which version of elasticsearch will be required with the upcoming MediaWiki 1.35?
Latest Elasticsearch release is 7.8.1, but the CirrusSearch README still says "6.5.4 or higher". Does the "higher" also include 7.8.x, or should I stay with 6.5.x? Cboltz (talk) 20:20, 30 July 2020 (UTC)
- Elastic 7.x is not yet supported, WMF is running 6.5.4 but I believe newer versions in the 6.x branch might work as well. If you plan to use WMF elasticsearch plugins then I'd suggest to use 6.5.4. DCausse (WMF) (talk) 07:58, 31 July 2020 (UTC)
- ElasticSearch server version 6.8 seems to work fine with MediaWiki 1.35. But the ElasticSearch PHP client version 6.8 causes issues. The CirrusSearch maintenance scripts call certain classes that were removed in the 6.8 branch of the ElasticSearch PHP client. The 6.7 branch seems OK. Ike Hecht 18:21, 5 March 2021 (UTC)
- Tracking in T276854. Ike Hecht 21:16, 8 March 2021 (UTC)
Determine the search backend used
RESOLVED
Either append '&cirrusDumpQuery' to the search URL for a data dump in JSON format, or alternatively use a Cirrus-specific search keyword, e.g. hastemplate:foo
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
How can I determine if the results on Special:Search are returned by Cirrus and not standard MediaWiki? [[kgh]] (talk) 08:27, 25 August 2020 (UTC)
- If it's only for debugging purposes, appending '&cirrusDumpQuery' to the Special:Search URL should dump a JSON containing the elasticsearch query if CirrusSearch is being used. If adding such a param does not change the output, then the standard MW search (or another search engine) is being used.
- If it's not for debugging purposes then I doubt there is an obvious way to determine this from the UI. DCausse (WMF) (talk) 14:01, 25 August 2020 (UTC)
- You could also use some cirrus specific keywords and see if they work: e.g.
hastemplate:foo
. DCausse (WMF) (talk) 14:03, 25 August 2020 (UTC)
- Perfect. This really helped. Indeed this was to check after an upgrade whether things were still working, not for continuous monitoring. There was an issue with the interaction with another extension which caused Cirrus to fail initially. All of these worries: gone.
- Both ways to determine the backend are great. Admittedly I could have come up with the second option too. In the heat of the action ... :) [[kgh]] (talk) 10:10, 26 August 2020 (UTC)
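For reference, the debugging variant looks like this (hypothetical wiki URL):
https://example.org/index.php?title=Special:Search&search=foo&fulltext=1&cirrusDumpQuery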
Elasticsearch not up
I can't seem to find much information on how to fix this, but when I am running the maintenance scripts, I get "Elasticsearch not up". I've got both the elasticsearch and CirrusSearch extensions active. 68.110.86.107 (talk) 05:38, 3 October 2020 (UTC)
- Is the ElasticSearch service up and running? Is it accepting connections from MediaWiki? Be sure settings in MediaWiki regarding connection to ElasticSearch are correct and there's no firewall blocking the connections Ciencia Al Poder (talk) 11:09, 3 October 2020 (UTC)
- Do you know how I can tell that? 68.110.86.107 (talk) 23:18, 3 October 2020 (UTC)
- I suspect you haven't installed an ElasticSearch service. In that case, you better read the getting started guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html Ciencia Al Poder (talk) 11:17, 4 October 2020 (UTC)
- I have the elastica extension installed, and it shows running in versions. I guess it doesn't matter, I don't think I'm going to pay for the enhanced search for my small hobby site.
- Are you aware of any similar free alternatives? 68.110.86.107 (talk) 16:05, 4 October 2020 (UTC)
- ElasticSearch is a service. A daemon. Just like a webserver (apache, nginx...). You can install and configure it on your server for free, if you know/learn how to do it.
- There are no free ElasticSearch hosting alternatives AFAIK Ciencia Al Poder (talk) 17:32, 4 October 2020 (UTC)
- probably can't install it on my shared server at bluehost, but I'll do some research. Thanks. 68.110.86.107 (talk) 03:29, 5 October 2020 (UTC)
- Really, it is really difficult to know if everything is good, really not clear. "Is the ElasticSearch service up and running? Is it accepting connections from MediaWiki? Be sure settings in MediaWiki regarding connection to ElasticSearch are correct and there's no firewall blocking the connections" => how can you check each of these steps? Sancelot (talk) 08:10, 16 October 2020 (UTC)
- Installing and configuring an ElasticSearch instance is complex and requires some background knowledge about services and networking, necessary not only for setting it up, but for maintaining it in the long term. If you lack this background knowledge, you'll need to familiarize yourself first with those concepts, or hire someone that can set it up for you. Ciencia Al Poder (talk) 11:16, 17 October 2020 (UTC)
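For example, a few basic checks from a shell (host and port are illustrative, and assume a systemd-based install):
systemctl status elasticsearch        # is the service running?
curl http://127.0.0.1:9200            # does it answer locally?
curl http://elastic_host:9200/_cluster/health?pretty   # is it reachable from the wiki server?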
ForceSearchIndex.php isn't populating ES
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Recently migrated a wiki from one server to another, everything seems to be working fine except for Cirrus.
When re-running ForceSearchIndex.php in order to repopulate the fresh elasticsearch install after UpdateSearchIndexConfig etc., it doesn't seem to be populating anything.
[ x] Indexed 9 pages ending at 946 at 55/second
[ x] Indexed 10 pages ending at 968 at 85/second
[ x] Indexed 9 pages ending at 986 at 99/second
[ x] Indexed 10 pages ending at 999 at 50/second
[ x] Indexed 10 pages ending at 1012 at 58/second
[ x] Indexed 10 pages ending at 1024 at 66/second
[ x] Indexed 10 pages ending at 1035 at 72/second
[ x] Indexed 10 pages ending at 1046 at 79/second
[ x] Indexed 10 pages ending at 1058 at 84/second
It runs through the pages as if it were correctly doing it and finishes with no errors; however, attempting to do a search will always return no results.
And after running CirrusNeedsToBeBuilt.php just to verify there's data, I'm getting
Elasticsearch status: green
No pages in the content index. Indexes were probably wiped.
Is there something I've done wrong?
Thanks! Corin12355 (talk) 13:36, 24 October 2020 (UTC)
- You need to set $wgDisableSearchUpdate = true; before running updateSearchIndexConfig.php, then set $wgDisableSearchUpdate = false; before running forceSearchIndex.php, which needs to be run in 2 stages (--skipLinks --indexOnSkip first, and --skipParse next) Ciencia Al Poder (talk) 21:47, 26 October 2020 (UTC)
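For reference, the whole sequence might look like this (run from the wiki's root directory):
# LocalSettings.php: $wgDisableSearchUpdate = true;
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php
# LocalSettings.php: $wgDisableSearchUpdate = false;
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse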
- Performed that before but tried it again just to be sure I didn't mix up anything and same deal sadly.
- Forgot to mention my versions as well:
MediaWiki: 1.35.0 (c1e34e3)
PHP: 7.3.19-1~deb10u1 (fpm-fcgi)
MariaDB: 10.5.6-MariaDB-1:10.5.6+maria~buster
Elasticsearch: 6.5.4
- I downgraded from 6.8.12 to the same version Wikipedia uses just to be sure it wasn't that and unfortunately there was no difference either.
- Appreciate the help though! Corin12355 (talk) 16:15, 28 October 2020 (UTC)
- Mystery solved. "Indexed x pages" was a bit misleading, it had actually added them to the job queue, not actually indexed them.
- The problem I had was the cronjob for the queue wasn't moved over with the website (forgot about it), as a result nothing was being submitted to ES! Corin12355 (talk) 17:31, 18 November 2020 (UTC)
Elasticsearch won't index documents
RESOLVED
Problem was the job queue runner: $wgJobRunRate = 0; was set while the job queue service was not up properly.
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Hi, I am following the instructions to install CirrusSearch. I set $wgDisableSearchUpdate, then ran the maintenance script, then removed $wgDisableSearchUpdate from LocalSettings.php and then ran the two other maintenance scripts. Those scripts seem to work; at least the output seems to suggest so. I get an output like this:
[ mydb-pmw_] Indexed 50 pages ending at 108 at 300/second
[ mydb-pmw_] Indexed 48 pages ending at 158 at 319/second
[ mydb-pmw_] Indexed 50 pages ending at 208 at 335/second
[ mydb-pmw_] Indexed 50 pages ending at 259 at 349/second
[... Etc ...]
Now if I try to search, it always comes out empty; if I try to append '&cirrusDumpQuery' to the search query, it comes out empty too. Looking at Kibana (GET /_cat/indices?v):
health status index ------------------------- uuid pri rep docs.count docs.deleted store.size pri.store.size
green open mydb-pmw__general_first --------- XXX 4 0 0 0 1kb 1kb
green open mydb-pmw__archive_first --------- XXX 4 0 0 0 1kb 1kb
green open pmw_cirrus_metastore_first ------ XXX 1 0 29 6 9.7kb 9.7kb
green open .kibana_1 ----------------------- XXX 1 0 3 0 11kb 11kb
green open mydb-pmw__content_first --------- XXX 4 0 0 0 1kb 1kb
green open .kibana_task_manager ------------ XXX 1 0 2 0 12.5kb 12.5kb
So, the output makes it seem like the maintenance script is indexing, whereas in Kibana and when appending &cirrusDumpQuery it seems like nothing gets indexed. Am I missing something obvious?
Thank you John Bird Jr (talk) 03:00, 17 November 2020 (UTC)
Installing with Composer
RESOLVED
I updated composer and installed ext-curl. Thereafter, installation from the "/extensions/Elastica" directory worked OK.
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
The installation instructions for Elastica mention running Composer "by issuing composer install --no-dev in the extension directory". I assume this means "/extensions/Elastica", but running "composer install --no-dev" from there gives this response: "Your requirements could not be resolved to an installable set of packages". Henryfunk (talk) 17:55, 22 December 2020 (UTC)
How can I search across multiple namespaces by default?
As it stands, it looks like CirrusSearch will only search the (Main) namespace by default. I'd prefer it to search across (Main) and another custom namespace. Is there a way to configure this to happen? The namespace in question is declared as a content namespace and I have added it to the default search with $wgNamespacesToBeSearchedDefault, but I do not get results from that namespace. Blinkingline (talk) 20:36, 28 December 2020 (UTC)
- Hi,
- when changing this setting you should do a complete reindex from source documents:
php maintenance/UpdateSearchIndexConfig.php --startOver
php maintenance/ForceSearchIndex.php
- Beware that it will recreate your elasticsearch indices.
- If your wiki is not huge (fewer than a couple thousand pages) you might prefer to run the
maintenance/Saneitize.php
maintenance script to align the search indices with the new defaults. DCausse (WMF) (talk) 10:12, 4 January 2021 (UTC)