Extension:TikaAllTheFiles
The TikaAllTheFiles (TATF) extension enables full-text search over uploaded files using the Apache Tika content analysis toolkit, which "detects and extracts metadata and text from over a thousand different file types".
In practical terms: if you already have Extension:CirrusSearch set up and working on your wiki, TATF lets you run full-text searches over the contents of almost any uploaded file, not just PDFs.
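As a rough illustration of what such a search looks like from a client, the sketch below builds a query against the standard MediaWiki search API (`action=query&list=search`), restricted to the File: namespace (namespace 6). The wiki URL and the search phrase are placeholders, and the helper name `build_file_search_url` is hypothetical; nothing here is specific to TATF beyond assuming that file contents have been indexed.

```python
from urllib.parse import urlencode


def build_file_search_url(api_url: str, query: str, limit: int = 10) -> str:
    """Build a MediaWiki search API URL restricted to the File: namespace."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,    # full-text search phrase
        "srnamespace": 6,     # namespace 6 is File:
        "srlimit": limit,     # maximum number of results to return
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


# Placeholder wiki URL and search phrase, for illustration only.
url = build_file_search_url("https://wiki.example.org/w/api.php", "quarterly budget")
print(url)
```

Fetching this URL with any HTTP client should return JSON whose `query.search` list contains the matching File: pages, provided the wiki's search backend has indexed the file contents.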
TATF's features and capabilities:
- extract embedded digital text from any type of uploaded file so that it can be indexed for full-text search;
- extract and index printed text from bitmap image files and from images embedded in document files, e.g., image-only PDFs (requires Tesseract OCR);
- extract metadata from any type of uploaded file for display on File: pages;
- index metadata properties along with text, to enable simple searching for properties within full-text search.
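To give a feel for the kind of text extraction listed above, here is a minimal sketch that prepares a request to a standalone Apache Tika Server, which accepts a document via `PUT /tika` and returns plain text when asked for `text/plain`. This does not describe how TATF itself invokes Tika (see README.md for that); the server address is an assumption (a Tika Server on its default port 9998), and the helper name `build_tika_text_request` is hypothetical.

```python
import urllib.request


def build_tika_text_request(
    data: bytes, server: str = "http://localhost:9998"
) -> urllib.request.Request:
    """Prepare a PUT request asking a Tika Server to extract plain text."""
    return urllib.request.Request(
        url=f"{server}/tika",             # text-extraction endpoint
        data=data,                        # raw bytes of the uploaded file
        method="PUT",
        headers={"Accept": "text/plain"}, # ask for plain text, not XHTML
    )


# Placeholder bytes; in practice this would be the contents of a real file.
req = build_tika_text_request(b"%PDF-1.4 placeholder")

# Sending the request requires a running Tika Server (assumed, not shown):
# with urllib.request.urlopen(req) as resp:
#     text = resp.read().decode("utf-8")
```

Tika Server's `/meta` endpoint can be called the same way to retrieve metadata properties instead of text.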
Installation
This extension can be installed using Composer.
The complete installation and configuration instructions can be found in README.md.
Configuration parameters
The complete description of configuration parameters can be found in README.md.