html.duckduckgo.com serves captchas to all of my IPs very quickly. I found that
duckduckgo.com still works even when html.duckduckgo.com is captcha-ed, hence
this adds support for duckduckgo.com's general web search.
This implementation fetches the link to the first API page
(i.e. ``links.duckduckgo.com/d.js?...``) from duckduckgo.com and uses the ``n``
parameter of the API to fetch all subsequent pages.
This also means that it's not possible to jump directly to the third page: the
first and the second page have to be loaded first.
The reason why we can't simply build the API URLs from the `vqd` value is that
they require an additional parameter `dp`, which seems to be generated
server-side, so we can't build it ourselves and must scrape it from the HTML pages.
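
A minimal sketch of the paging constraint described above, not the actual
engine code. The regular expression, the `https://` prefix and the names
`extract_first_api_url` and `PageLinks` are assumptions made for illustration;
the real HTML and API response formats may differ.

```python
import re
from typing import Dict, Optional, Tuple

# Assumption: the first API URL appears verbatim somewhere in the HTML page.
_API_URL_RE = re.compile(r"links\.duckduckgo\.com/d\.js\?[^\"'\s<>]+")


def extract_first_api_url(page_html: str) -> Optional[str]:
    """Return the first d.js API URL found in the HTML, or None."""
    match = _API_URL_RE.search(page_html)
    if match is None:
        return None
    return "https://" + match.group(0)


class PageLinks:
    """Remember the API URL of each page that has been discovered so far.

    Page N+1 only becomes known after page N has been loaded, so a lookup
    for a page that has not been reached yet returns None.
    """

    def __init__(self) -> None:
        self._urls: Dict[Tuple[str, int], str] = {}

    def remember(self, query: str, pageno: int, url: str) -> None:
        self._urls[(query, pageno)] = url

    def url_for(self, query: str, pageno: int) -> Optional[str]:
        return self._urls.get((query, pageno))
```

With such a store, a request for page 3 can only be answered once pages 1 and 2
have been fetched and their follow-up links remembered.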
The initialization of the DB schema ("base schema") has so far been done on
demand, which causes race conditions with competing threads and processes.
The DDL statements for creating the "base schema" are now executed as part of
the initialization of the app.
Further improvements were made to harden the database applications:
- The Wikidata & Radio-Browser engines perform their initialization only once
(so far the initialization was carried out in each thread/process).
- If multiple processes try to set the DB's WAL mode when opening the DB at the
same time, this usually leads to another race condition, which is now also caught.
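
A minimal sketch of how such a WAL race could be caught, assuming Python's
`sqlite3` module. The helper name `connect_wal` and the retry/delay values are
hypothetical, and retrying is only one possible way of handling the error; the
actual change may handle it differently.

```python
import sqlite3
import time


def connect_wal(path: str, retries: int = 5, delay: float = 0.05) -> sqlite3.Connection:
    """Open a SQLite DB and switch it to WAL mode, tolerating lock races."""
    con = sqlite3.connect(path)
    for attempt in range(retries):
        try:
            # Switching the journal mode needs a lock on the DB file; if
            # another process does the same at this moment, SQLite may raise
            # "database is locked".
            con.execute("PRAGMA journal_mode=WAL")
            return con
        except sqlite3.OperationalError:
            if attempt == retries - 1:
                con.close()
                raise
            time.sleep(delay)
    return con
```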
Related:
- https://github.com/searxng/searxng/issues/6181#issuecomment-4586705
Closes: #6181
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Add support for https://tiger.ch (general, news)
It is disabled and inactive by default because it's just a metasearch engine
like SearXNG is, so it's mostly useful for bypassing rate limits on other
engines (it has its own German index, but it's not that great). In theory it
supports different locales, but I was too lazy to implement that (I only need
German and English results anyway, which are returned by default).
Add support for https://seek.ninja (general)
It's very slow because the engine uses server-sent events, which incrementally
send data in the HTTP response [1].
I.e. we wait for the end of the response (7+ seconds), even though the result
data arrives within a few seconds, because SearXNG wants to get the full
response body before it calls the `response(resp)` method.
We could use httpx-sse [2], but I'm not sure how to integrate this into SearXNG
and whether it's worth it.
[1] https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/
[2] https://github.com/florimondmanca/httpx-sse
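
A minimal sketch of parsing a server-sent events body after the full response
has been received, which is the situation described above. The function name
`parse_sse` is hypothetical and the event names in the usage are made up; the
actual payload format of the engine is not shown here.

```python
import re
from typing import List, Tuple


def parse_sse(body: str) -> List[Tuple[str, str]]:
    """Split a complete SSE body into (event type, data) tuples."""
    events: List[Tuple[str, str]] = []
    data_lines: List[str] = []
    event_type = "message"  # default event type when no "event:" field is set

    for line in re.split(r"\r\n|\n|\r", body):
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                events.append((event_type, "\n".join(data_lines)))
            data_lines = []
            event_type = "message"
        elif line.startswith(":"):
            continue  # comment line, often used as keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is stripped
            if field == "data":
                data_lines.append(value)
            elif field == "event":
                event_type = value

    # Lenient: also keep a trailing event that lacks the final blank line.
    if data_lines:
        events.append((event_type, "\n".join(data_lines)))
    return events
```

This does not remove the wait for the end of the response; it only shows how
the buffered body could be split into events inside `response(resp)`.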
The engine has been using very aggressive Cloudflare blocking for
a while now, regardless of whether a normal browser like Firefox
is used or not.
Closes: https://github.com/searxng/searxng/issues/5976
On the first page of the WEB search, there are, among other things, sections for
videos and news. The video results from these sections should not be used as
results in the WEB search of SearXNG.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
PodcastIndex.org started using a Proof-of-Work JavaScript
challenge whose results are sent as `X-Pow-*` request headers.
Although it is technically possible to re-implement the
PoW challenge in Python, it's likely impossible to maintain
because
- the actual Proof-of-Work logic might change very often
- the whole idea of the Proof-of-Work challenge is to use
a "big" amount of resources (about 1s on my PC); so executing the challenge
would block almost all other work on the SearXNG instance
At first glance, the challenge looks very similar to what
Anubis does, because it also uses SHA256 hashes.
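
To illustrate why such a challenge is expensive, here is a generic
hashcash-style SHA256 sketch. It is not PodcastIndex's algorithm and not
necessarily what Anubis does; the function name `solve_pow`, the challenge
string and the leading-zeros rule are assumptions for illustration only.

```python
import hashlib


def solve_pow(challenge: str, difficulty: int) -> int:
    """Find a nonce so that sha256(challenge + nonce) starts with
    `difficulty` zero hex digits (hypothetical rule)."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1  # brute force: expected work grows 16x per extra digit
```

The loop is pure CPU work, so running it inside the instance would compete
with all other request handling for as long as the search takes.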