
We’ve Updated the Malware API to Identify Real-Time Abuse

In May of 2024 we introduced our Malware API, and we briefly explained how we'd use the facility to mitigate the onslaught of illicit web traffic looking for server and plugin vulnerabilities that might be exploited. While the service has made a significant impact on blocking malicious traffic, it was far from perfect, and an incident last year resulting from an out-of-date (third-party) website plugin caused problems that once again forced us to rethink our approach. While the Malware API is excellent, it lacked the scalability to detect traffic in real-time and catch intrusions made with clear malicious intent. While it isn't perfect either, the Abuse API rectifies this with a scalable and unforgiving system.

First, it needs to be stated that no server is entirely secure, and there are thousands of requests made to your website every day from malicious actors looking to exploit any possible vulnerabilities. Regular off-site backups are essential to ensure you can restore your website after it becomes compromised... and your website will require restoration at some point.

The rise of AI has seen a ridiculously large number of crawlers deployed, presumably to train various AI models. We first noticed the problem a few years ago and progressively started to blacklist IP addresses associated with those that consumed or abused excessive server resources, but this wasn't enough. There's been a significant rise in traffic that is clearly looking to exploit vulnerabilities, and there's a rise in the number of commercial systems that do little other than archive data that can be sold to others (providing you absolutely nothing in return). In the former camp we have thousands of Chinese AI training bots, and in the latter category we have (but are far from limited to) groups such as SEMrush and Ahrefs that indiscriminately crawl websites without regard to server instructions, robots.txt files, or common decency.

There's really only a handful of services that we need to openly whitelist, such as Google, Bing, LinkedIn, Twitter, and Facebook. Everything else is noise, and everything else is a service using your data and our resources without permission or purpose, and the net effect is that they seriously degrade server performance and website speed. To provide an example of how blocking nonsense traffic can improve server performance, before we switched on the Abuse API we had over three-million page requests a day from those that were draining our resources for their own purposes - legitimate or otherwise, but certainly without any benefit that could offset the resulting loss of page speed - but when Abuse was activated, those requests were muted immediately.

AI Crawling: AI crawling is becoming a big problem. It's not uncommon to have multiple AI bots (that aren't identifying themselves as such with each request) making multiple calls to pages simultaneously. Because each of our broker websites has well over 50-million pages, the resulting API requests and caching slowly began to cause major problems (caching alone resulted in terabytes of data). Despite mechanisms in place to prevent abuse, most crawlers simply ignore the standard instructions and do whatever they want. This disregard for crawling etiquette applies to some well-known companies such as Alibaba, which has crawled millions of pages in just the last few weeks.

Permitted Crawlers: Even permitted crawlers, such as Google, Bing, and Facebook, often crawl at a rate and in page numbers that we'd consider unacceptable. In a scheduled plugin update we've made provisions to exclude the Streets Module and other 'general' resources from being crawled by any identified bot.

What we've described with generic crawlers is one part of a bigger picture. Requests made to the server for specific 'prohibited' files, certain URL patterns, multiple requests to the same page in a short period of time, or requests made to multiple pages in a short period of time, can all be considered malicious, and it's here where we take no prisoners: we immediately ban that IP based on resolved intent. It's this secondary abuse detection 'firewall' where the real magic happens.
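
The rules described above might be sketched roughly as follows. This is only an illustration of the idea, not our production code: the class name, the prohibited patterns, the 10-second window, and the thresholds are all assumptions made for the example.

```python
import re
from collections import defaultdict, deque

# Hypothetical rule values; the real patterns and thresholds are not published.
PROHIBITED_PATTERNS = [
    re.compile(r"/\.env$"),
    re.compile(r"/wp-config\.php"),
    re.compile(r"\.\./"),
]
WINDOW_SECONDS = 10.0
MAX_REQUESTS_PER_WINDOW = 20   # requests to any pages within the window
MAX_SAME_PAGE_PER_WINDOW = 5   # requests to one page within the window


class AbuseDetector:
    """Tracks recent requests per IP and bans on the first rule violation."""

    def __init__(self):
        self._hits = defaultdict(deque)  # ip -> deque of (timestamp, path)
        self.banned = {}                 # ip -> reason for the ban

    def check(self, ip, path, now):
        """Return a ban reason, or None if the request is allowed."""
        if ip in self.banned:
            return self.banned[ip]
        reason = self._evaluate(ip, path, now)
        if reason:
            self.banned[ip] = reason
        return reason

    def _evaluate(self, ip, path, now):
        # Rule 1: prohibited files and URL patterns ban immediately.
        if any(p.search(path) for p in PROHIBITED_PATTERNS):
            return "prohibited-path"
        hits = self._hits[ip]
        hits.append((now, path))
        # Drop requests that have fallen out of the sliding window.
        while hits and now - hits[0][0] > WINDOW_SECONDS:
            hits.popleft()
        # Rule 2: too many requests to the same page in a short period.
        if sum(1 for _, p in hits if p == path) > MAX_SAME_PAGE_PER_WINDOW:
            return "same-page-flood"
        # Rule 3: too many requests across pages in a short period.
        if len(hits) > MAX_REQUESTS_PER_WINDOW:
            return "multi-page-flood"
        return None
```

A caller would invoke `check(ip, path, time.time())` before serving each request and refuse anything that returns a reason. Once an IP is banned, every later request from it returns the original reason.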

Existing Security Plugins: Why aren't we using an existing security plugin? Security and Malware plugins are normally extremely bloated with upsells and other sellware that is itself a 'type' of malware. Because of their open source availability they're also prone to exploits themselves, such as the popular Really Simple Security which recently impacted over 4-million users. Because these plugins make requests on every page load, they're notoriously slow (albeit essential), and can often cause pages to hang. Most security plugins charge a premium subscription which is just another cost, so it's obviously far more attractive for us to simply bundle the premium services into your website and/or hosting package. There are some good Malware plugins that we might call upon from time-to-time, such as gotmls.net, but they're a reactive tool rather than any real-time blocker. Bottom line: we simply prefer to maintain control over all our tech.

Xena: A large number of responses were added to the Xena API, which returns data that seeks to evaluate sources as abusive. This includes responses for high-frequency page views, illogical browsing patterns, excessive views of certain modules (such as BSB or Street data) in a single session, or perhaps browsing via an IP host that is known to have visited at least one other website. Much of this data is muted once the source is blocked, but 'other' data sets are available when querying behaviour outside of our defined parameters. Similar data responses are available via the Malware API.

We've built a large number of detection algorithms, and we've planned for a number of others in the short term, but the current net result observed after we activated Abuse is that server load immediately dropped by a factor of more than 50. It's picked up a little since then, but page speed is significantly faster.

Discussion Video: Since we first introduced the layer of protection to clients a couple of weeks back, we've made seriously significant changes. The results of the program were so positive that we immediately invested resources into product development. Many of the changes were made on the basis of early client feedback.

The Abuse API

The Abuse API is available as a standard RESTful API. The unpublished endpoint (not requiring authentication) returns a standard JSON response or a numeric value of 1 or 0. The results of the bans, IP scans, and reasoning are returned via the standard Malware API. If you're not hosted by us, you simply send a GET request to the applicable endpoint with the IP address, page URL, and forwarder/referrer as URL parameters.
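
As a rough sketch of what such a request might look like from a site not hosted by us: the endpoint is unpublished, so the base URL below is a placeholder, and the parameter names (`ip`, `url`, `referrer`), the JSON field (`banned`), and the reading of 1 as "banned" are all assumptions made for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder only: the real endpoint is unpublished.
ABUSE_ENDPOINT = "https://api.example.com/abuse"


def build_abuse_url(ip, page_url, referrer=""):
    """Build the GET URL; the parameter names are hypothetical."""
    query = urlencode({"ip": ip, "url": page_url, "referrer": referrer})
    return f"{ABUSE_ENDPOINT}?{query}"


def parse_abuse_response(body):
    """Interpret either a bare 1/0 or a JSON body.

    Assumes 1 means 'banned' and that JSON carries a 'banned' field;
    neither detail is confirmed by the published description.
    """
    text = body.strip()
    if text in ("0", "1"):
        return text == "1"
    data = json.loads(text)
    return isinstance(data, dict) and bool(data.get("banned"))


def check_request(ip, page_url, referrer=""):
    """Send the GET request and return True if the source is banned."""
    with urlopen(build_abuse_url(ip, page_url, referrer), timeout=5) as resp:
        return parse_abuse_response(resp.read().decode("utf-8"))
```

Separating URL construction and response parsing from the network call keeps both easy to test without touching the live service.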

Abuse Data is Shared: All detected abuse of any kind results in the applicable IP address being sent to an open repository that is accessed by a large number of others. This same source of data is routinely synced with our own server.

The Matrix API

The Matrix API is our own service that will return various results of every known broker website in the Australian industry. Used to index the 'Financial Web' for research, educational, and comparative purposes, it's also used to feed BeNet (our own AI) and other systems... but it is a bot. We take great care to ensure we don't make more than one single request to any website in any three-minute period, and we observe all robots.txt and other server instructions. We clearly identify the bot as a bot, and the header includes our details for Matrix removal. It's worth mentioning this given I've directed a great deal of criticism to 'other' bots, but it's also worth noting that brokers have full and complete access to Matrix data that can be used to improve upon their web presence.
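
The crawling etiquette described above (honour robots.txt, and never more than one request to a website in any three-minute period) can be sketched as below. This is not the Matrix crawler itself: the class name, the user-agent string, and the choice to refuse any host whose robots.txt hasn't been loaded are assumptions for the example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

MIN_INTERVAL_SECONDS = 180.0  # one request per website per three minutes
# Hypothetical identifier; a real bot would publish its own contact details.
USER_AGENT = "ExampleMatrixBot/1.0 (+https://example.com/bot)"


class PoliteScheduler:
    """Decides whether a URL may be fetched right now."""

    def __init__(self):
        self._last_request = {}  # host -> timestamp of last permitted fetch
        self._robots = {}        # host -> parsed robots.txt rules

    def load_robots(self, host, robots_txt):
        """Parse and store the robots.txt content for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def may_fetch(self, url, now):
        host = urlsplit(url).netloc
        parser = self._robots.get(host)
        # Refuse until robots.txt has been loaded, and honour its rules.
        if parser is None or not parser.can_fetch(USER_AGENT, url):
            return False
        # Enforce the per-website minimum interval.
        last = self._last_request.get(host)
        if last is not None and now - last < MIN_INTERVAL_SECONDS:
            return False
        self._last_request[host] = now
        return True
```

The timestamp is only recorded when a fetch is actually permitted, so a refused request never extends the waiting period.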

Whitelisting IP Addresses and Hosts

Clients will want to whitelist their own IP address to ensure that they don't block themselves. Because we observe strict banning criteria it's quite easy to have yourself banned if behaviour indicates abuse of any type, so you may whitelist your own IP from within Yabber.

To whitelist an IP address or host, navigate your way to the 'Security' panel from within the Website module and select the padlock icon. Enter an IP or Host on the applicable page and you'll be whitelisted (your current IP will be shown by default).

Pictured: Enter an IP address and submit. If you ever see a '403 Not Authorised' message in your browser, this is the action you'll need to take. You may also submit a host (all Xena IP addresses are resolved to a host and then measured against a whitelist of permitted sources). Abuse runs at the server level before a page is loaded, so any ban applies globally. All whitelisted requests are immediate but later evaluated manually.
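
The way an IP is resolved to a host and measured against a whitelist might look something like the following. It's a sketch under stated assumptions rather than the Xena implementation: the network range and host suffixes are illustrative values, and the function names are invented for the example.

```python
import ipaddress
import socket

# Illustrative values only; the real whitelist is managed from within Yabber.
WHITELISTED_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]
WHITELISTED_HOST_SUFFIXES = ("googlebot.com", "search.msn.com")


def reverse_lookup(ip):
    """Resolve an IP address to a host name, or None if it doesn't resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def is_whitelisted(ip, resolve=reverse_lookup):
    """True if the IP, or the host it resolves to, is on the whitelist."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in WHITELISTED_NETWORKS):
        return True
    host = resolve(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    # Match the exact host or a true subdomain, never a bare string suffix,
    # so 'evilgooglebot.com' cannot pass as 'googlebot.com'.
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in WHITELISTED_HOST_SUFFIXES
    )
```

The resolver is passed in as a parameter so the check can be exercised without live DNS. A stricter version would also confirm that the resolved host maps back to the same IP address.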

In a future update to Yabber we will require you to whitelist any IP address used to log into your website. We may also supplement this with an authentication application or SMS verification code, but we're weighing options of convenience so we don't end up making the simple process of logging into your website anything other than a simple process.

In the time since we wrote the above paragraph, and before publishing this article, we've been given reason to update the system to require that you whitelist your IP address before you log into your website. You run a financial website, so we all need to adhere to the highest of standards... so, while a pain, it's necessary.

Why Force the Whitelist Requirement?: We forced the whitelist requirement only after scanning a sample size of a few hundred websites other than those on our servers. A massive 27% of them returned a positive hit for active malware. One aggregator returned a positive malware detection for every website under their control. So, the requirement came from necessity - we don't want to gamble with your site's integrity or 'take chances' when the massive advantages come at the cost of only a mild inconvenience. Our clients hold themselves to a higher standard than others.

The Abuse Website Plugin

We'd encourage any business looking for another layer of security for their website to make contact with us for access to a new 'Abuse' website plugin. Most malware plugins slow websites down considerably and include thousands of lines of code, but ours is extremely lightweight at only 20 lines of code. Reporting facilities are provided by way of the Malware API.

Conclusion

Since implementing this new programmatic layer of protection we've managed to eliminate almost all abuse. However, we will experience false positives and we will likely block legitimate sources of traffic based on an aggressive assessment of behaviour. If you encounter any errors, please make contact with us so we can update the system. Abuse is new, it isn't perfect, and it'll be a project that continues to be developed.
