URLhaus File Grab
For October, I’m back to leveraging public submission sites for malware samples. This month I am using results from URLhaus, which is self-described as ‘a project from abuse.ch with the goal of sharing malicious URLs that are being used for malware distribution.’
https://github.com/triw0lf/urlhaus_scripts/blob/master/urlhaus_badfiles.sh
URLhaus has a wealth of research and analysis features that they make available for free. The most exciting part? As long as you don’t want to submit URLs, you don’t need an API key! Once again we find there are amazing free resources with a relatively low barrier to entry for API usage and research purposes. URLhaus provides fantastic documentation that I would recommend reviewing outside of this script.
This script will be using the database dumps that URLhaus publishes every five minutes. These come in CSV format and you can preview what the data looks like here. When you are testing, make sure you don’t pull this CSV more than once every five minutes!
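If you want a guard against pulling the dump too often while testing, here is a minimal sketch of one way to do it. This is not taken from the script itself: the dump URL is my assumption of where the online CSV lives (check it against the URLhaus documentation), and `DUMP_FILE`, `dump_is_stale`, and `refresh_dump` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the URL and file name below are assumptions, not the script's values.
DUMP_URL="https://urlhaus.abuse.ch/downloads/csv_online/"   # assumed dump location
DUMP_FILE="${DUMP_FILE:-urlhaus_online.csv}"                # hypothetical cache path
MIN_AGE=300                                                 # five minutes, in seconds

# Succeeds (exit 0) when the cached file is missing or at least MIN_AGE seconds old.
dump_is_stale() {
    local file="$1" now mtime
    [ -f "$file" ] || return 0
    now=$(date +%s)
    # GNU stat first, BSD/macOS stat as a fallback.
    mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")
    [ $(( now - mtime )) -ge "$MIN_AGE" ]
}

# Download to a temp name first so a failed pull never clobbers the cached copy.
refresh_dump() {
    if dump_is_stale "$DUMP_FILE"; then
        wget -q -O "$DUMP_FILE.tmp" "$DUMP_URL" && mv "$DUMP_FILE.tmp" "$DUMP_FILE"
    else
        echo "Cached dump is under five minutes old; skipping download." >&2
    fi
}
```

Calling `refresh_dump` at the top of a test run means repeated runs inside the five-minute window reuse the cached copy instead of hitting URLhaus again.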
Once the script grabs all online submissions from the URLhaus Online Database Dump, it splits each CSV row on its delimiters and checks for entries added on the current date. The script also keeps only submissions whose URLs end in a file rather than a directory, which reduces the number of false-positive index files you will receive. Once those results are matched, the script sorts them down to unique submissions and passes the date and raw submission URL to a new holding file. The holding file is then read into a while loop that attempts to wget each file.
The sample download portion of this script has been tuned to emulate a standard endpoint instead of looking like an obvious research server. Specific wget headers have been added to make the user agent appear to be Google Chrome on Windows 10 and the host appear to be English-speaking. If you run into malware that claims to be online but can’t be downloaded, I recommend playing around with the other common header variations that I included in the script.
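To make that flow concrete, here is a rough sketch of the filter-and-download steps. It is an illustration under assumptions rather than a copy of the linked script: I am assuming the dump rows are double-quoted fields with the date added in column two and the URL in column three, that "ends with a file" can be approximated as "the path ends in a dot plus an extension", and the function names, user agent string, and Accept-Language value are placeholders I picked.

```shell
#!/usr/bin/env bash
# Sketch only: column positions, the file regex, and header values are assumptions.

# Read a URLhaus CSV dump on stdin; print "<date> <url>" for rows added on the
# given day whose URL path ends in something that looks like a file name.
filter_todays_files() {
    local day="$1"
    awk -F'","' -v day="$day" -v re='^https?://[^/]+/.*[.][A-Za-z0-9]+$' '
        /^#/ { next }                                   # skip comment header lines
        index($2, day) == 1 && $3 ~ re { print day, $3 }
    ' | sort -u                                         # keep unique submissions
}

# Read "<date> <url>" lines from a holding file and try to wget each URL
# while presenting as Chrome on Windows 10 with an English locale.
fetch_samples() {
    local holding="$1" outdir="$2" added url
    mkdir -p "$outdir"
    while read -r added url; do
        wget -q --tries=1 --timeout=15 -P "$outdir" \
            --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36" \
            --header="Accept-Language: en-US,en;q=0.9" \
            "$url" || echo "failed: $url ($added)" >&2
    done < "$holding"
}
```

A run would look something like `filter_todays_files "$(date -u +%F)" < urlhaus_online.csv > holding.txt` followed by `fetch_samples holding.txt ./samples`, assuming the dump timestamps are in UTC. Note that the extension check drops payloads served from extensionless paths, so loosen the regex if you want those too.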
For resources on changing your header values, I recommend the following:
Before using this script, make sure you have the following set up:
A research server for collection
Some sort of malware analysis lab or automated malware ingestion framework
(NOT REQUIRED) Alias set for script
A sneak peek into what running the script looks like:
You can find me at my Contact page or on Twitter - @jotunvillur. I’m open for questions, feedback, or general chitchat about security!