How can Malware be detected by using Machine Learning algorithms?

In the previous post, we discussed the basics of malware: what malware is, the types of malware, and recent malware statistics published by Kaspersky Lab and in the McAfee Labs Threats Report. See the Introduction to Malware post for that background.

Malware, short for “malicious software,” can infect your computer to the point where it collects your personal data, gains access to programs or systems on your network, and prevents your computer from running efficiently. Malware comes in various types, such as viruses, ransomware, worms, rootkits, keyloggers, and Trojans. Individuals and organizations rely heavily on security mechanisms such as antivirus, antimalware, and anti-ransomware software for protection against malware. However, the methods used by this software are not enough to detect and prevent most malicious activity, and the software itself consumes a large amount of the host machine’s resources during normal operation.

Malware detection techniques can be broadly classified into signature-based and behavior-based methods.
Signature-based analysis is a static method that relies on predefined signatures. A signature can be a sequence of bytes that is unique to a program, or a cryptographic hash of a file’s contents. Antivirus tools attempt to identify malware by comparing the hashes and byte patterns of files on a system against a repository of signatures of known malware. New signatures have to be added to a list that is pushed to clients. From a client’s perspective, this means constant updates and a slow response to new malware variants. These techniques are not very helpful against new or unknown malware, which is why a large gap remains in the industry despite several studies in this area.
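
As a rough illustration of the idea above, the following Python sketch checks file contents against a small signature set. The hash entry and the byte pattern are invented for this example and are not real malware signatures; `matches_signature` is a hypothetical helper name, not part of any antivirus product.

```python
import hashlib

# Hypothetical signature database: every entry here is made up for illustration.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"pretend this is a known malicious file").hexdigest(),
}
KNOWN_BAD_BYTE_PATTERNS = [
    b"\xde\xad\xbe\xef\x13\x37",  # invented byte sequence, not a real signature
]


def matches_signature(data: bytes) -> bool:
    """Return True if the content matches a known hash or contains a known byte pattern."""
    if hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256:
        return True
    return any(pattern in data for pattern in KNOWN_BAD_BYTE_PATTERNS)


known = b"pretend this is a known malicious file"
variant = known + b" "  # a single extra byte changes the hash
embedded = b"header" + b"\xde\xad\xbe\xef\x13\x37" + b"payload"

print(matches_signature(known))     # True: exact hash match
print(matches_signature(variant))   # False: the modified file slips past the hash
print(matches_signature(embedded))  # True: byte pattern found inside the file
```

The `variant` case shows the weakness described above: a trivially modified sample no longer matches its hash signature until a new signature is published.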

Antivirus vendors therefore had to come up with another approach: behavior-based detection. In this method, the actual behavior of a sample is observed during its execution, looking for signs of malicious behavior such as modifying host files, changing registry keys, or establishing suspicious connections. Because executing a malicious sample can lead to undesired consequences (e.g., file deletion and modification, loss of confidential information, and changes in the registry), this dynamic analysis must be performed in a safe environment, or sandbox: a confined execution system used to isolate and run untrusted software and observe its behavior. With a sandbox, dynamic analysis can be performed without worrying about the changes that occur while the suspicious sample runs. The disadvantage of dynamic analysis is that executing a sample often takes a significant amount of time (e.g., five minutes). This poses a challenge, because fast analysis is essential for coping with the ever-increasing volume of malware that appears every day.
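
To make the behavior-based idea concrete, here is a minimal Python sketch that scores a list of events recorded while a sample ran in a sandbox. The event names, weights, and threshold are all assumptions made up for this example; a real sandbox would report its own event schema, and real products use far richer rules.

```python
# Hypothetical event types and weights; a real sandbox emits its own schema.
SUSPICIOUS_WEIGHTS = {
    "modify_hosts_file": 3,
    "write_registry_run_key": 3,
    "connect_unknown_host": 2,
    "delete_user_file": 2,
}
ALERT_THRESHOLD = 5  # assumed cut-off, chosen only for this example


def behavior_score(events):
    """Sum the weights of the suspicious events seen in one execution trace."""
    return sum(SUSPICIOUS_WEIGHTS.get(event, 0) for event in events)


def is_suspicious(events):
    """Flag a trace whose total score reaches the assumed threshold."""
    return behavior_score(events) >= ALERT_THRESHOLD


# Made-up traces standing in for sandbox output.
benign_trace = ["read_config", "open_window", "connect_unknown_host"]
malicious_trace = ["modify_hosts_file", "write_registry_run_key", "connect_unknown_host"]

print(behavior_score(benign_trace), is_suspicious(benign_trace))        # 2 False
print(behavior_score(malicious_trace), is_suspicious(malicious_trace))  # 8 True
```

Scoring is cheap; the expensive part is producing the trace, since the sample has to run long enough to reveal its behavior.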

In recent developments, these existing techniques are therefore combined with machine learning to identify malicious activity in a system. For the system to tell whether a file is safe or malicious, it must first learn the distinguishing features of both types. It is therefore important to apply existing, relevant machine learning algorithms to train the system. Doing so not only improves the system’s ability to distinguish malicious from legitimate files, but can also reduce the time and resources consumed in detecting malware.
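
The learning step can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two features (byte entropy and fraction of printable bytes), the nearest-centroid classifier, and all sample data are choices made for this illustration, not a method described in this post. The "malicious" samples are synthetic high-entropy byte strings standing in for packed payloads, not real malware.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def extract_features(data: bytes) -> list:
    """Assumed feature vector: [entropy scaled to 0..1, fraction of printable ASCII bytes]."""
    printable = sum(32 <= b < 127 for b in data) / max(len(data), 1)
    return [byte_entropy(data) / 8.0, printable]


def train_centroids(samples):
    """samples: list of (features, label). Returns {label: mean feature vector}."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Synthetic training data, labeled by hand for this example only.
training = [
    (extract_features(b"The quick brown fox jumps over the lazy dog. " * 20), "benign"),
    (extract_features(b"config.setting = value\n" * 40), "benign"),
    (extract_features(bytes(range(256)) * 4), "malicious"),
    (extract_features(bytes(range(0, 256, 2)) * 8), "malicious"),
]
centroids = train_centroids(training)

text_like = b"hello world, this is a plain text document.\n" * 10
high_entropy = bytes(reversed(range(256))) * 2

print(predict(centroids, extract_features(text_like)))
print(predict(centroids, extract_features(high_entropy)))
```

Once trained, classifying a new file only requires computing its features and two distances, which is where the savings in time and resources would come from. A real system would need many labeled samples and far more informative features.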

