• Open Access

Privacy theft malware multi-process collaboration analysis

Authors

  • Lejun Fan,

    1. Research Center of Web Data Science and Engineering, Institute of Computing Technology, Chinese Academy of Science, Beijing, China
  • Yuanzhuo Wang,

    Corresponding author
    1. Research Center of Web Data Science and Engineering, Institute of Computing Technology, Chinese Academy of Science, Beijing, China
    • Correspondence: Yuanzhuo Wang, Institute of Computing Technology, Chinese Academy of Science, Beijing, China.

      E-mail: wangyuanzhuo@ict.ac.cn

  • Xueqi Cheng,

    1. Research Center of Web Data Science and Engineering, Institute of Computing Technology, Chinese Academy of Science, Beijing, China
  • Jinming Li,

    1. Research Center of Web Data Science and Engineering, Institute of Computing Technology, Chinese Academy of Science, Beijing, China
  • Shuyuan Jin

    1. Research Center of Web Data Science and Engineering, Institute of Computing Technology, Chinese Academy of Science, Beijing, China

ABSTRACT

Privacy theft malware has become a serious and challenging problem for cyber security. Previous methods fall into two categories: one focuses on outbound network traffic, and the other examines the information flow inside the program. We combine dynamic behavior analysis with network traffic analysis and present an abstract model called Privacy Petri Net (PPN), which is applicable to more kinds of malware and easier for users to understand. Because new malware adopts multi-process techniques, we also use PPN to model the collaborative behaviors between different malicious functionality modules. We apply our approach to real-world malware, and the experiment results show that it can effectively find the categories, content, source, and destination of the privacy theft behavior of a malware sample. Copyright © 2013 John Wiley & Sons, Ltd.

1 INTRODUCTION

The development of the Internet strengthens the connections among people and makes our lives more convenient, but meanwhile, much private information is stored and sent without proper protection. Such privacy data can be stolen by malware authors for economic gain. Thus, privacy theft malware has become a serious and challenging problem for cyber security.

However, anti-virus software (AVS) companies do not give privacy theft malware proper attention, nor do they have sufficient capability to handle it. Privacy theft malware can therefore evade detection and threaten the privacy of users. On the one hand, some privacy theft behaviors are not identified as malicious by AVS; on the other hand, some privacy theft malware samples reported by AVS come with too little detail for the user to understand the threat level. In addition, the multi-process technique adopted by new malware families can evade many traditional AVSs, which focus on single-process malware [1, 2].

Recent research follows two road maps: the black-box approach focuses on the input data and outbound network traffic of malware, whereas the white-box approach examines the information flow inside the malware. Approaches that focus on network traffic can rapidly find privacy data whose format is well defined, such as credit card numbers [3, 4]. However, these approaches are limited by packet obfuscation techniques such as encrypted connections, message reordering, and traffic randomization [4, 5]. The other approaches, which focus on internal information flow, can recover more accurate details about privacy leak behaviors [6, 7]. These methods can be further divided into static analysis and dynamic analysis. Static analysis extracts the data flow from the binary executable file of the target malware [8, 9], but it is hindered by code obfuscation techniques such as code morphing, packers, and opaque constants [10, 11]. Thus, dynamic analysis, which obtains the runtime data flow by tracing the execution of the target malware, is more widely adopted for malicious behavior analysis [12, 13]. Furthermore, because new malware adopts multi-process techniques [1, 2], researchers also try to understand the behavior of multiple collaborative processes [14, 15].

Our goal in this paper is to use dynamic analysis to detect privacy theft malware. Moreover, we monitor the processes related to the suspicious process to discover the multi-process collaborative behavior of privacy theft malware. We divide privacy theft behavior into four categories: normal file data theft, application-related data theft, system-related data theft, and dynamic input data theft. (1) In the first category, privacy data stored statically as text, image, audio, and video files are stolen. (2) In the second category, application-related privacy data, such as the browsing history of web browsers and the playlists of media players, are stolen. (3) In the third category, system data, such as the system profile, user information, and application configuration, are stolen. (4) In the final category, private data that are input dynamically, such as keystrokes, mouse clicks, and video captured from the camera, are stolen. All of these privacy data are gathered and sent covertly to remote servers without the user's informed consent. We consider three multi-process work modes of malware. (1) Relay race mode splits the main functionality flow into two or more parts and creates an independent process for each part. These processes work sequentially as a “relay race.” (2) Master slave mode uses a master process to create and control the slave processes and to recreate slave processes when old ones are detected and terminated by the user or the AVS. (3) Dual active mode uses two or more processes with the same functionality that work simultaneously and watch over each other; if any process is terminated, the other processes immediately take over the current work and restart the terminated process.

To model the aforementioned behavior of multi-process privacy theft malware, this paper builds on an abstract model called Privacy Petri Net (PPN), which we presented in [16]. PPN characterizes the whole privacy theft procedure with a high-level description, which makes our analysis results applicable to more kinds of malware and easier for users to understand. PPN is a kind of high-level Petri Net that focuses on privacy theft and has three main features. Firstly, PPN provides formal mathematical definitions of syntax and semantics for calculating privacy attributes and functions. Secondly, PPN has concise and powerful modeling primitives for graphical abstraction of the privacy theft procedure. Finally, PPN is modularized and can be used to build various hierarchical models. These modules abstract different malicious functionalities that are accomplished by multiple collaborative malicious processes.

We apply our approach to different kinds of real-world malware. The experiment results show that our approach can not only effectively detect the privacy theft behavior of different kinds of multi-process collaboration but also find the categories, content, destination, and theft procedure in detail.

The rest of this paper is organized as follows: Section 2 introduces the background of privacy theft problem. Section 3 shows the detail of PPN and modeling of different privacy theft categories with PPN. Section 4 discusses the experiment results on real-world malware. Section 5 concludes the paper.

2 BACKGROUND

In this section, we give the details of privacy data source categories, then show procedures of the privacy theft, and finally introduce the mode of multi-process collaboration.

2.1 Privacy data source

As Samuel D. Warren and Louis D. Brandeis defined in 1890, “privacy is the right to be let alone.” In a broad sense, privacy is the independence of the individual. It includes many aspects, such as physical privacy, spiritual privacy, and information privacy. In computer science, the privacy of data that contain private information is called data privacy.

According to the storage form in the computer, we divide the privacy data source into the following four categories:

  • 1. Normal file data source

Most privacy data sources are stored statically as text, image, audio, and video files, such as personal or confidential notes, resumes, documents, photos, and video records. These kinds of data sources are easy to read and send out.

  • 2. Application-related data source

Some frequently used applications, such as web browsers, video players, e-mail clients, and address book managers, store large amounts of user-related privacy data in special data formats. These kinds of data sources can be used to gather or infer user favorites, website cookies, usage habits, social relationships, and so on.

  • 3. System-related data source

System information is also an important kind of privacy data source. The basic machine information of computer, operating system (OS) profile, system configuration, and user configuration are all prone to leakage and are analyzed to gather privacy information. For example, media access control address is unique and persistent to a network interface, which can be used to track the system and its user. A machine name can also reveal personal information.

  • 4. Dynamic data source

Another important kind of privacy data source differs from the aforementioned three kinds: the data are input into the system dynamically rather than stored statically. For example, many application clients require user registration or login, and the username and password are typed on the keyboard. The message text in instant messaging software also contains privacy data. The registration forms of many applications, which include private information such as real name, gender, and home address, are dynamic privacy data sources too.

2.2 Theft procedure

We divide the process of privacy theft into two procedures: unauthorized data access and covert network transmission.

  • 1. Unauthorized data access

Unauthorized data access covers two kinds of illegal manipulation of a privacy data source. One is that an irrelevant process accesses the data source. The other is that a process manipulates the privacy data source beyond its permitted constraints.

  • 2. Covert network transmission

Covert network transmission may rely on different transport protocols; we focus on raw sockets, FTP, and HTTP because they are prevalent and easy to use.

2.3 Multi-process collaboration

In order to avoid detection from an AVS, multi-process malware adopts two or more processes to accomplish their malicious task. These processes work together collaboratively, and there are mainly three work modes as follows.

  • 1. Relay race mode

This kind of malware splits its main functionality flow into two or more parts and creates an independent process for each part. These processes work sequentially as a “relay race.” For example, Ramilli et al. in [2] adopt three processes to implement the function of the original malware BullMoose. The first one saves an exploited HTML page onto the local hard drive. The second one changes the Microsoft Windows registry key to make Internet Explorer the default browser program. The third one causes IExplorer.exe (the executable for Internet Explorer) to be opened with the exploited HTML page. Because each process implements only part of the malicious functionality, each behaves like benign software and does not trigger an alert from AVSs. Relay race mode can thus easily evade behavior detection that focuses on only a single process.

  • 2. Master slave mode

Master slave mode is also widely used by malware authors. In general, there are one master process and multiple slave processes. The master process is a daemon process that contains no malicious behavior and works stealthily, and the slave processes are designed for the actual malicious tasks. For example, the recent widely spread virus Alman works under the master slave mode. It extracts the master process “nvmini.sys” from its body, which is implemented as a Windows NT core driver (kernel mode). Then, the driver is launched as a service on each system startup. The virus injects the “linkinfo.dll” library code into the address space of the explorer.exe process as the main slave process. The master process creates and controls the slave processes and recreates slave processes when old ones are detected and terminated by a user or an AVS. Master slave mode lets malware parasitize the target host for a long time if the master process is not detected.

  • 3. Dual active mode

Dual active mode is another popular multi-process technique, used to increase the survivability of malware. For example, the “falling star” trojan, which works with two processes “Internet” and “systemtray”, runs under dual active mode. Dual active mode was originally used to protect vital processes in important industry systems such as bank or telecom business systems. Two or more processes with the same functionality work simultaneously and watch over each other; if any process is terminated, the other processes immediately take over the current work and restart the terminated process. Although dual active mode cannot help malware avoid detection by AVSs, it can significantly increase the difficulty of removal.

Next, it is necessary to show details of our PPN models; thus, we will discuss the definition, model building, and model analyzing in next section.

3 PRIVACY PETRI NET

Our analysis of privacy theft mainly depends on an abstract model we presented called PPN. PPN is a kind of high-level Petri Net that focuses on private information and has three main features. Firstly, PPN has a formal mathematical definition of syntax and semantics. This provides a precise specification of the target malware behavior and is the foundation for defining various behavior properties. Secondly, PPN has concise but powerful modeling primitives for graphical abstraction. Specific graph structures can be used to identify unique privacy theft behaviors. The graphical notations can also be assigned arbitrary functions, which we refer to as privacy functions, to calculate the variation of private information during the execution of the malware. Finally, PPN is modularized and can be used to build hierarchical models. These modules are abstract representations of different malicious functionalities, which may be accomplished by multiple collaborative malicious processes. We can use PPN to model different subtypes of privacy theft and to form complicated models. The formal definition of PPN is as follows.

3.1 Definitions of Privacy Petri Net

Definition 1. (Privacy Petri Net) PPN is a seven-tuple (P, T, A, B, K, AF, PF), where

  1. P is a finite set of places. Each place in P denotes a substatus of malware execution.
  2. T is a finite set of transitions such that P ∩ T = ∅. Each element in T denotes a system call or application programming interface (API) call.
  3. A ⊆ (P × T) ∪ (T × P) is a set of directed arcs.
  4. B is a finite set of non-empty privacy attribute sets. In this paper, we have four main privacy attributes: category, content, source, and destination.
  5. K is a finite set of tokens. Each token denotes an instance of privacy data source and has some privacy attributes in B. Each token exists in only one place at the same time.
  6. AF is the arc function set, which assigns an expression to each arc. Each expression can be used to change the attribute values of the token that passes along the arc.
  7. PF is the place function, which assigns to a place a source or discrimination property that denotes the role of the place in privacy theft. More details are given in Definitions 2 and 3.

Definition 2. (Privacy source place) A privacy source place is a place whose incoming transitions all denote access to a privacy data source. When these transitions occur, new tokens are created in the place and new privacy theft behaviors start.

Definition 3. (Theft discrimination place) A theft discrimination place denotes that the privacy data has finally been stolen and sent to a remote server. When a token arrives at a discrimination place, privacy theft is detected.

Definition 4. (PPN module) A PPN module is defined for the modularization of PPN, especially for multi-process malware modeling. It is a special sub-PPN that contains certain places. Each PPN module must have at least one privacy source place and one theft discrimination place to depict a complete privacy leak subprocedure. A complete PPN consists of one or more PPN modules. These modules can be used to depict not only the functionality of multiple processes but also the collaboration among them.

We have shown the basic elements that make up the PPN. Next, we discuss its basic work mode.

Definition 5. (Privacy attribute binding) Binding means mapping each free privacy attribute variable that appears in the arc expressions to a value.

Definition 6. (Transition occurring) A transition can occur when all the privacy attribute variables appearing in the privacy expressions of arcs connected to a transition are bound.

Then, we will give an important property of PPN.

Definition 7. (Theft reachability) Theft reachability is the property that there exists a finite sequence of transition occurrences by which a token can move from a privacy source place to a theft discrimination place.

With the reachability property, we can verify what private information is stolen by the target malware.

Theorem 1. If a token is reachable from a privacy source place to a theft discrimination place, there exists privacy theft behavior, which can be described by the token's privacy attributes.

Proof. Each transition corresponds to one system call or API call, and each place corresponds to one local status of the malware. Thus, a finite sequence of transition occurrences corresponds to a finite sequence of calls. A token moving from a source place to a discrimination place, which denotes that data are sent, therefore corresponds to the target malware accessing unauthorized source data and building a covert network transmission. This privacy theft behavior is described by the variation of the token's privacy attributes during its movement from the source place to the discrimination place. Therefore, Theorem 1 holds.
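
As an illustration, the definitions above can be sketched in Python. This is a minimal sketch under simplifying assumptions and not necessarily the implementation used in the experiments: the class names `Token` and `PPN`, the helper `move_token`, and the breadth-first search are hypothetical, arc functions are keyed by transition name rather than by arc, and binding is reduced to calling a plain function on the token's attribute dictionary. The example net is a linear reading of the normal file access chain (p1, p2, p3, p7) discussed in Section 3.2.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Token:
    """One instance of a privacy data source (an element of K)."""
    attrs: dict = field(default_factory=dict)  # e.g. category, content, source, destination


@dataclass
class PPN:
    """Simplified Privacy Petri Net (P, T, A, B, K, AF, PF)."""
    places: set
    transitions: set
    arcs: set                    # subset of (P x T) | (T x P)
    source_places: set           # PF: places with the source property
    discrimination_places: set   # PF: places with the discrimination property
    arc_functions: dict = field(default_factory=dict)  # AF, keyed by transition (assumption)

    def __post_init__(self):
        if self.places & self.transitions:
            raise ValueError("P and T must be disjoint")

    def successors(self, place):
        """Yield (transition, next_place) pairs reachable from `place` in one step."""
        for src, t in self.arcs:
            if src != place or t not in self.transitions:
                continue
            for t2, dst in self.arcs:
                if t2 == t and dst in self.places:
                    yield t, dst

    def theft_path(self, start):
        """Return the transitions fired on a path to a discrimination place, or None."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            place, fired = queue.popleft()
            if place in self.discrimination_places:
                return fired
            for t, nxt in self.successors(place):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, fired + [t]))
        return None


def move_token(net, token, start):
    """Move a token along a theft path, applying arc functions; True if theft is reachable."""
    fired = net.theft_path(start)
    if fired is None:
        return False
    for t in fired:
        update = net.arc_functions.get(t)
        if update is not None:
            update(token.attrs)
    return True


# Hypothetical linear net: p1 -> p2 -> p3 -> p7 via three system calls.
calls = ["NtQueryDirectoryFile", "NtOpenFile", "NtReadFile"]
chain = ["p1", "p2", "p3", "p7"]
arcs = set()
for i, call in enumerate(calls):
    arcs.add((chain[i], call))
    arcs.add((call, chain[i + 1]))

net = PPN(
    places=set(chain),
    transitions=set(calls),
    arcs=arcs,
    source_places={"p1"},
    discrimination_places={"p7"},
    arc_functions={"NtOpenFile": lambda attrs: attrs.update(category="normal file data")},
)
token = Token()
stolen = move_token(net, token, "p1")
```

Here `stolen` is true because a firing sequence leads from p1 to p7, which is the situation Theorem 1 interprets as privacy theft; the token's attributes record what the arc functions learned along the way.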

3.2 Modeling privacy theft with PPN

The process for generating PPN modules from trace data is briefly shown in Algorithm 1.

[Algorithm 1: generating PPN modules from trace data]

To build a PPN that models a selected privacy theft procedure, we first choose appropriate substatuses of malware execution to form the basic place set P and define the attribute set of each token in B. Then, we choose typical system calls or API calls as the transition set T to connect the related places and add all the connecting arcs to the arc set A. Finally, we define the function sets PF and AF in order to assign a privacy function to each arc in A.

Next, we build PPN modules for each of the privacy theft procedures that we mentioned in Section 2. Here, we mainly discuss modeling on the Windows platform because of its prevalence.

  • 1. PPN modules for unauthorized data access

In Figure 1, when the malware begins the normal file data access procedure, a token with empty attributes is spawned in the start position “p1: file data access started.” The malware must first check the possible directories to find the file with system calls such as “NtQueryDirectoryFile,” which triggers the corresponding transitions. As a result, the token passes through these transitions and moves into the position “p2: dir checked” with new attributes that are updated by the arc functions. Similarly, the token moves to the position p3 when the malware obtains the file handle with “NtCreateFile” or “NtOpenFile.” Finally, the malware obtains the file properties or reads the file content with “NtFsControlFile” or “NtReadFile,” and the token arrives at the discrimination position “p7: file read.” By Theorem 1, if a token reaches the discrimination position p7, we can determine that the malware actually reads the file data.

Application-related data includes many categories, and almost every frequently used application has its own privacy-related data set. In Figure 2, we take web browsers based on the IE kernel as an example. A browser contains a lot of private data: the page history can be stolen through the “EnumUrls” method of the “IUrlHistoryStg” object, favorite pages can be obtained through “ImportExportFavorites” of the “IShellUIHelper” object, the typed URL history can be read from the registry key “Software\\Microsoft\\Internet Explorer\\TypedURLs,” website cookies can be stolen through the “InternetGetCookie” API, and even protected cookies may be obtained through “IEGetProtectedModeCookie.”

System-related data covers many basic system statuses and configurations, such as basic computer info, network info, and user info. In Figure 3, we list only frequently used calls, such as “GetWindowsDirectory” or “GetVersion” for basic computer info, “GetAdaptersInfo” or “GetNetworkParams” for local network info, and “NetUserEnum” for user info.

For dynamic data access, we mainly consider keyboard and mouse input as an example in Figure 4. There are two methods to obtain keyboard and mouse input. One uses a hook function: first set the hook with “SetWindowsHook,” then use the keyboard or mouse callback function, and finally release the hook. The other first retrieves the window handle with “GetActiveWindow,” then obtains keystrokes with “GetKeyState” or “GetAsyncKeyState,” and gets the mouse state with “GetMouseMovePoints.”

  • 2. PPN modules for covert network transmission

As shown in Figure 5, to build a socket connection, the malware must first create a socket with “socket,” then build the server side with “listen” or the client side with “connect,” next transfer privacy data with “send” or “sendto,” receive remote data with “recv” or “recvfrom,” and finally finish the transfer with “closesocket.”

An HTTP connection can be implemented with two function libraries: WinINet and WinHTTP. WinINet is widely used for building clients, whereas WinHTTP targets server-side and service scenarios. In Figure 6, to build an HTTP connection and steal privacy data, the malware first needs to create an Internet handle with “InternetOpen,” then set up a connection to the remote HTTP server with “InternetConnect,” then create an HTTP request with “HttpOpenRequest” and add HTTP headers with “HttpAddRequestHeaders,” and finally send the data by posting the HTTP request with “HttpSendRequest” or by writing directly to an Internet file with “InternetWriteFile.”

An FTP connection is used to transfer privacy data in files and is especially suitable for a large number of files or for large files. In the Windows API, the FTP service is also implemented in the WinINet library, and the theft procedure is similar to that of an HTTP connection. In Figure 7, an Internet handle and a connection need to be created first. Next, the working directory for the file transfer needs to be set with “FtpSetCurrentDirectory” or “FtpCreateDirectory.” Then, privacy data are sent as files with “FtpPutFile” or written to an FTP file handle with “FtpOpenFile” and “InternetWriteFile.”

  • 3. Multi-process collaboration among PPN modules

As we mentioned in Section 2.3, multi-process malware adopts three work modes to accomplish its malicious task while avoiding detection by AVSs: relay race mode, master slave mode, and dual active mode. These work modes describe the collaboration among multiple processes and can be depicted with the PPN modules we built earlier.

Figure 1.

PPN module for normal file data access.

Figure 2.

PPN module for application-related data source.

Figure 3.

PPN module for system-related data access.

Figure 4.

PPN module for dynamic data access.

Figure 5.

PPN module for socket connection.

Figure 6.

PPN module for HTTP connection.

Figure 7.

PPN module for FTP connection.

In Figure 8, the relay race mode contains two or more collaborative processes (Process 1, Process 2, …, Process N). Each process completes one component of the whole privacy theft functionality, which is modeled by the PPN modules (PPN module 1, PPN module 2, …, PPN module N). Each module outputs its intermediate privacy data from its discrimination position to the source position of the next module. The chain of processes forms a “relay race” of privacy data transport. This work mode can avoid detection by AVSs that focus on only a single process, but it also requires delicate process organization and synchronization. The whole privacy theft functionality fails if any one of these processes goes wrong.

Figure 8.

Relay race mode of multi-process collaboration.
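
The hand-over between modules in relay race mode can be illustrated with a short Python sketch. It rests on assumptions that are not part of the original model: each module is reduced to an ordered tuple of call names, a module "completes" when those calls appear in order in the trace of its process, and the function names `completes` and `relay_race` as well as the process labels are hypothetical. Timing between processes is ignored.

```python
def completes(module_calls, trace):
    """True if module_calls occur in `trace` in the given order (as a subsequence)."""
    remaining = iter(trace)
    # `call in remaining` consumes the iterator up to the first match,
    # so later calls must appear after earlier ones.
    return all(call in remaining for call in module_calls)


def relay_race(modules, traces):
    """Pass one token through the chain of modules.

    modules: list of (process_name, ordered_calls) pairs, in relay order.
    traces:  dict mapping process_name to its recorded call sequence.
    Returns the token if every module completes, otherwise None.
    """
    token = {"path": []}
    for process, module_calls in modules:
        if not completes(module_calls, traces.get(process, [])):
            return None  # the relay breaks if any one process fails
        # The token leaves this module's discrimination position and
        # enters the source position of the next module.
        token["path"].append(process)
    return token


# Hypothetical two-stage relay: file access, then HTTP transmission.
modules = [
    ("process 1", ("NtOpenFile", "NtReadFile")),
    ("process 2", ("InternetOpen", "HttpOpenRequest", "HttpSendRequest")),
]
traces = {
    "process 1": ["NtOpenFile", "NtReadFile", "NtClose"],
    "process 2": ["InternetOpen", "HttpOpenRequest",
                  "HttpAddRequestHeaders", "HttpSendRequest"],
}

full_view = relay_race(modules, traces)
single_process_view = relay_race(modules, {"process 1": traces["process 1"]})
```

With both traces available the token travels through the whole chain, whereas the view restricted to a single process yields `None`, which mirrors why detection that looks at one process at a time misses this mode.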

Figure 9 shows the basic architecture of the master slave mode: one master process creates one or more slave processes and periodically checks their status. The master process contains no malicious behavior and is hard for AVSs to detect. The slave processes are responsible for the actual privacy theft behavior and can be “revived” by the master process when they are terminated. Therefore, the master slave mode has higher survivability than the relay race mode. However, the master process needs extra implementation effort and becomes the weak point of the whole architecture.

Figure 9.

Master slave mode of multi-process collaboration.

Dual active mode in Figure 10 consists of two or more mirror processes with the same functionality. These mirror processes watch over each other and periodically check the status of their “brother” processes, as in the master slave mode. If any of them is terminated accidentally, a new mirror process is created at once to keep the total number of processes constant. Dual active mode significantly enhances the survivability of the set of malware processes, but it also increases the development complexity of the malware because of the extra process synchronization and data redundancy.

Figure 10.

Dual active mode of multi-process collaboration.

3.3 Analyzing privacy theft with PPN

After building the PPN model, we can analyze privacy theft behavior with it. The detail of our analyzing algorithm is as follows:


First, we intercept and log the system-call sequence during the execution of the target malware. The system calls in the sequence are filtered by the data source types and connection types in module set M and turned into the call sequence S. Next, the first call is extracted, and if the corresponding transition of this call is found, its arc, which connects to the next place, is bound. Then, if the next place is a source place, a new token is created in it and inserted into the set L of candidate theft tokens. This process loops until all the calls in S are handled. Finally, all the tokens in L are traversed to verify their reachability to the discrimination places, and the final privacy theft result is generated. According to Theorem 1, the privacy attributes of each token represent the details of the category, content, destination, theft procedure, and multi-process collaboration of a certain privacy theft behavior.
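
The steps just described can be sketched in Python as follows. This is an illustrative simplification rather than the analyzer used in the experiments: the place names, the `TRANSITIONS` table, the `ATTRIBUTE_OF` mapping that stands in for arc functions, and the log record format (a dictionary with `name` and `arg`) are all assumptions made for the example, and the file names and host in the sample log are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    place: str
    attrs: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


# Hypothetical module set: call name -> (place required, place reached).
# A required place of None means the call creates a token in a source place.
TRANSITIONS = {
    "NtOpenFile": (None, "file_opened"),
    "NtReadFile": ("file_opened", "data_read"),
    "InternetOpen": ("data_read", "net_opened"),
    "InternetConnect": ("net_opened", "connected"),
    "HttpOpenRequest": ("connected", "request_built"),
    "HttpSendRequest": ("request_built", "data_sent"),
}
SOURCE_PLACES = {"file_opened"}
DISCRIMINATION_PLACES = {"data_sent"}

# Stand-in for arc functions: which privacy attribute a call's argument fills.
ATTRIBUTE_OF = {
    "NtOpenFile": "source",
    "NtReadFile": "content",
    "InternetConnect": "destination",
}


def analyze(call_log):
    """Return the tokens that reach a theft discrimination place."""
    # Filter the raw log down to the calls known to the module set (sequence S).
    sequence = [call for call in call_log if call["name"] in TRANSITIONS]
    candidates = []  # the candidate token set L
    for call in sequence:
        name, arg = call["name"], call.get("arg")
        before, after = TRANSITIONS[name]
        if after in SOURCE_PLACES:
            # Entering a source place creates a new candidate token.
            token = Token(place=after, history=[name])
            if name in ATTRIBUTE_OF:
                token.attrs[ATTRIBUTE_OF[name]] = arg
            candidates.append(token)
            continue
        for token in candidates:
            if token.place == before:
                token.place = after
                token.history.append(name)
                if name in ATTRIBUTE_OF:
                    token.attrs[ATTRIBUTE_OF[name]] = arg
    # Reachability check: keep tokens sitting in a discrimination place.
    return [t for t in candidates if t.place in DISCRIMINATION_PLACES]


sample_log = [
    {"name": "NtOpenFile", "arg": "notes.txt"},
    {"name": "GetTickCount"},                      # unrelated call, filtered out
    {"name": "NtReadFile", "arg": "secret text"},
    {"name": "InternetOpen"},
    {"name": "InternetConnect", "arg": "example.com"},
    {"name": "HttpOpenRequest"},
    {"name": "HttpSendRequest"},
]
thefts = analyze(sample_log)
```

Each token returned by `analyze` carries the attributes gathered along its path, so the source, content, and destination of the theft can be read directly from `thefts[0].attrs`. A log that stops before the send call leaves its token short of the discrimination place and produces an empty result.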

4 EXPERIMENT

To understand the details of privacy theft and answer the four questions we presented, we collect real-world malware to apply our approach based on PPN. To the best of our knowledge, there are no existing works that focus on privacy theft malware detection; thus, we do not introduce comparative analysis with other approaches. We first give our framework of detection approach. Next, we introduce our experiment environment and malware set, then show some case studies of privacy theft behaviors with different multi-process collaboration, and last discuss the experiment result.

4.1 Approach framework

As shown in Figure 11, our framework includes three main components: the runtime execution tracer, the network traffic sniffer, and the privacy theft analyzer. The application instance is run in the virtual OS. First, the logs of both the execution trace and the network traffic are captured by the runtime execution tracer and the network traffic sniffer. Then, the gathered logs are sent to the analyzer, which builds our PPN model and analyzes the details of the privacy theft behaviors.

  • 1. Runtime execution tracer

The runtime execution tracer mainly focuses on the system calls and API calls of the program instance. We use hook functions to intercept the important calls that are related to the privacy theft behavior of malware. To obtain more precise information and build a more accurate model, we retrieve not only the call name but also the parameters and return values. The sequence of such call details is recorded in log files.

  • 2. Network traffic sniffer

The network traffic sniffer mainly helps us observe outbound connections and communication content. Because monitoring network traffic at the system call level is not easy, we use the sniffer as a complementary component to capture more complete privacy theft behaviors and to answer where and how the privacy information is sent.

  • 3. Privacy theft analyzer

This component is the core of our framework. We use the logs of the other two components as the basic material for building the PPN model. Firstly, the logs are filtered by data source type and connection type. Secondly, we build the sub-PPN model for each subprocedure. Thirdly, we feed the log data into the model and check the privacy theft details of the target malware. Finally, we output the category, content, procedure, destination, and multi-process collaboration of the privacy leak behaviors.
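
The first step of the analyzer, filtering the logs by data source type and connection type, might look like the following Python sketch. The call names are taken from the modules in Section 3.2, but the table is deliberately partial, and the record format (pairs of process ID and call name), the function name `group_calls`, and the process IDs are assumptions for illustration only.

```python
from collections import defaultdict

# Partial, illustrative mapping from call names to the data source or
# connection type of the PPN module they belong to.
CALL_CATEGORY = {
    "NtQueryDirectoryFile": "normal file data",
    "NtCreateFile": "normal file data",
    "NtOpenFile": "normal file data",
    "NtReadFile": "normal file data",
    "InternetGetCookie": "application-related data",
    "GetAdaptersInfo": "system-related data",
    "NetUserEnum": "system-related data",
    "SetWindowsHook": "dynamic data",
    "GetAsyncKeyState": "dynamic data",
    "send": "socket connection",
    "sendto": "socket connection",
    "HttpSendRequest": "HTTP connection",
    "FtpPutFile": "FTP connection",
}


def group_calls(records):
    """Group (pid, call_name) records by category, dropping unrelated calls."""
    groups = defaultdict(list)
    for pid, name in records:
        category = CALL_CATEGORY.get(name)
        if category is not None:
            groups[category].append((pid, name))
    return dict(groups)


# Hypothetical log from three processes.
records = [
    (101, "InternetGetCookie"),
    (102, "SetWindowsHook"),
    (101, "GetTickCount"),   # not related to privacy theft, dropped
    (103, "FtpPutFile"),
]
groups = group_calls(records)
```

Keeping the process ID next to each call preserves the information needed later to see that data access and transmission were carried out by different processes.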

Figure 11.

Detection approach framework.

4.2 Experiment setting

The experiment environment follows the framework presented in Section 4.1. We use VMware Workstation 7.0 (VMware Inc., Palo Alto, CA, USA) as the virtual machine platform to build the Windows OS image. The functionality of the runtime execution tracer is undertaken by API monitor2 r9 (Rohitab Batra, North Attleboro, MA, USA, http://www.rohitab.com/apimonitor). Wireshark 1.6.4 (Gerald Combs et al., Sacramento, California, USA, http://www.wireshark.org/) is used as the network traffic sniffer to capture the details of the privacy theft destination. The core algorithm of the PPN model is implemented in Python 3.1 (http://www.python.org/). All the malware samples are installed and tested in different snapshots of the virtual host to avoid mutual interference.

We choose the malware sample set from www.vxheavens.org, which has collected more than two million malware samples. These samples contain prevalent trojans, worms, viruses, and other kinds of malware such as backdoors, exploiters, rootkits, and hacktools. Because the trojan family is the largest and trojans are the most likely to steal privacy data, the trojan samples are further divided into trojan-gamethief, trojan-banker, trojan-psw, and others.

4.3 Case study

Next, we will choose one typical malware sample for each kind of multi-process collaboration work mode and give more detail about its privacy theft behavior.

  • 1. Relay race mode

M1 is a variant of the Nilage family. This malware is a trojan-gamethief designed to collect private information such as online-game accounts and passwords from local files. The two relay processes accomplish the file access and the privacy data transfer, respectively. Figure 12 lists the call sequence fragments, and Figure 13 shows the PPN model of M1.

According to the call sequence “NtOpenFile, NtReadFile, InternetOpen, HttpOpenRequest, HttpAddRequestHeaders, HttpSendRequest,” we find that M1 sends sensitive local file content to a remote HTTP server. The stolen files include the files in which different applications store usernames and passwords. Two processes of this malware complete the whole leak procedure under the relay race mode. Process 1 opens the local file and reads the file content, and then process 2 builds an HTTP connection to send the file content to a remote server.

  • 2. Master slave mode

M2 is a variant of the e-mail-worm Badtrans family. This malware has one master process that manages its slave processes, two of which collect the website cookies and the keystroke log. The call sequence fragments are shown in Figure 14.

Then, we build the PPN model as shown in Figure 15. According to the call sequence “InternetGetCookie, SetWindowsHook, InternetOpenW, FtpCreateDirectory, FtpPutFile,” we find that M2 posts the cookies of some websites and the keystroke log to a remote server in the background over FTP. The leak procedure is undertaken by four processes under master slave mode. Two slave processes collect the user keystrokes and the website cookies, respectively, and a third slave process packs the privacy data into a single file and sends it to a remote FTP server. One master process is responsible for managing the slave processes: it creates them, checks their status, and restarts any that crash.
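To make the token-flow idea concrete, the following is a minimal place/transition net sketch for the M2 call sequence. It is not the PPN of Figure 15: the place names, the single-token arcs, and the helper functions are assumptions chosen for illustration, and only the API names come from the text. The two collection branches must both deliver a token before the upload transition can fire, which mirrors the slave that packs both kinds of data into one file.

```python
class PetriNet:
    """Tiny place/transition net with arc weight 1: a transition is enabled
    when every one of its input places holds at least one token."""

    def __init__(self, marking):
        self.marking = dict(marking)   # place -> token count
        self.transitions = {}          # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        """Fire `name` if it exists and is enabled; return True on success."""
        if name not in self.transitions or not self.enabled(name):
            return False
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1
        return True


# Call sequence quoted for M2.
M2_CALLS = ["InternetGetCookie", "SetWindowsHook", "InternetOpenW",
            "FtpCreateDirectory", "FtpPutFile"]


def build_m2_net():
    """Hypothetical net: place names are illustrative, not from Figure 15."""
    net = PetriNet({"cookie_store": 1, "keyboard": 1, "network": 1})
    net.add_transition("InternetGetCookie", ["cookie_store"], ["cookie_data"])
    net.add_transition("SetWindowsHook", ["keyboard"], ["keystroke_data"])
    net.add_transition("InternetOpenW", ["network"], ["session"])
    net.add_transition("FtpCreateDirectory", ["session"], ["remote_dir"])
    # The upload needs both kinds of collected data and the remote directory.
    net.add_transition("FtpPutFile",
                       ["cookie_data", "keystroke_data", "remote_dir"],
                       ["leaked"])
    return net


def replay(net, calls):
    """Replay observed API calls; calls that are unknown or not enabled are
    ignored. Returns True when a token reaches the 'leaked' place."""
    for call in calls:
        net.fire(call)
    return net.marking.get("leaked", 0) > 0


print(replay(build_m2_net(), M2_CALLS))
```

Because firing depends only on the marking, the replay does not care which process issued each call, so the same net covers the calls spread over the slave processes.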

  • 3. Dual active mode

M3 is a variant of the Safesurf family. This malware is a trojan that hides itself and steals the system configuration and user information in order to help other malware. Figure 16 shows the call sequence fragments.

Figure 12.

The call sequence fragment of M1. (a) M1 creates a file handle with “NtOpenFile” in Figure 1. (b) M1 reads the local file with “NtReadFile” in Figure 1. (c) M1 creates the Internet handle with “InternetOpen” in Figure 6. (d) M1 creates an HTTP request with “HttpOpenRequest” and adds the stolen data to the request headers with “HttpAddRequestHeaders” in Figure 6. (e) M1 posts the stolen data to the web server with “HttpSendRequest” in Figure 6.

Figure 13.

PPN for M1's behavior.

Figure 14.

The call sequence fragment of M2.

Figure 15.

PPN for M2's behavior.

Figure 16.

The call sequence fragment of M3. (a) M3 gets the machine name with “GetComputerName” in Figure 3. (b) M3 gets the user name with “GetUserName” in Figure 3. (c) M3 builds a socket handle with “Socket” and “Bind,” then connects to the remote server with “Connect” in Figure 5. (d) M3 sends the stolen data to the remote server with “Send” in Figure 5.

We build the PPN model as shown in Figure 17.

Figure 17.

PPN for M3's behavior.

According to the call sequence “GetComputerName, GetUserName, Socket, Bind, Connect, Send,” we find that M3 collects the machine name and the user name and sends them to its server over a background socket connection. Two mirror processes with identical functionality work under dual active mode. Each process contains two modules, which collect the system privacy data and send it to the remote socket server, respectively. The two processes watch over each other: once one of them is terminated by the user or by AVS, the other immediately recreates it.
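The mutual watching that characterizes dual active mode can be sketched as a check over process lifecycle events. The event format (timestamp, kind, actor, target), the process names, and the heuristic itself are assumptions for illustration and are not part of the PPN model. The sketch flags a pair only when each process has been seen recreating the other, which separates it from the one-way restarts of master slave mode.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical event format: (timestamp, kind, actor, target), where kind is
# "terminate" (the actor may be a user or AVS) or "create" (actor spawns target).
Event = Tuple[int, str, str, str]


def recreations(events: List[Event]) -> Set[Tuple[str, str]]:
    """Pairs (watcher, watched): watcher created watched after watched had
    been terminated earlier in the trace."""
    terminated = set()
    pairs = set()
    for _, kind, actor, target in sorted(events):
        if kind == "terminate":
            terminated.add(target)
        elif kind == "create" and target in terminated:
            pairs.add((actor, target))
            terminated.discard(target)
    return pairs


def dual_active_pairs(events: List[Event]) -> Set[FrozenSet[str]]:
    """Process pairs in which each side has recreated the other."""
    pairs = recreations(events)
    return {frozenset((a, b)) for a, b in pairs if a != b and (b, a) in pairs}


# Illustrative trace: p1 and p2 revive each other after being killed.
dual_trace = [
    (1, "create", "dropper", "p1"),
    (2, "create", "p1", "p2"),
    (3, "terminate", "user", "p1"),
    (4, "create", "p2", "p1"),
    (5, "terminate", "AVS", "p2"),
    (6, "create", "p1", "p2"),
]
print(dual_active_pairs(dual_trace))
```

Requiring both directions is a deliberately strict choice: a trace in which only one process was ever terminated would need a looser rule.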

4.4 Result overview

We pick 160 malware samples that, according to the AVS descriptions, contain at least one kind of data privacy theft behavior. For each work mode and each malware type, we count how many samples are detected by our PPN-based approach. For instance, 8/9 in the top-left cell of Table 1 means that we detect eight of the nine trojan samples that use the relay race work mode.

As shown in Table 1, we achieve a detection rate of about 87% (139/160) over the whole sample set. Our approach can effectively detect privacy theft behavior across malware categories including trojan, worm, hacktool, backdoor, and others, and it can handle each of the multi-process collaborative work modes we presented.

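Table 1. Privacy theft behavior detection result.

|          | Relay race | Master slave | Dual active | Sum     |
|----------|------------|--------------|-------------|---------|
| Trojan   | 8/9        | 24/26        | 30/35       | 62/70   |
| Worm     | 13/14      | 8/10         | 12/14       | 33/38   |
| Hacktool |            | 11/12        | 8/9         | 19/21   |
| Backdoor |            | 2/2          | 6/7         | 8/9     |
| Others   | 1/1        | 5/6          | 11/15       | 17/22   |
| Sum      | 22/24      | 50/56        | 67/80       | 139/160 |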
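The row and column sums of Table 1 can be recomputed from the individual cells. The following is a minimal sketch; the dictionary layout and function names are our own illustration, and empty cells are simply omitted.

```python
# (detected, total) per cell, copied from Table 1; empty cells are omitted.
TABLE_1 = {
    "Trojan":   {"Relay race": (8, 9),   "Master slave": (24, 26), "Dual active": (30, 35)},
    "Worm":     {"Relay race": (13, 14), "Master slave": (8, 10),  "Dual active": (12, 14)},
    "Hacktool": {"Master slave": (11, 12), "Dual active": (8, 9)},
    "Backdoor": {"Master slave": (2, 2),   "Dual active": (6, 7)},
    "Others":   {"Relay race": (1, 1),   "Master slave": (5, 6),   "Dual active": (11, 15)},
}


def totals(cells):
    """Sum a collection of (detected, total) pairs."""
    cells = list(cells)
    return sum(d for d, _ in cells), sum(t for _, t in cells)


def by_category(table):
    """Row sums: one (detected, total) pair per malware category."""
    return {cat: totals(row.values()) for cat, row in table.items()}


def by_mode(table):
    """Column sums: one (detected, total) pair per work mode."""
    modes = {}
    for row in table.values():
        for mode, cell in row.items():
            modes.setdefault(mode, []).append(cell)
    return {mode: totals(cells) for mode, cells in modes.items()}


def overall(table):
    """Grand total over every cell."""
    return totals(cell for row in table.values() for cell in row.values())


for mode, (d, t) in by_mode(TABLE_1).items():
    print(f"{mode}: {d}/{t} = {d / t:.1%}")
d, t = overall(TABLE_1)
print(f"Overall: {d}/{t} = {d / t:.1%}")
```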

We also found other useful information as follows:

  1. Trojan samples with privacy theft behavior are the most numerous among all the malware categories (70 of the 160 samples) because of the characteristics of trojans. A trojan is designed to gain control of the target host from inside and then collect useful information. In particular, dedicated trojans for stealing bank accounts or online-game accounts have formed a main branch of the trojan family tree. Worms sometimes steal privacy data in order to spread quickly and widely; for instance, an e-mail-worm steals the contact list to forward copies of itself. Backdoors tend to steal application-associated and system-associated privacy data because they usually exploit a vulnerability of an application or system to set up a covert entrance on a target host. Hacktools are similar to backdoors and steal more system-associated privacy data.
  2. Among the three multi-process work modes, dual active mode is the most prevalent (80/160) because of its high survivability and ease of use. It can even cooperate with other work modes. Master slave mode (56/160) is also widely used by malware authors for its stealthy master process and recoverable slave processes. Relay race mode appears less often (24/160) because it is harder to use and has lower survivability, but it can still be found in some worms that first inject part of their code and then download the remaining part of the payload.

5 RELATED WORK

In this section, we first discuss multi-process malware. We then review the research closest to this paper, which uses Petri Net as the abstract model. We also introduce work that uses other graphic and non-graphic models for software behavior modeling, especially malicious behavior modeling.

5.1 Multi-process malware

Ramilli et al. [2] presented an attack approach based on multi-process malware. They described a technique for evading detection by distributing the malware over multiple processes. Their attack approach requires two steps. The first step is to place the malware components onto the system in such a way that each component can be executed to create a process that coordinates with one or more of the other component processes. The second step is to run each component individually. The combined actions of these processes are equivalent to those of the single malware. The multi-process technique originated from the multi-stage technique used in some worms and viruses such as dichotomy and RMNS. Some previous works also tried to help analysts understand how multi-stage attacks work and spread [14, 15, 17].

5.2 Petri Net

To the best of our knowledge, we are the first to analyze privacy leak behaviors with Petri Net. However, there is much work on analyzing other behaviors of programs or systems with Petri Net. Wang et al. first presented the Stochastic Game Nets model and applied it to the competitive game analysis of network behavior [18, 19]. They further addressed the modeling and quantitative analysis of competitive game behaviors based on Stochastic Game Nets [20-23]. Gao et al. [24] judged trojan-like features of software using Stochastic Petri Nets and supported quantitative analysis of the behaviors of the target software. Tokhtabayev et al. [25] tried to find inter-process and intra-process malicious functionalities in software behaviors. The functionalities of interest were defined in the abstract system domain through activity diagrams, and the specified functionality was recognized by Colored Petri Net. They also built behavior-based intrusion detection systems on this approach to offer an effective solution against modern malware [26]. Liu et al. [27] combined behavior monitors with Colored Petri Net to detect viruses and worms. The malicious behavior was represented as a Petri Net, and the notions of initial states and final states were used to define matching in this model. Ho et al. [28] proposed an intrusion detection architecture combining partial order planning and executable Petri Nets to detect intrusions with multiple sources and intrusions where only incomplete behavioral data is available. They presented Partial Order State Transition Analysis to increase the flexibility of the traditional state analysis approach by allowing unordered events in the signature action sequence.

5.3 Other graphic models

Petri Net is not the only graphic model suited to analyzing behavior data; other graphic models such as the control flow graph, behavior graph, and hierarchical behavior graph are also used to analyze program behavior. Bruschi et al. [29] argued that next-generation malware will be characterized by the intense use of polymorphic and metamorphic techniques. They proposed a strategy for detecting metamorphic malicious code inside a program P by comparing the control flow graphs of P against the set of control flow graphs of known malware. Christodorescu et al. [30] defined a new graph representation of program behavior and a mining algorithm that constructs a malicious specification. Their algorithm inferred the system-call graphs from execution traces and derived a specification by computing the minimal differences between the system-call graphs of malicious and benign programs. Fredrikson et al. [13] implemented HOLMES to extract data dependence graphs and distinguish malware from benign applications based on graph mining and concept analysis techniques. Martignoni et al. [31] addressed the semantic gap problem in behavioral monitoring by using hierarchical behavior graphs to infer high-level behaviors from myriad low-level events. Johnson et al. [32] proposed a differential slicing approach that automates the analysis of two runs of the same program that exhibit a difference in program state or output. Its output is a causal difference graph that captures the input differences that triggered the observed difference.

5.4 Other formal models

Some formal models without a graphic representation were also used to abstract application behaviors from system calls. Christodorescu et al. [33] described malicious behavior by templates and presented a malware detection algorithm that incorporates instruction semantics to detect malicious program traits. Jacob et al. [34] defined a generic approach for behavioral detection based on two layers. The abstraction layer is specific to a platform and a language. It interprets the collected instructions, API calls, and arguments, and classifies these operations. The detection layer relies on parallel automata parsing attribute-grammars where semantic rules are used for object typing (object classification) and object binding (data flow). Kinder et al. [35] introduced the specification language Computation Tree Predicate Logic, which extends the well-known logic Computation Tree Logic, and described an efficient model checking algorithm. Lanzi et al. [36] proposed a system-centric view to model the activity of benign programs. They argued that benign programs in general follow certain patterns in how they use OS resources (such as the file system and the registry).

6 CONCLUSION

In this paper, we updated PPN from our previous work [16] to address the new challenge of multi-process malware. We built PPN modules for different privacy theft subprocedures to analyze privacy theft behavior in detail. With these modules, we modeled the three main multi-process collaboration modes adopted by new malware: relay race mode, master slave mode, and dual active mode.

We applied our approach to real-world malware, and the result shows that it can improve detection of multi-process privacy theft malware. We achieve a detection rate of about 87% (139/160) over the whole malware sample set. In the selected case studies, we present a typical example of multi-process collaboration for each of the three work modes. We also describe the category, content, destination, and procedure of the privacy theft behavior in detail.

In future work, we will consider privacy theft behavior analysis and detection for more kinds of widespread malware, especially malware that hijacks and injects into benign software. We will also extend our PPN model and detection approach to more platforms such as Linux, Mac, and smart phone OS.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (nos. 61173008 and 60933005), Projects of Development Plan of the State High Technology Research (no. 2012AA011003), and National Science Supported Planning (no. 2012BAH39B02).
