ML Risk Score

Last updated 1 year ago

1. Motivation

Money laundering via crypto assets is a growing problem. Whether you are a business that transacts in crypto assets or an individual holding them in an investment portfolio, the likelihood of being exposed to "dirty" assets has grown significantly as the velocity of crypto trading increases. The consequences of being linked to dirty funds are hard to overstate and would merit a blog post of their own. In this article, we introduce an ML (Money Laundering) Risk Score, which we developed in-house and plan to open-source. Other organizations have already made good attempts at building similar scores. By learning from them, we reached an internal consensus that an actionable ML risk score should have the three characteristics listed below:

explainable: the computation logic of this score should be transparent and understandable. To approximate this, we use a linear combination of determining factors as the core of the computation rather than black-box machine learning models. To stay flexible enough to model more complicated logic while remaining explainable, we designed three components that accommodate different lines of reasoning. We discuss them in detail in the next section.

consistent with business logic: the interpretation of this score should align with human intuition, where a larger numerical value means a higher risk of money laundering. A sanctioned address should be at the extreme risk level, an address with a significant link to coin mixers should be on the high-risk end of the spectrum, and so on. The risk level should also be determined by a comprehensive set of factors: the nature of the business controlling a crypto address, its transaction history, risky patterns revealed in the network of past transactions, etc.

extensible: the understanding of ML risk in crypto assets is still at an early stage. We aim to build a score that can stand the test of time and support real applications in the long run. We therefore made two decisions: 1) following the software engineering design principle of separation of concerns (SoC), the ML risk score is a linear combination of three components (discussed in the next section). To improve the accuracy of the score with existing factors, we can fine-tune the weights of existing parameters. To account for a previously unseen risk factor, we can either add it as a risky event in the risk adjustment component or alter how the semantic view is generated, wherever it fits best; 2) the computation applies equally to a single address or to a cluster of addresses belonging to the same entity. Because the score is transparent and well structured, it can also be combined with KYC data to build a holistic Customer Risk Rating from both off-chain and on-chain data.

2. How to Compute ML Risk Score

2.1 Design Philosophy

In the first version of the ML risk score, our ultimate goal is to make the score genuinely useful in real-life applications. We therefore drew on industry-wide consensus to design the first set of evaluation criteria guiding the development of the score. They are:

1) Amount matters: a larger transaction value with a particular category of crypto address has more influence on the risk score.

Influence: 1 million USD > 1000 USD

2) Direction matters: sending exposure is more influential than receiving exposure, because we consider sending to be the result of active intent, whereas receiving can be passive.

Influence: Sending exposure > Receiving exposure

3) Direct vs. indirect matters: engaging with a counterparty directly (in graph topology, we call a direct connection 1-hop) is more influential than indirect interaction (linked through 2 or more hops). For directly linked entities, it is quite unlikely that the owner of the target address does not know the counterparty. For indirectly linked entities, however, the owner of the target address may well be unaware of distant entities.

Influence: Direct exposure > Indirect exposure
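Criteria 2 and 3 can be read as ordering constraints on whatever weights the score uses. The sketch below is one hypothetical way to check a weight table against them; the table layout, the category name "mixer", the numeric values, and the function name `respects_design_criteria` are all illustrative assumptions, not part of the published score. Criterion 1 is not checked here because it follows from the exposure amount itself rather than from the weights.

```python
# Hypothetical weight table keyed by (hop, direction, category).
# The numeric values are illustrative assumptions, not published weights.
HOPS = ("direct", "indirect")
DIRECTIONS = ("sending", "receiving")

example_weights = {
    ("direct", "sending", "mixer"): 0.8,
    ("direct", "receiving", "mixer"): 0.5,
    ("indirect", "sending", "mixer"): 0.4,
    ("indirect", "receiving", "mixer"): 0.2,
}


def respects_design_criteria(weights, categories):
    """Check criteria 2 and 3 for every category in the weight table."""
    for category in categories:
        # Criterion 3: direct exposure outweighs indirect exposure.
        for direction in DIRECTIONS:
            direct = weights[("direct", direction, category)]
            indirect = weights[("indirect", direction, category)]
            if direct <= indirect:
                return False
        # Criterion 2: sending exposure outweighs receiving exposure.
        for hop in HOPS:
            sending = weights[(hop, "sending", category)]
            receiving = weights[(hop, "receiving", category)]
            if sending <= receiving:
                return False
    return True


print(respects_design_criteria(example_weights, ["mixer"]))  # True
```

A check like this could run whenever weights are re-tuned, so that tuning does not silently break the intended ordering.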

2.2 Friendly math explanation

\text{ML Risk Score}=  \text{w}_{m} \cdot \text{base risk} + \text{w}_{e} \cdot \text{exposure risk}  + \text{w}_{a} \cdot \text{risk adjustment}

Base risk: it is determined by annotation information of a given crypto address. This includes the category (e.g. Scam) and its corresponding generic risk level, the entity (e.g. Coinbase), and/or its reputation for AML compliance.
Exposure risk: it is derived from the semantic view of historic transactions of a target crypto address. We aggregate the data partitioned by indirect/direct and categories in order to compute exposures.

\text{exposure risk} = \sum_{i, j, k} w_{i,j, k} \cdot \text{exposure}(i, j, k)  \\ \text{where}, i \space \in \text{\{direct, indirect\}} \\ j \space \in \text{\{sending, receiving\}} \\ k \space \in \{\text{categories}\}

Risk adjustment: the adjustment score can play a significant role in refining the final risk score. The factors triggering an adjustment are expected to grow as industry-wide understanding and practices develop. At the current stage, adjustments are triggered by suspicious activities, for example: rapid fund withdrawals, peel-chain nodes, abnormal cross-chain transactions, and high-frequency small-value transactions.
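The two formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production implementation: the function names, the choice to express exposures as fractions of total volume, and every numeric weight and input below are hypothetical.

```python
def exposure_risk(exposures, weights):
    """Weighted sum over (hop, direction, category) exposure buckets.

    Buckets without a configured weight contribute nothing (assumption).
    """
    return sum(weights.get(key, 0.0) * value for key, value in exposures.items())


def ml_risk_score(base_risk, exposure, adjustment, w_m, w_e, w_a):
    """Linear combination of the three components."""
    return w_m * base_risk + w_e * exposure + w_a * adjustment


# Illustrative inputs: exposures as fractions of total transaction volume.
exposures = {
    ("direct", "sending", "mixer"): 0.5,
    ("indirect", "receiving", "scam"): 0.25,
}
weights = {
    ("direct", "sending", "mixer"): 8.0,
    ("indirect", "receiving", "scam"): 4.0,
}

e = exposure_risk(exposures, weights)  # 8.0 * 0.5 + 4.0 * 0.25 = 5.0
score = ml_risk_score(base_risk=6.0, exposure=e, adjustment=2.0,
                      w_m=0.5, w_e=0.25, w_a=0.5)
print(e, score)  # 5.0 5.25
```

Because every term is a plain product of a weight and an input, each contribution to the final score can be reported individually, which is what keeps the score explainable.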

2.3 Example

In this example, the category of this address is "Scam". Additionally, there are suspicious cross-chain activities associated with this address. The overall ML risk score is 8.0, indicating a high level of risk. In the Insights section, we use an LLM (Large Language Model) to generate explainable natural-language descriptions that give our customers detailed information about the recommended actions and an explanation of the risk score.