Author: Vivo Internet Algorithm Team – Shen Jiyi

This article is compiled from Mr. Shen Jiyi's live talk at the 2022 vivo Developer Conference. Reply with [2022 VDC] on the official account to obtain the materials for the Internet technology track.

The blending layer is responsible for integrating the results of multiple heterogeneous queues, such as advertisements, games, and organic content. It must find the optimal solution under multiple constraints from upstream, downstream, and the business, which makes it complex and hard to control. This article introduces the vivo advertising strategy team's exploration and thinking on blending in the information flow (feed) and the app store, from both the business and the model perspective.


1. Background introduction

First, what is blending? As shown in the figure, blending means reasonably mixing heterogeneous content from different queues, under the premise of protecting user experience, so as to maximize revenue and better serve both advertisers and users.

The core challenges of blending are:

  1. Different queues are modeled toward different objectives, so their scores cannot be compared directly. For example, some queues are ranked by CTR while others are ranked by eCPM.

  2. Candidate queues are constrained by many product rules, such as interval constraints (minimum spacing between ads), volume guarantees, and first-position constraints.

  3. The candidate queues are produced by upstream fine-ranking services, and business rules forbid reordering items within a queue during blending; that is, blending must be order-preserving.
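As a toy illustration of the comparability gap in the first challenge (the CPC-to-eCPM conversion below is the standard industry identity, not vivo's disclosed method, and the candidate items are made up):

```python
# Putting a CTR-ranked CPC item and an eCPM-ranked item on one
# comparable scale. For a CPC ad, eCPM = CTR * bid-per-click * 1000.

def ecpm_from_ctr(ctr: float, bid_per_click: float) -> float:
    """Expected revenue per 1000 impressions for a CPC item."""
    return ctr * bid_per_click * 1000.0

# Hypothetical candidates from two heterogeneous queues.
cpc_item = {"id": "ad_1", "ctr": 0.02, "bid": 0.5}   # CTR-modeled queue
ecpm_item = {"id": "ad_2", "ecpm": 12.0}             # eCPM-modeled queue

score_1 = ecpm_from_ctr(cpc_item["ctr"], cpc_item["bid"])  # ~10.0
score_2 = ecpm_item["ecpm"]                                # 12.0
best = "ad_1" if score_1 > score_2 else "ad_2"
```

In practice the conversion is rarely this clean, which is exactly why a dedicated blending model is needed.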

This article focuses on blending in vivo's information flow and app store scenarios.

vivo's information flow scenarios include the browser, i Video, the negative-one screen, and so on. They are characterized by many placements, deep scrolling, diverse ad formats, and strong demand for personalization. The app store, by contrast, is a single vertical scenario: it must balance advertising, games, and organic volume, and find a comprehensively optimal solution under strict requirements such as volume guarantees and user experience. The two scenarios are introduced in turn below.

2. Blending Practice in the Information Flow

2.1 Overview of information flow blending

We begin with blending practice in the information flow scenario.

For the feed, as shown in the figure below, the main problem the blending layer solves is merging the content queue and the ad queue: how to insert advertisements at appropriate positions while balancing user experience and advertiser interests.

In traditional feed media, early blending was mostly based on fixed-slot templates; that is, operations staff manually decide how ads are interleaved with content. This is simple and direct, but it brings three obvious problems:

  1. On the user side, ads appear with the same probability in both preferred and non-preferred scenarios, hurting user experience.

  2. On the business side, traffic is not delivered accurately, serving efficiency is low, and the advertiser experience suffers.

  3. On the platform side, resource misallocation wastes platform resources.

2.2 Research on Industry Solutions

Next, we will introduce several common solutions in the industry.

Take a workplace social platform's solution as an example. Its objective is to maximize revenue subject to the user experience value staying above a threshold. For each candidate ad, the user experience impact is monetized and combined with the commercial value into an overall value.

If the ad's overall value exceeds that of the competing content, the ad is placed; otherwise the content is placed. Constraints such as ad intervals are also enforced at delivery time, as shown in the figure on the right.

This method is simple and direct, and many teams have adopted similar solutions with good results. However, it scores each item in isolation, ignoring interactions between items, and it does not account for long-term value.

Next is the scheme of a short-video platform, which blends via reinforcement learning. It abstracts feed blending as a sequence insertion problem: inserting different ads into different slots corresponds to different actions, selected by the reinforcement-learning policy. The reward design combines advertising value (e.g., revenue) with user experience value (e.g., scroll depth and exits), balancing the two with hyperparameters.

However, this solution is engineering-heavy and, in the paper, was mainly validated offline, with little online analysis. The model also only considers inserting a single advertisement, not multiple ads.

In the vivo feed, blending has gone through three stages: fixed-slot blending, Q-learning blending, and deep solution-space blending.

The overall idea was to start with a simple Q-learning scheme to accumulate samples and quickly capture gains, then upgrade to a deep-learning solution.

2.3 Q-learning blending

The above is the basic loop of reinforcement learning. Its defining feature is learning through interaction: the agent continuously learns from the rewards or penalties it receives while interacting with the environment, adapting to it over time. State, reward, and action are the three most critical elements, each expanded on below.

What does Q-learning blending bring to the vivo feed? First, it considers whole-page and long-term value, which suits scenarios with deep browsing. Second, the Q-learning model can iterate in small steps, verifying effects quickly while accumulating samples.

In the current architecture, the blending system sits behind the ADX. After receiving the content queue and the ad queue, the Q-learning model emits a weight-adjustment coefficient that re-weights the ads; business strategies are then applied on top to produce the fused queue. User behavior in turn triggers Q-learning model updates.

The Q-learning model works as shown in the figure: first initialize the Q-table, then select an action and update the Q-table according to the reward that action earns; the loss accounts for both short-term and long-term value.
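The loop described above can be sketched as minimal tabular Q-learning. The states, actions, and hyperparameters below are placeholders for illustration, not vivo's design:

```python
import random
from collections import defaultdict

ACTIONS = [0.8, 1.0, 1.2]    # hypothetical ad-weighting coefficients
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

q_table = defaultdict(float)  # (state, action) -> Q value, initialized to 0

def choose_action(state):
    if random.random() < EPS:                                # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])   # exploit

def update(state, action, reward, next_state):
    # TD target mixes the immediate (short-term) reward with the
    # discounted best next-state value (long-term benefit).
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])

random.seed(0)
update("s0", 1.0, reward=2.0, next_state="s1")
chosen = choose_action("s0")   # now prefers the rewarded action
```

One update moves Q("s0", 1.0) from 0 to 0.2; repeated interaction gradually sharpens the table.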

In vivo's practice, the reward design combines user experience metrics such as dwell time with advertising value; after smoothing both, they are traded off via hyperparameters. For the action design, the first phase used a numerical scheme: the model generates an ad-weighting coefficient that is applied to the ad's fine-ranking score before it competes with the content side.

The state design has four parts: user features, context features, content-side features, and ad-side features. Statistical and contextual features have a particularly large impact on the Q-learning model.
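A sketch of the reward and numerical action just described. The log1p smoothing and the weight `beta` are illustrative assumptions; the talk does not disclose vivo's exact functional form:

```python
import math

def reward(duration_s: float, ad_revenue: float, beta: float = 0.5) -> float:
    # Smooth both signals, then trade them off with a hyperparameter.
    return (1 - beta) * math.log1p(duration_s) + beta * math.log1p(ad_revenue)

def blended_ad_score(fine_rank_score: float, coeff: float) -> float:
    # Numerical action: a coefficient that re-weights the ad's
    # fine-ranking score before it competes with content-side scores.
    return fine_rank_score * coeff

r = reward(duration_s=120.0, ad_revenue=3.0)
s = blended_ad_score(fine_rank_score=8.0, coeff=1.2)
```

Raising `beta` tilts the blend toward revenue; lowering it protects dwell time.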


In the vivo feed, Q-learning blending has achieved good results and now covers most scenarios.

2.4 Deep position-based blending

Q-learning blending has clear limitations:

  • The Q-table is structurally simple and holds little information.

  • The Q-learning model can use only a limited set of features, making it hard to model fine-grained behavior such as sequences.

  • The current Q-learning blend depends on upstream scores, so upstream score fluctuations cause effect fluctuations.

To address these problems, we developed deep position-based blending. In the blending mechanism, the original numerical action was upgraded to a positional one that directly generates positions; in the model itself, Q-learning was upgraded to deep learning.

This brings three benefits:

  1. Decoupling from the upstream, which greatly improves blending stability.

  2. A deep network can hold far more information.

  3. The model can consider item interactions within a page.

Our overall model is a mainstream two-tower DQN-like architecture. The left tower mainly takes state information such as user attributes and behaviors; the right tower takes action information, i.e., the basic information of a candidate arrangement in the solution space.


It is worth mentioning that the arrangement chosen for the previous refresh is fed into the current model as a feature.
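A toy forward pass of such a two-tower Q network: the left tower encodes state features, the right tower encodes a candidate arrangement, and a dot product yields Q(s, a). All sizes, weights, and the single hidden layer are illustrative, not vivo's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, w1, w2):
    h = np.maximum(w1 @ x, 0.0)   # one ReLU hidden layer
    return w2 @ h                  # embedding for the dot product

state_dim, action_dim, hidden, out = 16, 8, 32, 8
w1_s, w2_s = rng.normal(size=(hidden, state_dim)), rng.normal(size=(out, hidden))
w1_a, w2_a = rng.normal(size=(hidden, action_dim)), rng.normal(size=(out, hidden))

state = rng.normal(size=state_dim)           # user attributes, behaviors, ...
actions = rng.normal(size=(5, action_dim))   # 5 candidate arrangements

s_emb = tower(state, w1_s, w2_s)
q_values = np.array([s_emb @ tower(a, w1_a, w2_a) for a in actions])
best_action = int(np.argmax(q_values))       # arrangement with the highest Q
```

The state tower is computed once per request; only the cheap action tower runs per candidate, which is the usual efficiency argument for two-tower designs.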

The new solution-space model has a larger action space and a higher ceiling. However, sparse actions are hard to learn well, which can lead to inaccurate predictions. To solve this, we run small-traffic random experiments online to improve the hit rate of sparse actions and enrich sample diversity.

Sequence features are among the most important features of the model, and among the key features describing state in the reinforcement-learning setup, so we optimized them in several ways. In the sequence attention module, to measure how well the ad to be inserted matches the user's historical interests, we use a transformer to encode the user behavior sequence, then compute attention between the candidate ad and that sequence to describe the degree of match. In the sequence match module, we introduce prior information to generate strong cross features that complement attention; the match weights draw on CTR, hit-or-not signals, time decay, and TF-IDF.
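The core of the sequence-attention idea can be sketched as scaled dot-product attention: score the ad to be inserted against each item in the behavior sequence and pool the sequence by those weights. Dimensions and inputs are made up; a real transformer encoder would sit in front of this:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ad_sequence_attention(ad_emb, seq_embs):
    d = ad_emb.shape[0]
    scores = seq_embs @ ad_emb / np.sqrt(d)  # match ad vs. each behavior
    weights = softmax(scores)                # attention over the sequence
    return weights @ seq_embs                # interest vector w.r.t. this ad

rng = np.random.default_rng(1)
ad = rng.normal(size=8)                # embedding of the ad to insert
behaviors = rng.normal(size=(10, 8))   # embeddings of the last 10 behaviors
interest = ad_sequence_attention(ad, behaviors)
```

The pooled vector changes with the candidate ad, so downstream layers see a per-ad summary of how the history matches it.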

3. App Store Blending

3.1 Overview of store blending

Next, we introduce the app store blending module.

The core problem of store blending is merging the advertising queue and the game queue. As shown in the figure, ads and games define their ranking scores differently, so direct comparison is hard. In addition, jointly operated games have long payback cycles and their LTV is hard to estimate; even sorting everything by eCPM would not guarantee good results.

Sorting out the problems faced by the app store, the core challenges are:

  1. Many business parties are involved; comprehensive optimization must be achieved while satisfying the requirements of user experience, advertising, and games.

  2. Store blending carries volume-guarantee demands that are not tied to overall revenue; pursuing overall revenue will inevitably disturb the guaranteed volumes and cause conflicts. How can we optimize globally while honoring the guarantees?

  3. Unlike the feed, the store is a high-cost, high-consideration scenario with sparse user behavior; many users download only once in a long while.

  4. Game LTV estimation is an industry-wide hard problem. How can the blending side leave some room for error in game LTV?

Returning to blending in the vivo app store, the overall iteration went through four stages: fixed-position blending, PID volume preservation, constrained blending, and refined traffic splitting for blending.

3.2 PID volume preservation

We first introduce the PID scheme; PID control originated in the automation field. In the early stage, to meet business demands, we followed mainstream industry practice and built an initial blending capability by preserving delivery volumes for ads and games. However, the scheme is simple: PID is hard to tie to a revenue target, so revenue-optimality is hard to achieve.
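A minimal sketch of PID-based volume preservation: the controller nudges an ad-weighting coefficient so the delivered ad ratio tracks a target quota. The gains, the 30% target, and the simulated linear response are all illustrative stand-ins:

```python
class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_err = None

    def step(self, measured):
        err = self.setpoint - measured
        self.integral += err
        deriv = 0.0 if self.prev_err is None else err - self.prev_err
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

pid = PID(kp=0.5, ki=0.1, kd=0.05, setpoint=0.30)  # target: 30% ads

coeff, ratio = 1.0, 0.20
for _ in range(50):
    coeff += pid.step(ratio)
    # Toy plant: delivered ad ratio is proportional to the coefficient.
    ratio = min(1.0, max(0.0, 0.30 * coeff))
```

Note the controller only tracks the quota; nothing in the loop knows about revenue, which is exactly the limitation described above.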

3.3 Constrained blending

Volume guarantees and revenue maximization conflict to some degree. Achieving the best comprehensive business revenue under the volume-preservation constraint is the biggest difficulty.

vivo's store blending adopts a split-and-refine idea: rearrangement is applied after PID volume preservation, jointly weighing user experience, advertising revenue, and game value. To defuse the conflict between rearrangement and PID volume preservation, rearrangement takes effect only for some positions, such as below the first screen, so revenue exploration happens on part of the traffic while the volume guarantees are still met.

At the rearrangement layer, we initially considered reusing the feed's reinforcement-learning blending scheme, but two issues arose:

  1. Rearrangement only takes effect on the first refresh, so the conventional reinforcement-learning state transition is missing.

  2. Compared with the feed, the store involves more business parties; trading off user experience, advertising revenue, and game value is a more complicated issue.

To adapt to the characteristics of the store scenario, we made some adaptations and optimizations:

  1. First, the loss. Unlike conventional reinforcement learning, because store behavior is sparse, the mechanism only takes effect on the first screen, and state transitions are missing, we set gamma to 0; the whole problem then resembles supervised learning, which improves system stability.

  2. For the reward, we jointly considered whole-page game revenue, advertising revenue, and user experience to pursue optimal revenue.

  3. For the action design, the earlier phase still used a numerical scheme.
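The effect of setting gamma to 0 is easy to see in the temporal-difference target (toy numbers only): the next-state value drops out, leaving a supervised-style regression on the one-step reward.

```python
def td_target(reward, gamma, best_next_q):
    # Standard Q-learning target: immediate reward plus discounted
    # best value of the next state.
    return reward + gamma * best_next_q

# Conventional RL: the next state's value leaks into the target.
t_rl = td_target(1.0, 0.9, 2.0)      # ~2.8
# Store blending with gamma = 0: the target is just the observed reward.
t_store = td_target(1.0, 0.0, 2.0)   # 1.0
```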

This version achieved good results in vivo store blending and has been fully rolled out.

3.4 Blending with refined traffic splitting

Building on constrained rearrangement, we asked whether further optimization was possible:

  1. First, the rearrangement candidate set is generated by PID, so it is not globally optimal.

  2. Second, when the candidate set is all ads or all games, rearrangement has no effective room to work (such traffic accounts for more than half).

So how can we meet the volume guarantees while further optimizing revenue?

We began experimenting with refined traffic splitting for blending: for part of the traffic, the volume-guarantee limit is removed and the constraints are relaxed. PID then focuses on meeting business demands such as volume preservation, while the model focuses on exploring a better solution space.

In the current version, when a request arrives, a splitting module judges whether it is high-quality traffic. High-quality traffic goes through the blending model for revenue exploration; low-quality traffic goes through PID for volume preservation; the final results are then fused. In this way, the rearrangement strategy can take full effect on part of the traffic, while overall volumes stay within the normal range.
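The split-and-fuse flow above can be sketched as a simple router. The quality rule and both handlers are hypothetical placeholders, not vivo's actual logic:

```python
def is_high_quality(request) -> bool:
    # Hypothetical rule: route by a precomputed commercial-value score.
    return request.get("commercial_value", 0.0) >= 0.5

def blend_with_model(request):
    # Stand-in for the revenue-exploring rearrangement model.
    return {"source": "model", **request}

def blend_with_pid(request):
    # Stand-in for PID volume preservation.
    return {"source": "pid", **request}

def route(request):
    handler = blend_with_model if is_high_quality(request) else blend_with_pid
    return handler(request)

results = [route(r) for r in (
    {"id": 1, "commercial_value": 0.9},
    {"id": 2, "commercial_value": 0.1},
)]
```

Fusing the two result streams keeps the overall delivered volume close to what PID alone would produce.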

So far, the splitting signals we have tried include commercial value, game preference, ad placement, and experience-mechanism splits.

We also iterated on the rearrangement model itself, since the rearrangement layer's numerical model had several problems. To address them:

  • We replaced the numerical model with a generative model that directly generates the blending result, decoupling it from the upstream.

  • Drawing on the context-DNN idea, we adopted a context-aware method, incorporating contextual influence into both the generation procedure and the label design.

Compared with the original model, this model shows clearer gains on experimental traffic, and it is more stable because it is unaffected by upstream score fluctuations.

4. Future Outlook

The outlook for the future covers four aspects:

  1. Model optimization: deepen blending optimization with more refined and more personalized modeling, integrate more real-time feedback signals, and improve model performance.

  2. Cross-scene linkage: try cross-scenario linkage and blending to achieve the optimal exchange ratio and global optimality across scenes.

  3. Unified paradigm: establish a unified blending paradigm of sequence generation plus sequence evaluation across scenarios.

  4. On-device blending: try on-device blending to capture user interest more promptly and improve user experience.

Heterogeneous blending has met many challenges during vivo Internet's exploration, and it has also delivered real gains.

Interested readers are welcome to leave a comment for exchange and discussion.


END

