The optimization practice of in-depth unified rough ranking in Taobao’s main search - Big Taobao Technology

Two-stage sorting (rough sorting followed by fine sorting) was originally proposed as a framework to work around system performance limits. For a long time, rough sorting has been positioned as a degraded version of fine sorting. However, we found that with suitable technical means rough sorting can surpass fine sorting at selecting the candidate set. By re-examining the relationship between rough sorting and fine sorting, proposing a new evaluation system called global hitrate, and combining it with sampling optimization, distillation, and other means, we increased the overall transaction amount of search by about 1.0%.

Background

▐ overview

Taobao main search is a typical multi-stage retrieval system, mainly divided into recall, rough sorting, and fine sorting stages. The recall stage is composed of multiple recall routes such as text recall and personalized recall, and outputs products on the order of 10^5. The rough sorting stage screens each of the three recall sets separately and passes on the order of 10^3 products to fine sorting. Fine sorting and the later stages screen further and output roughly the top 10 for exposure to users. (Note: 10, 10^3, 10^5, etc. in the following all represent orders of magnitude; the values are only for illustration, and only their relative sizes are meaningful.)

The essence of rough sorting (sometimes called sea selection in the main search) is to output the best set from a large candidate set. Although both stages sort, the goal of rough sorting is very different from that of fine sorting and is actually more similar to that of recall. Because the task still has to be done by sorting, common papers and methods tend to keep imitating and approximating fine sorting. After nearly two years of exploration and practice in the main search, we can summarize the biggest difference between the two in terms of goal: fine sorting focuses on ordering head products, while rough sorting focuses on ordering waist products. Many industry methods for improving the consistency between fine sorting and rough sorting have no substantial effect in the main search, for two main reasons. First, fine sorting uses only the exposure set as training samples. Our offline verification shows that even if the fine sorting model is used directly for scoring in the rough sorting stage, the results are terrible, because its scores on the large number of unseen samples tend to be random. Second, the task of rough sorting is to select a high-quality subset from the recall set. If the selected subset is of sufficient quality, the follow-up problem is not necessarily to make rough sorting consistent with fine sorting; it may instead be that fine sorting needs to become consistent with rough sorting.

Last year, our team adopted the “three tricks” of multi-objective optimization + negative sample expansion + listwise distillation and achieved significant gains in the rough sorting stage of the main search. Verification with the hitrate on each recall route showed that the rough sorting model significantly exceeded the fine sorting model, but due to time constraints we only made a relatively one-sided attempt. We have since back-tested all of those improvements and made further improvements and enhancements.

In terms of evaluation indicators, last year we only had a one-sided indicator (NDCG) that measures consistency with fine sorting. It cannot measure the recall -> rough sorting loss, and therefore cannot measure the quality of negative sampling for rough sorting, so there was a large deviation between the goal of rough sorting and its actual optimization direction. Therefore, this year we introduce a brand new evaluation metric, “global transaction hitrate”, as the most important evaluation standard for rough sorting. Over a month of systematic analysis, we studied the loss of global transactions along the recall -> rough sorting -> fine sorting funnel, together with the online GMV changes of different optimizations. This not only validated the effectiveness of the indicator, but also standardized and unified, to a certain extent, the optimization space and optimization goals of each stage. For the offline optimization goals of the rough sorting stage, the global hitrate needs to be split further, because there are natural differences between before and after rough sorting and between inside and outside the scene. Through analysis and verification we finally propose two types of evaluation indicators, describing the “rough sorting -> fine sorting loss” and the “recall -> rough sorting loss” respectively. After analysis and refinement, these rough sorting evaluation indicators show a strong positive correlation with online indicators.

Once the indicators for “rough sorting -> fine sorting loss” and “recall -> rough sorting loss” are established, the two problems mentioned at the beginning, “inconsistency with the goal of fine sorting” and “inconsistency between the rough sorting sample space and the online scoring space”, are reflected in these two indicators respectively, and we can use technical means to alleviate them. In addition, for long-tail products, last year we only solved the problem of “over-scoring” but actually aggravated the problem of “under-scoring”. We have made some attempts on this as well; the specific methods are elaborated in Section 4.

▐ Model infrastructure

For the convenience of readers, in this section, we briefly introduce the optimization results of the rough ranking model last year. This is the basis for this year’s rough model optimization.

The selection of training samples is one of the important differences between the rough sorting model and the fine sorting model. To fit the scoring space of the rough sorting stage, the training samples of the rough sorting model consist of three parts: exposed samples, unexposed samples, and random negative samples. Exposed samples are those exposed after being sorted by the fine sorting model; unexposed samples are those sent to fine sorting by the rough sorting model but not exposed in the fine sorting stage; random negative samples are randomly sampled from the categories related to the query. The samples are organized listwise: the three kinds of samples under one request are spliced into a list of length $N+M+K$, where $N$, $M$, and $K$ denote the numbers of exposed samples, unexposed samples, and random negative samples, respectively.

Last year this sample construction was based only on speculation and analysis. After the global transaction indicator was proposed this year, we back-tested this work and found that expanding the two kinds of negative samples, unexposed samples and random negative samples, brings an out-of-scene hitrate improvement of about 5.5 pt.

In terms of loss function, we introduce three optimization objectives, exposure, click, and transaction, so that the user-query vector and the item vector are optimized jointly under multiple objectives. For each task, the logits of the samples in a list first go through softmax and the NLL loss is computed; the losses of all tasks are then added to obtain the final loss. Based on the listwise sample organization, we hope to order the logits of different samples through multiple progressive tasks, establishing the correspondence Exposure (PV) -> Click -> Transaction (Pay), in which the positive examples of each task are a subset of those of the preceding task. We hope the multi-objective rough sorting model learns the ordering of the products it scores: products the user is most likely to buy are ranked first, followed by products the user may click, then products the fine sorting model may expose, and finally products that are merely relevant. In addition, to further improve consistency with the fine sorting model, we added a distillation task that lets the rough sorting model learn the fine sorting model’s scores on the exposed samples.

In terms of model structure, we follow the inner product-based structure widely used in the industry. After computing the user-query vector and the item vector separately, we take their inner product as the similarity (logit). The overall model input and output are shown in the figure below. (Note: although there are many kinds of item vectors in the figure, they share one network structure and the corresponding trainable parameters.)
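The loss and model structure described above can be illustrated with a minimal NumPy sketch. It is not the production implementation: the function names, vector sizes, and toy labels are assumptions made for illustration, and the distillation term is left out.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def listwise_nll(logits, labels):
    """Softmax NLL over one list; labels is a 0/1 vector that may hold several positives."""
    return float(-(labels * log_softmax(logits)).sum())

def multi_objective_loss(user_vec, item_vecs, pv, click, pay):
    """Sum of the exposure, click and transaction listwise losses for one request."""
    logits = item_vecs @ user_vec  # inner-product similarity, shape (N + M + K,)
    return sum(listwise_nll(logits, y) for y in (pv, click, pay))

# Toy request (hypothetical sizes): N=3 exposed, M=1 unexposed, K=1 random negative.
rng = np.random.default_rng(0)
user_vec = rng.normal(size=8)
item_vecs = rng.normal(size=(5, 8))
pv = np.array([1.0, 1.0, 1.0, 0.0, 0.0])     # exposed items
click = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # clicked items, a subset of exposed
pay = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # purchased items, a subset of clicked
loss = multi_objective_loss(user_vec, item_vecs, pv, click, pay)
```

Because the three label vectors are nested, an item that was purchased contributes to all three task losses, which is what pushes it toward the top of the list.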

Global funnel space analysis (preliminary test of new indicators)
In this section, we use the newly proposed global transaction hitrate indicator to analyze the full search funnel. Global transactions can be divided into two types: in-scene transactions and out-of-scene transactions. In-scene transactions are all transactions generated by users in the search scenario, and out-of-scene transactions are transactions generated by the same user in non-search scenarios. Since there is no query in non-search scenarios, we use relevance as the association condition: a user’s out-of-scene transaction item is associated with that user’s in-scene query only if the resulting query-item pair satisfies certain relevance conditions.

We attribute the scoring set of the rough sorting stage to the corresponding in-scene and out-of-scene transactions through event tracking logs. Specifically, the multi-way recall results are sorted uniformly by the rough sorting model score and truncated at Top K, and hitrate@K is computed for both search-led transactions and relevant non-search transactions; see the figure below:

Rough model Top K hitrate

Because search itself is constrained by relevance, the user-item pair cannot be used directly for attribution, so out-of-scene transactions that do not match the search intent must be filtered out by relevance. It can be seen that as K decreases from the recall size to 10^3, the hitrate of in-scene and out-of-scene transactions decays very slowly; below 10^3 the decay gradually accelerates. Therefore, from the perspective of out-of-scene hitrate, for the task of outputting on the order of 10^3 high-quality products from the recall set, the remaining headroom for rough sorting is limited.

In addition to what is shown in the figure, we have also conducted other related analysis, and an interesting conclusion is that the hitrate of rough sorting may be higher than that of fine sorting at 10^3~10^4. This conclusion will be discussed further in 4.3.2 below.
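A small Python sketch may make hitrate@K concrete. It assumes hitrate@K is the fraction of attributed transaction items that fall inside the top K of the rough sorting ranking for the matching request; the data layout and names are hypothetical, and the relevance filtering of out-of-scene transactions is assumed to have happened upstream.

```python
def hitrate_at_k(ranked_by_request, deals_by_request, k):
    """Share of transaction items that appear in the top-k ranked items of their request."""
    hits = 0
    total = 0
    for request_id, ranked_items in ranked_by_request.items():
        deals = deals_by_request.get(request_id, set())
        top_k = set(ranked_items[:k])
        hits += len(deals & top_k)
        total += len(deals)
    return hits / total if total else 0.0

# Toy data: items already sorted by rough sorting score, best first.
ranked = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
deals = {"q1": {"b", "d"}, "q2": {"z"}}

curve = {k: hitrate_at_k(ranked, deals, k) for k in (2, 3, 4)}
```

Evaluating the same function over a range of K values gives the kind of decay curve discussed above.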

Analysis and Correction of Rough Sorting Offline Metrics

▐ Measuring rough sorting -> fine sorting loss

In-scene transaction hitrate at the head of rough sorting, such as hitrate@10
The goal of fine sorting is to select the Top 10, or even the Top 1, that is of the best quality and fits the bias of the search scenario (that is, can be sold in the search scenario). So if the in-scene transaction hitrate@10 of rough sorting improves, rough sorting naturally becomes more consistent with fine sorting to a certain extent.
NDCG / inverse pairs between the rough sorting total score and the fine sorting efficiency score on exposed products
This is the indicator we have used consistently in recent years. It focuses on consistency with fine sorting and mainly reflects how the rough sorting model compares with the deep fine sorting model in terms of model, features, and objectives.
AUC
Because the two indicators above are sometimes too expensive to compute, or because we want to compare with the AUC of fine sorting, we also compute AUC at the granularity of each request. The loss of the rough sorting model is a listwise sorting loss, so the absolute value of the output score has no physical meaning and scores under different requests cannot be compared directly; a total AUC therefore cannot be computed.
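The request-level AUC can be sketched as follows. This is an illustrative pure-Python version under the assumption that the per-request values are simply averaged; how requests are weighted in the real evaluation is not stated in the text.

```python
def request_auc(scores, labels):
    """Pairwise AUC inside one request; returns None if it lacks positives or negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_request_auc(requests):
    """Average the per-request AUC, skipping requests where it is undefined."""
    aucs = [a for a in (request_auc(s, y) for s, y in requests) if a is not None]
    return sum(aucs) / len(aucs) if aucs else None

# Scores live on different scales per request, so they are never compared across requests.
requests = [
    ([0.9, 0.2, 0.5], [1, 0, 0]),     # positive ranked first
    ([10.0, 30.0, 20.0], [1, 0, 0]),  # positive ranked last
    ([1.0, 2.0], [0, 0]),             # no positive, skipped
]
overall = mean_request_auc(requests)
```

Comparing scores only within a request sidesteps the fact that listwise scores from different requests are not on a common scale.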

▐ Measuring recall -> rough sorting loss

In-scene hitrate@10^3
The in-scene hitrate of the full rough sorting output set is necessarily 100% in evaluation, so it can only be approximated with a set that is smaller than the rough sorting output by less than an order of magnitude. The exact cutoff can be defined according to the scoring volume of each scenario.
Out-of-scene hitrate@10^3
Since out-of-scene transactions are basically free of search bias, out-of-scene indicators should be used as the benchmark for judging the ability of the rough sorting model.

▐ Consistency analysis of offline hitrate increase and hitrate increase in online A/B test

In the scene

The offline improvement must differ from the online one. After the rough sorting model goes online, the candidate set of fine sorting undergoes an obvious distribution shift, so the absolute value of the final indicator must be different. For in-scene analysis we can therefore only look at the final number of in-scene transactions. Observing the rough sorting hitrate@10^3 after launch is not meaningful, because that part of the online result is itself also determined by fine sorting.

Outside the scene, offline improvement differs from online:

Transformation from out-of-scene to in-scene: if the increase in out-of-scene hits is converted into in-scene transactions, the number of in-scene transactions and the ratio of in-scene to out-of-scene transactions both increase significantly. After these out-of-scene transactions are converted into in-scene ones, the remaining out-of-scene transactions become “harder” for the current model, so a smaller online gain is in line with expectations. In this case we have actually achieved our ultimate goal, which is to increase in-scene transactions.

Problem with the online deployment: if the number of in-scene transactions does not increase and the out-of-scene improvement shrinks, there is most likely a problem with how the model takes effect online, and the specific cause needs to be checked. Common problems include inconsistencies between offline and online features or models.

Outside the scene, offline improvement is the same as online (that is, out-of-scene transactions are not converted into in-scene ones):

Check whether the out-of-scene hitrate also increases in the stages after rough sorting (fine sorting, etc.). If it does not, the products newly brought in by rough sorting are not recognized by the subsequent stages, and the difference may need to be located from the perspective of features, samples, loss, and other aspects. Whether to adjust rough sorting or fine sorting and the other stages has to be decided according to the specific situation.

Optimization

Our optimization work builds on the model introduced in the “Model infrastructure” section. According to the analysis in Section 3, the optimization of the model has two parts. The first is to reduce the loss, in the fine sorting stage, of products output by the rough sorting stage; this work mainly focuses on improving the ability of the rough sorting model, including model distillation, feature introduction, and model structure enhancement. The second is to reduce the loss of recalled high-quality products in the rough sorting stage; this work mainly focuses on optimizing the sample distribution.

▐ Reducing rough sorting -> fine sorting loss

Further Expansion of Distilled Samples

In last year’s work, we used the fine sorting model’s scores on exposed samples to guide the training of the rough sorting model. Put simply, we apply softmax to the fine sorting scores over the exposed samples and add a distillation loss. The distillation task has several advantages. On the one hand, it improves the consistency between the fine sorting model and the rough sorting model, making the products output by the rough sorting model more readily accepted by the fine sorting model. On the other hand, the fine sorting model uses more features than the rough sorting model and has a more complex structure, so its capability can be considered significantly better. Adding the distillation task can therefore also speed up convergence and improve the capability of the rough sorting model.

Building on last year’s exposed-sample distillation, we hope to learn more of the fine sorting model’s ability by introducing unexposed samples into distillation. We mainly tried three distillation schemes: (a) adding a separate distillation over the M unexposed samples, for easy control; (b) removing the pv distillation loss in the base and adding one distillation over all N+M+K samples; (c) keeping the distillation loss of the base and adding a distillation over all N+M+K samples, as shown in the figure above.

However, two potential problems arise: (i) the fine sorting scores of the M randomly sampled unexposed samples may be higher than those of exposed samples, which conflicts with the meaning of the pv loss; (ii) distillation on exposed samples fits the fine sorting output under the condition that the posterior P(sea selection -> exposure) = 1. Taking distillation of the fine sorting click model’s output as an example, pctr = pctr * 1 = P(sea selection -> click) can represent the probability from sea selection to click. For unexposed products, however, the posterior P(sea selection -> exposure) = 0, so mathematically pctr can neither be multiplied by 0 nor be used directly as the probability from sea selection to click.

For problem (i), since the exposure set lies in the top 10 while the unexposed samples are drawn at random from positions 10 to 5000, the fine sorting scores of unexposed samples are usually significantly lower than those of exposed products, so the problem can be ignored. One solution to problem (ii) is to borrow the idea of label smoothing and use pctr * scale instead of pctr * 0 to preserve the gradient, where scale << 1.

In the end, scheme (c) performed best. The consistency between the rough sorting model and the fine sorting model improved significantly, with NDCG +0.65pt. In addition, the out-of-scene hitrate improved by 0.3pt, indicating that model capability also improves after more fine sorting scores are introduced.
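The pctr * scale idea can be sketched as below. Several details are assumptions rather than facts from the text: the value scale = 0.1, the choice to normalize the scaled teacher scores into a soft-label distribution, and the use of a softmax cross-entropy as the distillation loss over the N+M+K list.

```python
import numpy as np

def distill_targets(teacher_pctr, exposed_mask, scale=0.1):
    """Soft labels over one list: pctr * 1 for exposed items, pctr * scale otherwise."""
    weights = np.where(exposed_mask, 1.0, scale)  # scale << 1 stands in for pctr * 0
    scaled = teacher_pctr * weights
    return scaled / scaled.sum()  # normalization is an illustrative assumption

def distill_loss(student_logits, targets):
    """Cross-entropy between the soft labels and the student's softmax over the list."""
    shifted = student_logits - student_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(targets * log_probs).sum())

# Toy list: two exposed items followed by two unexposed items (hypothetical scores).
teacher_pctr = np.array([0.30, 0.20, 0.25, 0.05])
exposed_mask = np.array([True, True, False, False])
targets = distill_targets(teacher_pctr, exposed_mask)
loss = distill_loss(np.array([2.0, 1.0, 0.5, -1.0]), targets)
```

With scale strictly positive, unexposed items keep a small non-zero target, so they still contribute gradient, while staying below the exposed items even when their raw teacher score is higher.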

Further Aligning Coarse and Fine Features

The features of the rough sorting model fall into two groups, User/Query side features and Item side features, which are the inputs of the two towers respectively. Based on the features of the fine sorting model, we tried to introduce more features into the rough sorting model, hoping to further enhance its expressive ability and improve its consistency with the fine sorting model.

After experiments, we selected the features with the most significant effect on the rough sorting model, mainly user profile features and users’ long-term transaction sequences, which help the rough sorting model capture user-side information more accurately. The benefit of the new features is clear: in offline evaluation, hitrate increased by 0.4pt. We also tried long-term click sequences, but they brought no improvement in offline experiments. Note that, limited by the two-tower inner product structure, the rough sorting model cannot directly use the cross features that bring significant gains in the fine sorting model. Our exploration in this area is described in the following discussion of cross features.

In the main search scenario, tens of thousands of products are scored in the rough sorting stage. With limited online resources, to balance effectiveness and efficiency, the rough sorting model uses an inner product to compute the similarity between User and Item. Compared with the multi-layer fully connected structure of the fine sorting model, the vector inner product is fast to compute and greatly reduces online computation cost. Its drawback is equally clear: User-side and Item-side information can interact only at the inner product step, so User x Item cross features cannot be introduced. Experience from the fine sorting stage shows that User x Item cross features (such as the number of historical clicks on the item under the current query) significantly improve the model. We therefore considered how to introduce cross features into the rough sorting model while keeping the inner product structure as unchanged as possible.

To introduce cross features into the rough sorting model, we added a Cross Tower on top of the original Item Tower and User Tower, as shown in the figure above. In the online serving phase, the Cross Tower and the User Tower are computed online together. Like the Item Tower, the Cross Tower outputs a vector whose inner product with the User-side vector is computed, and the final score is the sum of the two inner products. In offline experiments, the hitrate of the rough sorting model with cross features increased by 0.2pt, but in the online test we did not observe a significant improvement. Considering that the Cross Tower still requires a large amount of online computation, we have suspended exploration in this direction.
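The scoring rule with the extra tower can be written in a few lines. This sketch only shows the final combination step; the tower networks themselves are omitted, and the vectors and function name are hypothetical.

```python
import numpy as np

def rough_score(user_vec, item_vec, cross_vec):
    """Final logit: <user, item> + <user, cross>, the sum of the two inner products."""
    return float(user_vec @ item_vec + user_vec @ cross_vec)

# Toy vectors standing in for the outputs of the three towers.
user_vec = np.array([1.0, 2.0])
item_vec = np.array([3.0, 4.0])     # Item Tower output, can be precomputed offline
cross_vec = np.array([0.5, -1.0])   # Cross Tower output, computed online per user-item pair
score = rough_score(user_vec, item_vec, cross_vec)
```

Since the inner product is linear, the sum equals the inner product of the user vector with `item_vec + cross_vec`, so the extra online cost comes from producing the Cross Tower vector for every candidate rather than from the scoring step itself.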

Intuitively, an MLP (multi-layer FC) can fit more complex relationships, and in theory its expressive ability should be significantly stronger than that of the inner product, but in the main search scenario it did not deliver the expected effect. We tried two approaches: (1) adding 2 MLP layers after the original inner product, which is easy to launch; (2) using MLPs throughout after attention, as in fine sorting. Neither changed hitrate, NDCG, or AUC significantly in offline tests. To make testing easier, we ran more feature experiments on the fine sorting samples and found that only when the input features increase substantially (including cross features) does the MLP improve AUC significantly over the inner product, and adding that many features may cost more than it gains.

Later, we ran a simple online experiment with the 2-layer MLP. Online response time (rt) increased by 30ms, but the online effect did not improve. The remaining indicators matched the offline test results, in line with expectations from the offline experiments.

▐ Reducing recall -> rough sorting loss

In Section 3, we mentioned the new metric introduced for rough sorting this year: global transaction hitrate. It reduces the bias in evaluation and assesses the ability of the model more objectively. To optimize this indicator directly, we start from the samples and introduce out-of-scene samples (that is, Taobao transactions outside the search scenario) into the training samples, so as to improve the model’s hitrate.

We introduce out-of-scene samples by correcting positive samples. In the original samples, an out-of-scene transaction item either appears as a negative sample of the transaction task or does not appear at all (it may not have been exposed, or may not even have been recalled). To avoid disturbing the original distribution of training samples, we first tried not to introduce new samples: if an out-of-scene transaction item already exists in the original samples, its deal label is set to 1. Experiments show that this correction has almost no impact on the hitrate of the rough sorting stage. Further analysis showed that out-of-scene transaction items that appear in the original samples appear almost exclusively among the exposed samples, and rarely among the unexposed samples or random negative samples. This correction is therefore really a correction of exposed samples: samples that were exposed in search but traded outside the scene are changed from negative to positive. It should be pointed out that exposed samples are exactly the samples that the rough sorting model already output. Even if such a sample is a negative sample of the transaction task, its rough sorting score is already relatively high (because it is a positive sample of the exposure task), so correcting its transaction/click label brings no benefit in the rough sorting stage. Therefore, to increase the out-of-scene transaction hitrate of rough sorting, we must introduce more out-of-scene samples that rough sorting did not output.

Based on the above analysis, we further introduce out-of-scene transaction samples. In addition to correcting the original samples, we add the out-of-scene transaction items that do not exist in the original samples to the exposed samples, and mark them as positive examples of the exposure, click, and transaction tasks at the same time. In this way, the number of transaction samples increased by about 80%. After this sample expansion, the out-of-scene hitrate of the rough sorting model improved by 0.6pt.
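The two sample-side steps above (label correction, then adding missing out-of-scene deals as positives) can be sketched as follows. The dictionary layout and field names `item_id`, `pv`, `click`, and `pay` are hypothetical, chosen only to make the logic explicit.

```python
def add_out_of_scene_deals(samples, out_of_scene_deals):
    """Correct labels of existing samples and append missing out-of-scene deal items."""
    corrected = []
    seen = set()
    for sample in samples:
        sample = dict(sample)  # copy so the input list is left untouched
        seen.add(sample["item_id"])
        if sample["item_id"] in out_of_scene_deals:
            sample["pay"] = 1  # step 1: the item was bought outside search
        corrected.append(sample)
    # Step 2: deals absent from the list become positives of all three tasks.
    for item_id in sorted(out_of_scene_deals - seen):
        corrected.append({"item_id": item_id, "pv": 1, "click": 1, "pay": 1})
    return corrected

samples = [
    {"item_id": "a", "pv": 1, "click": 0, "pay": 0},  # exposed, bought elsewhere
    {"item_id": "b", "pv": 0, "click": 0, "pay": 0},  # random negative
]
augmented = add_out_of_scene_deals(samples, {"a", "c"})
```

In this toy run item "a" only has its transaction label flipped, while item "c", which rough sorting never output, is the kind of sample that the text identifies as actually moving the out-of-scene hitrate.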

Long-tail problems are almost ubiquitous in statistical learning methods, and rough sorting models are no exception. This section presents some explorations on optimizing long-tail products in the rough sorting model, in three parts. We first introduce our criterion for classifying long-tail products. Next, we evaluate the performance of the current model on long-tail products, confirming that the model does score the long-tail product set inaccurately. Finally, we introduce some of our explorations on the sample side and summarize the remaining problems.

We use the historical average daily exposure count of a product as the threshold for delineating the long-tail product set. In the search scenario, exposure count is a natural criterion: products with high exposure get more user feedback and appear more often in the training samples, so they are trained adequately. Conversely, products with little or no exposure rarely appear in the training samples, so the model has rarely seen them and scores them with large errors.

Based on the above long-tail classification criterion, we calculated the rough sorting hitrate for product sets with different exposure levels:

Product set	Out-of-scene transaction hitrate in rough sorting
All products	75%
Low-exposure products	60%

The above analysis shows that, compared with head products with high exposure, the model scores low-exposure long-tail products significantly less accurately, which suggests that the model still has considerable room for optimization on these products. We also found that these products account for a high proportion of overall transactions, so optimizing low-exposure products can also improve overall transactions.

As mentioned above, the model performs poorly on long-tail products because they have few exposed samples, so the model cannot be trained adequately on them. In particular, we found that in training the rough sorting model, random negative samples further suppress the scores of long-tail products. Random negative samples are uniformly sampled from all products in the library that match the predicted categories of the query. Due to the Matthew effect, head products account for only a small part of the library and most products are long-tail, so most random negative samples are long-tail products. During training, random negative sampling thus further amplifies the probability that a long-tail product appears as a negative sample. The model is then more likely to learn the wrong bias that “long-tail products are usually negative samples”, resulting in low scores for long-tail products. To test this conjecture, we counted the proportion of long-tail products in the random negative samples:

	Medium- and low-exposure products	Low-exposure products
Share of random negative samples	98%+	95%+
Share of exposed samples	50%	40%

The table above shows that most random negative samples have low exposure counts, which verifies our conjecture.

To solve this problem, we adjusted the sampling method of random negative samples, with the aim of increasing the share of highly exposed products among them. Borrowing from the sampling method in Word2Vec, we made the sampling probability of a product in random negative sampling depend on its number of historical impressions. After adjusting the sampling method, the proportion of long-tail products changed significantly:

	Medium- and low-exposure products	Low-exposure products
Share of random negative samples (after adjusting the sampling distribution)	85%	70%

Evaluation scope	Low-exposure products	Exposed products
hitrate gain	+1.04pt	-0.32pt

The table above shows the hitrate changes on different product sets after modifying the negative sample distribution. As expected, under the new sampling distribution the out-of-scene hitrate of the model on low-exposure products improved by 1.04pt. However, because the adjusted distribution puts more medium- and high-exposure products into the negative samples, the hitrate on exposed products decreased. This solution does not increase the overall hitrate much, and there may be more room for follow-up improvement.
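For readers unfamiliar with the Word2Vec-style sampling mentioned above, here is a hedged sketch. It assumes the common form in which the sampling weight is the count raised to a power; the exponent 0.75 is the default from the Word2Vec negative sampling scheme and the +1 smoothing is added here so zero-impression products remain sampleable. Neither is confirmed as the exact formula used in this work.

```python
import numpy as np

def negative_sampling_probs(impressions, power=0.75):
    """Sampling probability proportional to (impressions + 1) ** power."""
    weights = (np.asarray(impressions, dtype=float) + 1.0) ** power
    return weights / weights.sum()

# Toy candidate pool: one head product and three long-tail products.
impressions = [10000, 100, 1, 0]
probs = negative_sampling_probs(impressions)

# Draw random negatives by index with the adjusted distribution.
rng = np.random.default_rng(0)
sampled = rng.choice(len(impressions), size=5, p=probs)
```

Compared with uniform sampling, where each of the four products would be drawn with probability 0.25, the head product is drawn far more often, which is the direction of change reported in the tables above. A power below 1 keeps the distribution flatter than sampling in direct proportion to impressions.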

▐ Other optimization work

Formal optimization of the loss function

In the model infrastructure chapter, we introduced the listwise multi-objective loss function used by the rough sorting model, which covers the multi-objective tasks (exposure, click, transaction) and the distillation task. Put simply, for each objective in the multi-objective optimization we form a list of positive and negative samples and want the positive samples to be ranked ahead of the negative samples. For a listwise task, a natural choice is the negative log-likelihood (NLL) loss function:

$$\mathcal{L} = -\sum_{i=1}^{n} y_i \log \frac{\exp(s_i)}{\sum_{j=1}^{n} \exp(s_j)} \tag{1}$$

where $s_i$ is the inner product of the item-side and user-side vectors, that is, their similarity; $y_i \in \{0, 1\}$ is the label of sample $i$; and $n$ is the length of the list. Unlike tasks in other scenarios, in the three tasks of the rough sorting model the number of positive samples (samples with $y_i = 1$) in each list may not be 1; for the exposure task in particular, it is usually 10. We found that in the rough sorting scenario it is unreasonable to apply the softmax optimization objective directly when there are multiple positive samples, and we improved the loss function accordingly to make it better suited to the optimization goal of the rough sorting model.

First, we split out the negative samples in (1). Writing $P$ for the set of positive samples in the list and $N$ for the set of negative samples:

$$\mathcal{L} = -\sum_{i \in P} \log \frac{\exp(s_i)}{\exp(s_i) + \sum_{j \in P \setminus \{i\}} \exp(s_j) + \sum_{k \in N} \exp(s_k)} \tag{2}$$

Through the above conversion, the part contributed by the other samples can be written in the form of the LogSumExp (LSE) function:

$$\mathcal{L} = \sum_{i \in P} \log\Big(1 + \exp\big(\operatorname{LSE}_{j \in N \cup P \setminus \{i\}}(s_j) - s_i\big)\Big) \tag{3}$$

where $\operatorname{LSE}_{j \in S}(s_j) = \log \sum_{j \in S} \exp(s_j)$, $N$ represents the set of negative samples, and $P \setminus \{i\}$ indicates the set of positive samples other than sample $i$.

Now we make some approximations to the above loss function, which help us understand its optimization goal more clearly. First, we approximate LSE by the maximum function:

$$\mathcal{L} \approx \sum_{i \in P} \log\Big(1 + \exp\big(\max_{j \in N \cup P \setminus \{i\}} s_j - s_i\big)\Big) \tag{4}$$

Next, $\log(1 + \exp(x))$ (also known as the SmoothReLU function) is approximated by ReLU, and the maximum over $N \cup P \setminus \{i\}$ is split into its two parts:

$$\mathcal{L} \approx \sum_{i \in P} \max\Big(0,\ \max\big(\max_{k \in N} s_k,\ \max_{j \in P \setminus \{i\}} s_j\big) - s_i\Big) \tag{5}$$

It is not difficult to see from the above form that, for each positive sample $i$, the loss function tries to enlarge the distance between the positive sample $i$ and the upper bound of all other samples in the current list, where "other samples" include all negative samples and also the positive samples other than sample $i$. In most cases, the inner product similarity of positive samples is greater than that of negative samples, that is,

$$\max_{j \in P \setminus \{i\}} s_j > \max_{k \in N} s_k,$$

so formula (5) can be further written as:

$$\mathcal{L} \approx \sum_{i \in P} \max\Big(0,\ \max_{j \in P \setminus \{i\}} s_j - s_i\Big) \tag{6}$$

It is easy to see that, in the above form, the loss function in most cases optimizes the distance between the current sample $i$ and the upper bound of all positive samples other than $i$. The final effect is to increase the distance between positive samples, while ignoring the really important difference between positive samples and negative samples.

Once the problem with the current softmax function is understood, it can be avoided by slightly modifying the above formula. Specifically, in formula (5) we only need to delete the term for the positive samples other than $i$, that is, compute the distance between the current sample $i$ and the upper bound of all negative samples:

$$\mathcal{L} \approx \sum_{i \in P} \max\Big(0,\ \max_{k \in N} s_k - s_i\Big) \tag{7}$$

Smoothly expanding formula (7) back to softmax, we correspondingly only need to remove from the denominator all positive samples other than the current sample $i$:

$$\mathcal{L} = -\sum_{i \in P} \log \frac{\exp(s_i)}{\exp(s_i) + \sum_{k \in N} \exp(s_k)} \tag{8}$$

This small change in the formula produced an obvious effect in model training, which supports the theoretical analysis. Experiments show that the modified loss function significantly improves the consistency between the rough sorting and fine sorting models (NDCG +0.63pt), and the out-of-scene hitrate also increases by about 0.2%.
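To make the difference concrete, the sketch below compares the two losses on a toy list. It assumes NumPy, and the function names and scores are illustrative only. `softmax_nll` is the plain softmax NLL over the whole list, and `negatives_only_nll` removes the other positive samples from each positive sample's denominator, as described above.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax_nll(scores, labels):
    """Plain listwise loss: every positive competes with all other items in the list."""
    scores = np.asarray(scores, dtype=np.float64)
    pos = np.asarray(labels, dtype=bool)
    return float(np.sum(logsumexp(scores) - scores[pos]))

def negatives_only_nll(scores, labels):
    """Modified loss: each positive competes only with itself and the negatives."""
    scores = np.asarray(scores, dtype=np.float64)
    pos = np.asarray(labels, dtype=bool)
    neg_scores = scores[~pos]
    return float(sum(logsumexp(np.append(neg_scores, s)) - s for s in scores[pos]))

# Hypothetical list: two positives that already score far above two negatives.
scores = [6.0, 6.0, 0.0, 0.0]
labels = [1, 1, 0, 0]

plain = softmax_nll(scores, labels)            # stays near 2 * log(2)
modified = negatives_only_nll(scores, labels)  # close to zero
```

Although both positives are already well separated from the negatives, the plain softmax loss cannot drop below 2·log 2 here because the two positives compete with each other, whereas the modified loss is nearly zero. With exactly one positive in the list the two functions return the same value.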

Adjusting the fine sorting scoring volume based on global analysis

Our global funnel analysis shows that, at the boundary of the scoring volume (the cut-off between the items that rough sorting passes on to fine sorting and those it drops), fine sorting does not necessarily score better than rough sorting. Experiments also show that, with other necessary indicators such as relevance held equal, increasing the number of items scored by fine sorting does not necessarily improve the online effect; excessive scoring can even lower the global hitrate at the exposure stage and thereby reduce the number of online transactions.

This further shows that rough sorting does not have to be consistent with fine sorting. In terms of scoring ability, fine sorting pays more attention to the head and rough sorting to the waist, so the scoring boundary between them needs to be set according to the actual capabilities of the two.
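The effect of the scoring boundary can be illustrated with a toy funnel. The sketch below is an assumption-laden illustration, not the analysis tooling used by the main search: hitrate is taken to be the share of target items (for example, transacted items) that survive to the exposure stage, and the function name, scores and targets are all made up.

```python
import numpy as np

def funnel_hitrate(rough_scores, fine_scores, target_items, scoring_volume, exposure_size):
    """Share of target items still present after rough sorting and then fine sorting.

    scoring_volume: how many top items (by rough score) are handed to fine sorting.
    exposure_size:  how many of those (by fine score) are finally exposed.
    """
    rough = np.asarray(rough_scores, dtype=np.float64)
    fine = np.asarray(fine_scores, dtype=np.float64)
    passed = np.argsort(-rough)[:scoring_volume]
    exposed = passed[np.argsort(-fine[passed])[:exposure_size]]
    targets = set(target_items)
    if not targets:
        return 0.0
    return len(targets & set(exposed.tolist())) / len(targets)

# Hypothetical scores for five items; items 0 and 1 are the ones actually bought.
rough = [0.9, 0.8, 0.7, 0.2, 0.1]
fine = [0.5, 0.6, 0.4, 0.95, 0.1]   # fine sorting overrates waist item 3
targets = [0, 1]

by_volume = {m: funnel_hitrate(rough, fine, targets, m, exposure_size=2) for m in (3, 4)}
# by_volume == {3: 1.0, 4: 0.5}
```

In this example, letting fine sorting score one more item lowers the hitrate at the exposure stage, because the extra item is one that fine sorting misjudges and rough sorting had correctly ranked low. Sweeping `scoring_volume` in this way is one possible means of locating the boundary.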

Summary and Outlook

Compared with the previous rough sorting version, which, as is common in the industry, used only fine sorting samples, the optimizations proposed by the main search since last year (sampling methods tailored to the business scenario, multi-objective loss fusion, and distillation) have together increased search-wide transaction value by about 1.0%. It should be pointed out that we treat the rough->fine loss and the recall->rough loss equally. However, recent offline experiments mostly fall into two cases. In one, the offline hitrate does not increase significantly, that is, the optimization has no effect. In the other, the rough->fine loss increases: the output set of rough sorting does become higher in quality, but the incremental part is not recognized by fine sorting (or only a small part is). This is to some extent expected. Many changes in rough sorting, such as introducing global positive samples that never passed through fine sorting or changing the distribution of randomly sampled negatives, have no counterpart in the current fine sorting experiments; without similar positive and negative samples, fine sorting can hardly score the higher-quality increment from rough sorting correctly. Therefore, subsequent rough sorting optimization should not only pursue lower losses in the two funnel stages. For high-quality products that the analysis shows were dropped by fine sorting, the samples of rough sorting and fine sorting should be aligned to a certain extent according to the specific problem, so that the waist-sorting ability of rough sorting and the head-sorting ability of fine sorting work together while consistency is increased to a certain extent, finally achieving a win-win situation.

This article is shared from the WeChat public account – Big Taobao Technology (AlibabaMTT).