@hhlf282 2026-03-20T01:03:04.000000Z

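AI Product Selection Model Design Document

Goal: Design an AI decision model for product selection that evaluates each product's potential
Version: 1.0
Created: 2026-03-19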

1. Model Architecture

1.1 Overall Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                              Input layer                               │
│  Product features + market data + supply chain data + past performance │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│                       Feature engineering layer                        │
│  - Standardize numeric features                                        │
│  - Encode categorical features                                         │
│  - Extract time features                                               │
│  - Construct cross features                                            │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│                              Model layer                               │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐      │
│  │ Sales prediction │  │ Profit prediction│  │ Risk assessment  │      │
│  │ (regression)     │  │ (regression)     │  │ (classification) │      │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘      │
│  ┌──────────────────┐  ┌──────────────────┐                            │
│  │ Competition model│  │ Composite scoring│                            │
│  │ (regression)     │  │ (ranking)        │                            │
│  └──────────────────┘  └──────────────────┘                            │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│                              Output layer                              │
│  Recommendation list + scores + rationale + risk warnings              │
└────────────────────────────────────────────────────────────────────────┘

1.2 Technology Choices

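Model	Algorithm	Rationale
Sales prediction	XGBoost / LightGBM	Strong on tabular data; feature importances aid interpretation
Profit prediction	Linear Regression + XGBoost	Simple linear baseline combined with a nonlinear model
Risk assessment	Random Forest / XGBoost	Classification problem; robust
Competition assessment	Clustering + Rule-based	Unsupervised grouping plus rules
Composite scoring	Learning to Rank (LambdaMART)	Well suited to ranking problems
Decision explanation	LLM (Qwen3.5-Plus)	Natural-language generation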

2. Feature Engineering

2.1 Feature List

Basic product features

Feature	Type	Description	Source
`price`	numeric	Product price (LKR)	Daraz
`original_price`	numeric	Original price	Daraz
`discount_rate`	numeric	Discount rate	Computed
`category_id`	categorical	Category ID	Daraz
`brand`	categorical	Brand	Daraz
`rating`	numeric	Rating (0-5)	Daraz
`review_count`	numeric	Number of reviews	Daraz
`sold_count`	numeric	Units sold	Daraz
`image_count`	numeric	Number of images	Daraz
`description_length`	numeric	Description length	Daraz

Market features

Feature	Type	Description	Source
`category_avg_price`	numeric	Category average price	Computed
`category_avg_sales`	numeric	Category average sales	Computed
`category_growth_rate`	numeric	Category growth rate	Computed
`search_volume`	numeric	Search volume	Google Trends
`search_trend`	numeric	Search trend (7 days)	Google Trends
`seasonality_score`	numeric	Seasonality score	Computed

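Competition features

Feature	Type	Description	Source
`seller_count`	numeric	Number of sellers	Computed
`top3_concentration`	numeric	Concentration of the top 3 sellers	Computed
`price_variance`	numeric	Price variance	Computed
`avg_rating`	numeric	Category average rating	Computed
`entry_barrier`	numeric	Entry barrier (1-10)	Rule-based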
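
The table marks `top3_concentration` and `price_variance` as computed but gives no formulas. The sketch below is one possible reading, not the document's definition: it assumes concentration is the top three sellers' share of category unit sales, and that price dispersion is expressed as a coefficient of variation (standard deviation divided by mean) so that it is unit-free. The function names and sample numbers are hypothetical.

```python
from statistics import mean, pstdev


def top3_concentration(seller_sales):
    """Share of category unit sales held by the three largest sellers (0-1)."""
    total = sum(seller_sales)
    if total == 0:
        return 0.0
    return sum(sorted(seller_sales, reverse=True)[:3]) / total


def price_dispersion(prices):
    """Coefficient of variation of listing prices (population std / mean)."""
    avg = mean(prices)
    return pstdev(prices) / avg if avg else 0.0


# Hypothetical category with five sellers (unit sales) and five listing prices (LKR)
sales = [500, 300, 100, 60, 40]
prices = [900, 1000, 1100, 1000, 1000]

concentration = top3_concentration(sales)  # (500 + 300 + 100) / 1000
dispersion = price_dispersion(prices)
print(f"top3_concentration={concentration:.2f}, price dispersion={dispersion:.3f}")
```

A unit-free dispersion measure keeps thresholds comparable across categories with very different price levels; a raw variance in LKR would not.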

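Supply chain features

Feature	Type	Description	Source
`source_price`	numeric	1688 purchase price (RMB)	1688 API
`shipping_cost`	numeric	Shipping cost (LKR)	Logistics API
`lead_time`	numeric	Lead time (days)	1688 API
`moq`	numeric	Minimum order quantity	1688 API
`supplier_rating`	numeric	Supplier rating	1688 API

Profit features

Feature	Type	Description	Source
`gross_margin`	numeric	Gross margin	Computed
`roi`	numeric	Return on investment	Computed
`break_even_sales`	numeric	Break-even sales volume	Computed
`commission_rate`	numeric	Platform commission rate	Daraz
`estimated_profit`	numeric	Estimated profit	Computed

Risk features

Feature	Type	Description	Source
`policy_risk`	numeric	Policy risk (1-10)	Rule-based
`quality_risk`	numeric	Quality risk (1-10)	Rule-based
`supply_risk`	numeric	Supply risk (1-10)	Rule-based
`market_risk`	numeric	Market risk (1-10)	Rule-based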

2.2 Feature Processing

# Feature processing pipeline
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler


class FeatureProcessor:
    CATEGORICAL_COLS = ['category_id', 'brand']

    def __init__(self):
        self.scaler = StandardScaler()
        self.label_encoders = {}
        self.imputer = SimpleImputer(strategy='median')

    def fit_transform(self, df):
        df = df.copy()  # do not mutate the caller's frame
        # 1. Build cross features on the raw values; ratios of columns that
        #    were already standardized (mean 0) would be meaningless
        df['price_to_avg_ratio'] = df['price'] / df['category_avg_price']
        df['sales_to_avg_ratio'] = df['sold_count'] / df['category_avg_sales']
        df['rating_diff'] = df['rating'] - df['avg_rating']
        df = df.replace([np.inf, -np.inf], np.nan)  # division by zero becomes missing
        # 2. Impute missing numeric values with the median
        #    (categorical columns are excluded even if stored as integers)
        numeric_cols = df.select_dtypes(include='number').columns.difference(
            self.CATEGORICAL_COLS
        )
        df[numeric_cols] = self.imputer.fit_transform(df[numeric_cols])
        # 3. Standardize numeric features
        df[numeric_cols] = self.scaler.fit_transform(df[numeric_cols])
        # 4. Encode categorical features
        for col in self.CATEGORICAL_COLS:
            if col in df.columns:
                le = LabelEncoder()
                df[col] = le.fit_transform(df[col].astype(str))
                self.label_encoders[col] = le
        return df

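3. Model Design

3.1 Sales Prediction Model

# Sales prediction model
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


class SalesPredictor:
    def __init__(self):
        self.model = xgb.XGBRegressor(
            n_estimators=500,
            max_depth=6,
            learning_rate=0.05,
            subsample=0.8,
            colsample_bytree=0.8,
            random_state=42
        )
        self.residual_std = None  # set in train(), used for the interval

    def train(self, X, y):
        """
        X: feature matrix
        y: actual sales (monthly units sold)
        """
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        self.model.fit(X_train, y_train)
        # Evaluate on the held-out 20%
        y_pred = self.model.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        # np.sqrt(MSE) avoids the `squared=False` argument, which recent
        # scikit-learn releases have deprecated
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)
        print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}")
        # Spread of held-out errors, used by predict_with_confidence
        self.residual_std = float(np.std(np.asarray(y_test) - y_pred))
        # Feature importance
        importance = pd.DataFrame({
            'feature': X.columns,
            'importance': self.model.feature_importances_
        }).sort_values('importance', ascending=False)
        return importance

    def predict(self, X):
        """Predict sales"""
        return self.model.predict(X)

    def predict_with_confidence(self, X):
        """Predict sales with an approximate 95% interval"""
        predictions = self.model.predict(X)
        # Rough interval from the held-out residual spread, assuming roughly
        # normal errors (quantile regression is the better choice in practice).
        # The spread of the predictions themselves is not an error estimate.
        margin = 1.96 * self.residual_std
        return {
            'prediction': predictions,
            'lower_95': predictions - margin,
            'upper_95': predictions + margin
        }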
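
The comment in `predict_with_confidence` points to quantile regression as the better way to obtain intervals. A minimal sketch of that idea follows. It swaps in scikit-learn's `GradientBoostingRegressor` with `loss="quantile"` rather than XGBoost to keep the example small, and it trains on synthetic data, so both the model choice and the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
# Synthetic stand-in for (features, monthly sales); not real data
X = rng.uniform(0, 10, size=(500, 1))
y = 20 * X[:, 0] + rng.normal(0, 15, size=500)


def fit_quantile(alpha):
    """Fit a model that predicts the alpha-quantile of y given X."""
    model = GradientBoostingRegressor(
        loss="quantile", alpha=alpha, n_estimators=100, max_depth=3, random_state=42
    )
    return model.fit(X, y)


# One model per quantile: the 2.5% and 97.5% models bound a 95% interval
lower_model = fit_quantile(0.025)
median_model = fit_quantile(0.5)
upper_model = fit_quantile(0.975)

X_new = np.array([[2.0], [5.0], [8.0]])
lower = lower_model.predict(X_new)
median = median_model.predict(X_new)
upper = upper_model.predict(X_new)
for x, lo, mid, hi in zip(X_new[:, 0], lower, median, upper):
    print(f"x={x:.0f}: {mid:.0f} units (95% interval {lo:.0f} to {hi:.0f})")
```

Unlike a single global margin, each product gets its own interval width, which matters when high-volume products have noisier sales than low-volume ones.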

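Training data requirements:
- Sample size: >10,000 products
- Time span: >6 months
- Features: the 30+ features listed above
- Label: actual monthly sales

Evaluation metrics:
- MAE < 30 (mean absolute error below 30 orders)
- RMSE < 50
- R² > 0.6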
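
To make the three targets concrete, the sketch below computes MAE, RMSE, and R² by hand on five hypothetical products and checks them against the thresholds above. The sales figures are invented for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, math.sqrt(ss_res / n), r2


# Hypothetical monthly sales for five products
actual = [120, 80, 200, 40, 160]
predicted = [100, 90, 170, 60, 150]

mae, rmse, r2 = regression_metrics(actual, predicted)
meets_targets = mae < 30 and rmse < 50 and r2 > 0.6
print(f"MAE={mae:.1f}, RMSE={rmse:.1f}, R2={r2:.3f}, meets targets: {meets_targets}")
```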

3.2 Profit Prediction Model

# Profit prediction model
import xgboost as xgb

# Assumed RMB -> LKR exchange rate; replace with a live rate in production
RMB_TO_LKR = 60


class ProfitPredictor:
    def __init__(self):
        self.model = xgb.XGBRegressor(
            n_estimators=300,
            max_depth=4,
            learning_rate=0.1,
            random_state=42
        )

    def calculate_features(self, product_data):
        """
        Compute profit-related features
        """
        # Revenue
        revenue = product_data['price']
        # Costs
        source_price_lkr = product_data['source_price'] * RMB_TO_LKR
        shipping = product_data['shipping_cost']
        commission = revenue * product_data['commission_rate']
        # Total cost
        total_cost = source_price_lkr + shipping + commission
        profit = revenue - total_cost
        # Gross margin
        gross_margin = profit / revenue
        # ROI: profit over the capital paid up front (purchase + shipping),
        # matching the formula given after this code
        roi = profit / (source_price_lkr + shipping)
        return {
            'gross_margin': gross_margin,
            'roi': roi,
            'total_cost': total_cost,
            'profit_per_unit': profit
        }

    def train(self, X, y):
        """
        X: feature matrix
        y: actual gross margin
        """
        self.model.fit(X, y)

    def predict(self, X):
        """Predict gross margin"""
        return self.model.predict(X)
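Profit formulas:

Gross margin = (selling price - purchase price - shipping - commission) / selling price
ROI = (selling price - purchase price - shipping - commission) / (purchase price + shipping)
Break-even sales = fixed costs / (selling price - variable cost per unit)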
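
A worked example of the three formulas, as a sketch. The product numbers and the fixed-cost figure are hypothetical, the exchange rate of 60 is the document's own assumption, and variable cost per unit is taken to be purchase price plus shipping plus commission.

```python
import math

RMB_TO_LKR = 60  # exchange rate assumed by the document, not a live rate


def unit_economics(price, source_price_rmb, shipping, commission_rate, fixed_costs):
    """Gross margin, ROI, and break-even units for one product (amounts in LKR)."""
    source_price = source_price_rmb * RMB_TO_LKR
    commission = price * commission_rate
    profit = price - source_price - shipping - commission  # profit per unit
    return {
        "profit_per_unit": profit,
        "gross_margin": profit / price,
        "roi": profit / (source_price + shipping),
        # Break-even is undefined when each unit loses money
        "break_even_sales": math.ceil(fixed_costs / profit) if profit > 0 else None,
    }


# Hypothetical product: sells for 6000 LKR, costs 50 RMB, 600 LKR shipping,
# 10% commission, 36000 LKR of fixed costs to recover
result = unit_economics(
    price=6000, source_price_rmb=50, shipping=600, commission_rate=0.10, fixed_costs=36000
)
print(result)
```

Here the unit profit is 6000 - 3000 - 600 - 600 = 1800 LKR, so the gross margin is 30%, the ROI is 1800 / 3600 = 50%, and 20 units recover the fixed costs.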

3.3 Risk Assessment Model

# Risk assessment model (binary classification: high risk / low risk)
import pandas as pd
import xgboost as xgb


class RiskClassifier:
    def __init__(self):
        self.model = xgb.XGBClassifier(
            n_estimators=300,
            max_depth=5,
            learning_rate=0.1,
            scale_pos_weight=5,  # compensate for class imbalance
            random_state=42
        )

    def train(self, X, y):
        """
        X: feature matrix
        y: risk label (0 = low risk, 1 = high risk)
        """
        self.model.fit(X, y)

    def predict(self, X):
        """Predict the risk class"""
        return self.model.predict(X)

    def predict_proba(self, X):
        """Predict the probability of high risk"""
        return self.model.predict_proba(X)[:, 1]  # high-risk probability

    def get_risk_factors(self, X):
        """
        Return the main risk factors
        (global feature importance of the model, not specific to one product)
        """
        importance = pd.DataFrame({
            'feature': X.columns,
            'importance': self.model.feature_importances_
        }).sort_values('importance', ascending=False)
        # Return the top 5 risk factors
        return importance.head(5)
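
The classifier hard-codes `scale_pos_weight=5`. The XGBoost parameter documentation suggests the ratio of negative to positive samples as a typical starting value, so the weight can instead be derived from the training labels. A minimal sketch, using a made-up label distribution:

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) labels, a common starting value."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive (high-risk) samples in labels")
    return negatives / positives


# Hypothetical training set: 850 low-risk and 150 high-risk products
labels = [0] * 850 + [1] * 150
weight = suggested_scale_pos_weight(labels)
print(f"scale_pos_weight ~ {weight:.2f}")
```

The result could then be passed as `scale_pos_weight=weight` when the classifier is constructed, and tuned from there.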

Risk label definition:

def label_risk(product_data):
    """
    Assign a risk label
    """
    risk_score = 0
    # Policy risk (category names must match the values stored in the data)
    if product_data['category'] in ['Pharmaceuticals', 'Food', 'Cosmetics']:
        risk_score += 3  # certification required
    # Market risk
    if product_data['competition_score'] > 8:
        risk_score += 2  # intense competition
    # Supply risk
    if product_data['lead_time'] > 30:
        risk_score += 2  # long lead time
    # Quality risk
    if product_data['supplier_rating'] < 4.0:
        risk_score += 2  # low supplier rating
    # Price risk
    if product_data['price_variance'] > 0.5:
        risk_score += 1  # volatile prices
    # High-risk threshold
    return 1 if risk_score >= 5 else 0

3.4 Competition Assessment Model

# Competition assessment (unsupervised + rules)
class CompetitionAnalyzer:
    def __init__(self):
        pass

    def calculate_competition_score(self, category_data):
        """
        Compute a competition score (1-10); higher means more competitive
        """
        score = 0
        # 1. Number of sellers (1-3 points)
        seller_count = category_data['seller_count']
        if seller_count < 50:
            score += 1
        elif seller_count < 200:
            score += 2
        else:
            score += 3
        # 2. Top-seller concentration (1-3 points)
        top3_concentration = category_data['top3_concentration']
        if top3_concentration < 0.3:
            score += 1  # fragmented, easy to enter
        elif top3_concentration < 0.6:
            score += 2
        else:
            score += 3  # concentrated, hard to enter
        # 3. Price-war intensity (0-2 points); the original awarded the
        #    points to stable prices, which contradicted this heading
        price_variance = category_data['price_variance']
        if price_variance < 0.2:
            score += 0  # stable prices
        elif price_variance < 0.4:
            score += 1
        else:
            score += 2  # intense price war
        # 4. Entry barrier (0-1.8 points for barriers from 10 down to 1)
        entry_barrier = category_data['entry_barrier']
        score += (10 - entry_barrier) / 5  # lower barrier, higher score
        return min(10, max(1, score))

    def get_competition_level(self, score):
        """
        Convert the score to a competition level
        """
        if score <= 3:
            return "Low competition (blue ocean)"
        elif score <= 6:
            return "Moderate competition"
        else:
            return "High competition (red ocean)"
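3.5 Composite Scoring Model

# Composite scoring model
# The scoring below is a fixed weighted sum of sub-scores. The
# GradientBoostingRegressor is kept as a placeholder for a learned ranker
# and is not used yet.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


class ProductRanker:
    def __init__(self, sales_model=None, risk_model=None, feature_cols=None):
        # Trained SalesPredictor / RiskClassifier and the columns they expect
        self.sales_model = sales_model
        self.risk_model = risk_model
        self.feature_cols = feature_cols
        self.model = GradientBoostingRegressor(
            n_estimators=200,
            max_depth=5,
            learning_rate=0.1,
            random_state=42
        )
        # Weights
        self.weights = {
            'market_demand': 0.30,      # market demand 30%
            'profitability': 0.25,      # profit potential 25%
            'competition': 0.20,        # competition 20%
            'supply_chain': 0.15,       # supply chain 15%
            'risk': 0.10                # risk 10%
        }

    def _model_input(self, product_features):
        """Build a one-row DataFrame with the columns the models were trained on"""
        row = pd.DataFrame([product_features])
        return row[self.feature_cols] if self.feature_cols else row

    def calculate_subscores(self, product_features):
        """
        Compute the sub-score for each dimension (0-10)
        """
        scores = {}
        X = self._model_input(product_features)
        # Market demand (from predicted sales)
        predicted_sales = self.sales_model.predict(X)[0]
        scores['market_demand'] = max(0, min(10, predicted_sales / 10))  # 100 orders = 10 points
        # Profitability (from gross margin)
        gross_margin = product_features['gross_margin']
        scores['profitability'] = max(0, min(10, gross_margin * 20))  # 50% margin = 10 points
        # Competition (inverted: less competition scores higher)
        competition_score = product_features['competition_score']
        scores['competition'] = 10 - competition_score
        # Supply chain: up to 7 points for supplier rating, up to 3 for lead time
        supplier_rating = product_features['supplier_rating']
        lead_time = product_features['lead_time']
        # Capped at 1 so lead times under 30 days cannot push the score past 10
        lead_time_factor = min(1.0, 30 / max(lead_time, 1))
        scores['supply_chain'] = (supplier_rating / 5) * 7 + lead_time_factor * 3
        # Risk (inverted: lower risk scores higher).
        # RiskClassifier.predict_proba already returns the high-risk probability
        risk_prob = self.risk_model.predict_proba(X)[0]
        scores['risk'] = 10 * (1 - risk_prob)
        return scores

    def calculate_total_score(self, product_features):
        """
        Compute the composite score
        """
        subscores = self.calculate_subscores(product_features)
        total_score = sum(
            subscores[k] * self.weights[k]
            for k in self.weights
        )
        return {
            'total_score': total_score,
            'subscores': subscores,
            'recommendation': self.get_recommendation(total_score)
        }

    def get_recommendation(self, score):
        """
        Map the score to a recommendation level
        """
        if score >= 8:
            return "Strongly recommended"
        elif score >= 6:
            return "Recommended"
        elif score >= 4:
            return "Consider with caution"
        else:
            return "Not recommended"

    def rank_products(self, products):
        """
        Rank multiple products
        """
        scored_products = []
        for product in products:
            result = self.calculate_total_score(product)
            result['product'] = product
            scored_products.append(result)
        # Sort by composite score, highest first
        scored_products.sort(key=lambda x: x['total_score'], reverse=True)
        return scored_products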
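
A worked example of the weighted sum, as a sketch with hypothetical sub-scores. The weights and thresholds are the ones configured in `ProductRanker`; the sub-score values are invented.

```python
WEIGHTS = {
    "market_demand": 0.30,
    "profitability": 0.25,
    "competition": 0.20,
    "supply_chain": 0.15,
    "risk": 0.10,
}
# (minimum score, label) pairs checked from highest to lowest
LEVELS = [(8, "Strongly recommended"), (6, "Recommended"), (4, "Consider with caution")]


def total_score(subscores):
    """Weighted sum of the five sub-scores (each on a 0-10 scale)."""
    return sum(subscores[name] * weight for name, weight in WEIGHTS.items())


def recommendation(score):
    for minimum, label in LEVELS:
        if score >= minimum:
            return label
    return "Not recommended"


# Hypothetical product: strong demand and margin, but a crowded category
subscores = {
    "market_demand": 8.0,
    "profitability": 7.0,
    "competition": 3.5,   # 10 minus a competition score of 6.5
    "supply_chain": 6.0,
    "risk": 9.0,
}
score = total_score(subscores)
level = recommendation(score)
print(f"{score:.2f} -> {level}")
```

The total is 2.40 + 1.75 + 0.70 + 0.90 + 0.90 = 6.65, which falls in the "Recommended" band. Because the weights sum to 1, the total stays on the same 0-10 scale as the sub-scores.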

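4. LLM Decision Explanation

4.1 Explanation Prompt

# Use an LLM to explain the product selection recommendation
def generate_recommendation_explanation(product_data, scores):
    """
    Generate a natural-language product selection recommendation
    """
    prompt = f"""
You are a cross-border e-commerce product selection expert. Based on the data below, write a product selection recommendation.
[Product information]
- Product: {product_data['title']}
- Category: {product_data['category_name']}
- Price: {product_data['price']} LKR
- Purchase price: {product_data['source_price']} RMB
[Evaluation results]
- Overall score: {scores['total_score']:.1f}/10
- Market demand: {scores['subscores']['market_demand']:.1f}/10
- Profit potential: {scores['subscores']['profitability']:.1f}/10
- Competition: {scores['subscores']['competition']:.1f}/10 (a lower score means fiercer competition)
- Supply chain: {scores['subscores']['supply_chain']:.1f}/10
- Risk: {scores['subscores']['risk']:.1f}/10 (a higher score means lower risk)
[Recommendation level] {scores['recommendation']}
Please produce:
1. Reasons to recommend (2-3 items, grounded in the data)
2. Risk warnings (1-2 items)
3. Operational suggestions (1-2 items)
Requirements:
- Concise: no more than 50 words per item
- Grounded in the data, nothing vague
- Professional but easy to understand
"""
    # Call the LLM (llm_client is assumed to be a configured client defined elsewhere)
    response = llm_client.generate(prompt)
    return response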

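4.2 Sample Output

[Product selection recommendation] Wireless Bluetooth earphones
✅ Reasons to recommend:
1. Search volume is growing 25% month over month, indicating strong demand
2. Gross margin of 35%, above the category average (25%)
3. Low top-seller concentration leaves room for new sellers
⚠️ Risk warnings:
1. Moderate competition (6.5/10); differentiation is needed
2. Quality risk typical of electronics; test with a small batch first
💡 Operational suggestions:
1. Stock 200 units for the first order to test the market response
2. Emphasize "long battery life" as the differentiating selling point
3. List before the November peak to catch the shopping season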

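5. Model Training Workflow

5.1 Data Preparation

# Data preparation pipeline
import pandas as pd


def prepare_training_data():
    """
    Prepare the training data
    """
    # 1. Load data from the database (db_connection is assumed to be defined elsewhere)
    query = """
    SELECT 
        p.*,
        ph.avg_price as category_avg_price,
        ph.avg_sales as category_avg_sales,
        ph.seller_count,
        s.source_price,
        s.shipping_cost,
        s.lead_time,
        s.supplier_rating
    FROM products p
    LEFT JOIN category_stats ph ON p.category_id = ph.category_id
    LEFT JOIN supplier_data s ON p.source_url = s.url
    WHERE p.crawl_time > NOW() - INTERVAL '12 months'
    """
    df = pd.read_sql(query, db_connection)
    # 2. Generate labels from the raw values, before any standardization.
    #    Assumes gross_margin and the fields read by label_risk
    #    (category, competition_score, price_variance) are present in df.
    sales_label = df['sold_count'].copy()          # sales label
    profit_label = df['gross_margin'].copy()       # profit label
    risk_label = df.apply(label_risk, axis=1)      # risk label
    # 3. Feature engineering
    #    Note: fitting on the full frame lets test-period statistics leak
    #    into training; fit on the training period only where possible.
    processor = FeatureProcessor()
    df_processed = processor.fit_transform(df)
    df_processed['sales_label'] = sales_label
    df_processed['profit_label'] = profit_label
    df_processed['risk_label'] = risk_label
    # 4. Time-based train/test split
    train_df = df_processed[df_processed['crawl_time'] < '2025-12-01']
    test_df = df_processed[df_processed['crawl_time'] >= '2025-12-01']
    return train_df, test_df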
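
The split above is time-based, which raises the question of when scaling statistics should be fitted. A minimal sketch of one leakage-free order of operations, in plain Python: split by the cutoff first, compute the mean and standard deviation on the training rows only, then apply them to both sets. The rows and prices are hypothetical; only the cutoff date comes from the code above.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical crawled rows
rows = [
    {"crawl_time": date(2025, 9, 15), "price": 1000.0},
    {"crawl_time": date(2025, 10, 20), "price": 2000.0},
    {"crawl_time": date(2025, 11, 5), "price": 3000.0},
    {"crawl_time": date(2025, 12, 10), "price": 4000.0},
    {"crawl_time": date(2026, 1, 8), "price": 5000.0},
]
CUTOFF = date(2025, 12, 1)

# 1. Split by time: everything before the cutoff trains, the rest tests
train = [r for r in rows if r["crawl_time"] < CUTOFF]
test = [r for r in rows if r["crawl_time"] >= CUTOFF]

# 2. Fit the scaling statistics on the training rows only
train_prices = [r["price"] for r in train]
mu = mean(train_prices)
sigma = pstdev(train_prices)


def standardize(value):
    """Apply the training-set statistics to any row, train or test."""
    return (value - mu) / sigma


# 3. Transform both sets with the same training statistics
train_scaled = [standardize(r["price"]) for r in train]
test_scaled = [standardize(r["price"]) for r in test]
print(train_scaled, test_scaled)
```

The test values land well above zero because later prices are higher than anything in the training window, which is exactly the drift a time-based evaluation is meant to expose.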

5.2 Training Workflow

# Full training workflow
import os

import joblib
from sklearn.metrics import accuracy_score, mean_absolute_error


def train_all_models():
    """
    Train all models
    """
    # 1. Prepare data
    train_df, test_df = prepare_training_data()
    # 2. Define feature columns
    feature_cols = [
        'price', 'discount_rate', 'rating', 'review_count',
        'category_avg_price', 'category_avg_sales',
        'seller_count', 'top3_concentration',
        'source_price', 'shipping_cost', 'lead_time', 'supplier_rating',
        'gross_margin', 'roi'
    ]
    # The profit label is gross_margin, so the profit model must not see
    # gross_margin or roi as inputs (target leakage)
    profit_cols = [c for c in feature_cols if c not in ('gross_margin', 'roi')]
    X_train = train_df[feature_cols]
    X_test = test_df[feature_cols]
    # 3. Train the sales prediction model
    print("Training sales prediction model...")
    sales_model = SalesPredictor()
    sales_importance = sales_model.train(X_train, train_df['sales_label'])
    # 4. Train the profit prediction model
    print("Training profit prediction model...")
    profit_model = ProfitPredictor()
    profit_model.train(X_train[profit_cols], train_df['profit_label'])
    # 5. Train the risk classification model
    print("Training risk classification model...")
    risk_model = RiskClassifier()
    risk_model.train(X_train, train_df['risk_label'])
    # 6. Evaluate on the held-out period
    print("\n=== Model evaluation ===")
    # Sales prediction
    sales_pred = sales_model.predict(X_test)
    sales_mae = mean_absolute_error(test_df['sales_label'], sales_pred)
    print(f"Sales prediction MAE: {sales_mae:.2f}")
    # Profit prediction
    profit_pred = profit_model.predict(X_test[profit_cols])
    profit_mae = mean_absolute_error(test_df['profit_label'], profit_pred)
    print(f"Profit prediction MAE: {profit_mae:.2f}")
    # Risk assessment
    risk_pred = risk_model.predict(X_test)
    risk_accuracy = accuracy_score(test_df['risk_label'], risk_pred)
    print(f"Risk assessment accuracy: {risk_accuracy:.2%}")
    # 7. Save the models
    print("\nSaving models...")
    os.makedirs('models', exist_ok=True)
    joblib.dump(sales_model, 'models/sales_model.pkl')
    joblib.dump(profit_model, 'models/profit_model.pkl')
    joblib.dump(risk_model, 'models/risk_model.pkl')
    return {
        'sales_model': sales_model,
        'profit_model': profit_model,
        'risk_model': risk_model,
        'metrics': {
            'sales_mae': sales_mae,
            'profit_mae': profit_mae,
            'risk_accuracy': risk_accuracy
        }
    }

6. Model Deployment

6.1 API Service

# FastAPI service
from typing import Optional

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the trained models once at startup and attach them to a shared ranker
ranker = ProductRanker()
ranker.sales_model = joblib.load('models/sales_model.pkl')
ranker.risk_model = joblib.load('models/risk_model.pkl')


class ProductInput(BaseModel):
    title: str
    price: float
    category_id: str
    source_url: str
    # ... other fields


class ProductRecommendation(BaseModel):
    product_id: str
    total_score: float
    recommendation: str
    reasons: list[str]
    risks: list[str]
    suggestions: list[str]


@app.post("/api/v1/evaluate")
async def evaluate_product(product: ProductInput) -> ProductRecommendation:
    """
    Evaluate a single product
    """
    # 1. Build features
    features = get_product_features(product)
    # 2. Compute scores
    result = ranker.calculate_total_score(features)
    # 3. Generate the explanation. The function in section 4.1 returns raw
    #    text; the fields below assume it has been parsed into a dict with
    #    'reasons', 'risks', and 'suggestions'.
    explanation = generate_recommendation_explanation(features, result)
    return ProductRecommendation(
        product_id=product.title,  # ProductInput has no ID field, so the title stands in
        total_score=result['total_score'],
        recommendation=result['recommendation'],
        reasons=explanation['reasons'],
        risks=explanation['risks'],
        suggestions=explanation['suggestions']
    )


@app.get("/api/v1/recommendations")
async def get_recommendations(
    category: Optional[str] = None,
    price_min: Optional[float] = None,
    price_max: Optional[float] = None,
    limit: int = 20
) -> list[ProductRecommendation]:
    """
    Return the list of recommended products
    """
    # 1. Fetch candidate products from the database
    candidates = get_candidate_products(category, price_min, price_max, limit)
    # 2. Score in batch
    scored = ranker.rank_products(candidates)
    # 3. Return the top N
    return [
        ProductRecommendation(
            product_id=s['product']['item_id'],
            total_score=s['total_score'],
            recommendation=s['recommendation'],
            # ...
        )
        for s in scored[:limit]
    ]

6.2 Model Updates

# Scheduled model updates
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()


@scheduler.scheduled_job('cron', day_of_week='mon', hour=2)
def weekly_model_update():
    """
    Update the models weekly (Mondays at 02:00)
    """
    print("Starting weekly model update...")
    # 1. Fetch new data
    new_data = get_new_training_data()
    # 2. Incremental training
    models = load_models()
    models = incremental_train(models, new_data)
    # 3. Evaluate
    metrics = evaluate_models(models)
    # 4. Save the new models only if performance improved
    if metrics['improvement'] > 0.05:  # improvement above 5%
        save_models(models)
        print("Models updated")
    else:
        print("No sufficient improvement; skipping update")


scheduler.start()
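
The update job relies on `metrics['improvement']` without saying how it is computed. One plausible definition, offered here as an assumption rather than the document's method, is the relative reduction in held-out MAE between the current model and the retrained one. The function names and MAE values below are hypothetical.

```python
def relative_improvement(old_mae, new_mae):
    """Fractional reduction in MAE; positive means the new model is better."""
    if old_mae <= 0:
        raise ValueError("old_mae must be positive")
    return (old_mae - new_mae) / old_mae


def should_promote(old_mae, new_mae, min_improvement=0.05):
    """Promote the retrained model only if it beats the current one by the margin."""
    return relative_improvement(old_mae, new_mae) > min_improvement


# Hypothetical held-out MAE values before and after retraining
clear_win = should_promote(old_mae=30.0, new_mae=27.0)    # 10% better
marginal = should_promote(old_mae=30.0, new_mae=29.0)     # about 3.3% better
print(clear_win, marginal)
```

Both models should be scored on the same held-out set, otherwise the comparison mixes model quality with changes in the data.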

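7. Evaluation and Monitoring

7.1 Offline Evaluation

Model	Metric	Target	Current
Sales prediction	MAE	<30	-
Sales prediction	R²	>0.6	-
Profit prediction	MAE	<15%	-
Risk assessment	Accuracy	85%	-
Risk assessment	Recall	80%	-

7.2 Online Evaluation

Metric	Definition	Target
Recommendation adoption rate	Share of AI recommendations adopted by the operations team	60%
Selection success rate	Share of AI-recommended products that succeed (monthly sales >100)	50%
Average gross margin	Average gross margin of AI-recommended products	25%
ROI	Return on investment of AI-recommended products	2.0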
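
The first two online metrics can be computed directly from a log of recommendations. The sketch below assumes a hypothetical record layout (`adopted`, `monthly_sales`) and measures the success rate over adopted products only, since products that were never listed have no sales to evaluate; both choices are assumptions.

```python
def adoption_rate(recommendations):
    """Share of AI recommendations that the operations team adopted."""
    if not recommendations:
        return 0.0
    adopted = sum(1 for r in recommendations if r["adopted"])
    return adopted / len(recommendations)


def success_rate(recommendations, min_monthly_sales=100):
    """Share of adopted products whose monthly sales exceed the threshold."""
    adopted = [r for r in recommendations if r["adopted"]]
    if not adopted:
        return 0.0
    successes = sum(1 for r in adopted if r["monthly_sales"] > min_monthly_sales)
    return successes / len(adopted)


# Hypothetical recommendation log
log = [
    {"adopted": True, "monthly_sales": 150},
    {"adopted": True, "monthly_sales": 80},
    {"adopted": False, "monthly_sales": None},
    {"adopted": True, "monthly_sales": 220},
    {"adopted": False, "monthly_sales": None},
]
adoption = adoption_rate(log)
success = success_rate(log)
print(f"adoption rate {adoption:.0%}, success rate {success:.0%}")
```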

7.3 Monitoring and Alerts

# Model performance monitoring
def monitor_model_performance():
    """
    Monitor model performance
    """
    # 1. Prediction accuracy
    recent_predictions = get_recent_predictions(days=7)
    actual_results = get_actual_results(recent_predictions)
    mae = calculate_mae(recent_predictions, actual_results)
    if mae > 50:  # error above threshold
        send_alert(f"Sales prediction MAE too high: {mae:.2f}")
    # 2. Recommendation adoption rate
    adoption_rate = calculate_adoption_rate(days=7)
    if adoption_rate < 0.4:  # adoption below 40%
        send_alert(f"AI recommendation adoption rate too low: {adoption_rate:.2%}")
    # 3. Selection success rate
    success_rate = calculate_success_rate(days=30)
    if success_rate < 0.3:  # success rate below 30%
        send_alert(f"Selection success rate too low: {success_rate:.2%}")

Maintainer: AI team