@mShuaiZhao 2017-11-17T13:56:06.000000Z

Data mining

exam

01. Intro

02. Getting to Know your data

  1. Data objects and attribute types

    • Types of data sets
      • Record
      • Graph and network
      • Ordered
        • video data
      • Spatial, image and multimedia
    • important characteristics of structured data
      • dimensionality
      • sparsity
      • resolution (patterns depend on the scale)
      • Distribution
    • Data objects
      • A data object represents an entity.
      • Also called samples, examples, instances, data points
      • Data objects are described by attributes.
    • Attributes (or dimensions, features, variables)
      • Attribute Types
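        • Nominal (categories or names)
        • Ordinal (ordered categories)
          Values have a meaningful order.
        • Binary
          Nominal attribute with only 2 states.
          (Symmetric vs. asymmetric: in the asymmetric case the two outcomes are not equally important.)
          Convention: assign 1 to the more important outcome (e.g., HIV positive)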
      • Numeric attribute types
        • Quantity(integer or real-valued)
        • Interval
          • Measured on a scale of equal-sized units
          • No true zero-point
        • Ratio
          • Inherent zero-point
      • Discrete vs. Continuous Attributes
    • Basic statistical Descriptions of data
      • measuring the central tendency
        • mean
        • mode
          Value that occurs most frequently in the data.
        • Median
          Empirical formula (moderately skewed unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
      • Measuring the Dispersion of data
        • Quartiles, outliers and boxplots
          • Quartiles: $Q_1$ (25th percentile), $Q_3$ (75th percentile)
            Inter-quartile range: $IQR = Q_3 - Q_1$
            Five number summary: min, $Q_1$, median, $Q_3$, max
        • Variance and standard deviation
      • Boxplot Analysis
        • whiskers
          extend to the minimum and maximum values
        • outliers
          a value more than 1.5 × IQR above the third quartile or below the first quartile
      • Properties of Normal Distribution Curve
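As a rough illustration of the dispersion measures above, the sketch below computes a five-number summary and flags outliers with the common 1.5 × IQR rule. The sample values, the function names, and the choice of the "inclusive" quartile method in Python's `statistics.quantiles` are assumptions made for this example; other quartile conventions give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles (linear interpolation between data points).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, statistics.median(data), q3, max(data)


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(salaries))
print("modes:", statistics.multimode(salaries))
print("five-number summary:", five_number_summary(salaries))
print("outliers:", iqr_outliers(salaries))
```

With this sample the largest value is the only one outside the fences, which is the point a boxplot would draw individually.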
    • Graphic displays
      • Histogram Analysis
        Histograms often tell more than boxplots.
      • Quantile plot
      • Quantile-Quantile (Q-Q) plot
      • Scatter plot
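    • Data visualization
      • Pixel-oriented visualization techniques
        The value of each data point is mapped to the color of a pixel.
      • Geometric projection visualization techniques
        • scatterplot matrices
          k-dimensional data is laid out as a k-by-k matrix of scatter plots, one per pair of attributes.
        • Landscapes
        • parallel coordinates
          One axis per attribute; the axes are parallel and equally spaced, and each data point is drawn as a polyline connecting its values on the attribute axes.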
      • Icon-Based visualization techniques
        • Chernoff Faces
          Different facial features correspond to different attributes, e.g., head eccentricity, eye size, pupil size.
          Each data point ends up as a different face.
        • Stick Figure
      • Hierarchical visualization techniques
        • Dimensional stacking
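          Data is represented by stacking (nesting): for example, attributes 1 and 2 form two axes, and inside each cell of the grid they define, a coordinate system formed by attributes 3 and 4 is nested.
        • worlds-within-worlds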
        • tree-map
        • InfoCube
        • Visualizing complex data and relations
    • Measuring data similarity and dissimilarity

      • Similarity
        Numerical measure of how alike two data objects are.
        Often falls in the range [0, 1]
        The higher the value, the more alike the objects.
      • Dissimilarity
        Numerical measure of how different two data objects are.
        The higher the value, the more different the objects.
      • Proximity refers to either a similarity or a dissimilarity
      • Data matrix
      • Dissimilarity matrix
        A triangular matrix.
        It is symmetric, so only one triangle needs to be stored.
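      • Proximity measure for nominal attributes
        • Simple matching
          $d(i, j) = \frac{p - m}{p}$
          $p$ is the total number of attributes, $m$ is the number of attributes on which the two objects match.
      • Proximity measure for binary attributes
        • A contingency table for binary data, for two objects $i$ (rows) and $j$ (columns)

          i \ j   1   0
            1     q   r
            0     s   t

        • Distance measure for symmetric binary variables
          $d(i, j) = \frac{r + s}{q + r + s + t}$
        • Distance measure for asymmetric binary variables
          $d(i, j) = \frac{r + s}{q + r + s}$
        • Jaccard coefficient (similarity measure for asymmetric binary variables)
          $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$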
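The binary proximity measures can be sketched in Python as follows. This is only an illustration: the function names and the two example objects are hypothetical, and the counts follow the usual reading of the contingency table (q: both objects are 1, r: only the first is 1, s: only the second is 1, t: both are 0).

```python
def contingency(x, y):
    """Count (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(x, y):
    """Both states count equally: (r + s) / (q + r + s + t)."""
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(x, y):
    """Joint absences (t) are ignored: (r + s) / (q + r + s)."""
    q, r, s, _ = contingency(x, y)
    return (r + s) / (q + r + s)  # undefined when both vectors are all zeros


def jaccard_similarity(x, y):
    """Similarity counterpart of the asymmetric measure: q / (q + r + s)."""
    q, r, s, _ = contingency(x, y)
    return q / (q + r + s)  # undefined when both vectors are all zeros


# Hypothetical objects, each described by six binary attributes.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]

print(contingency(obj_i, obj_j))
print(symmetric_dissimilarity(obj_i, obj_j))
print(asymmetric_dissimilarity(obj_i, obj_j))
print(jaccard_similarity(obj_i, obj_j))
```

Note that the Jaccard similarity and the asymmetric dissimilarity always sum to 1, since they share the same denominator.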

      • Standardizing Numeric data

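        • Z-score
          $z = \frac{x - \mu}{\sigma}$
          Subtract the mean from the raw value and divide by the standard deviation.
          After standardization the mean is 0 and the standard deviation is 1.
          • An alternative way: compute the mean absolute deviation $s_f = \frac{1}{n} \sum_{i=1}^{n} |x_{if} - m_f|$ and use it in place of the standard deviation, $z_{if} = \frac{x_{if} - m_f}{s_f}$.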
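A small Python sketch of both standardizations may help here. The sample data and function names are hypothetical, and the sketch assumes the population standard deviation (`statistics.pstdev`); a sample standard deviation would give slightly different scores.

```python
import statistics


def z_scores(values):
    """Standardize with the mean and the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, deviation
    return [(v - mean) / std for v in values]  # fails if all values are equal


def mad_scores(values):
    """Standardize with the mean absolute deviation instead of the standard deviation."""
    mean = statistics.fmean(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]  # fails if all values are equal


# Hypothetical sample used only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(sample)
z_mad = mad_scores(sample)

print(z)
print(z_mad)
```

The mean absolute deviation does not square the differences, so extreme values inflate it less than they inflate the standard deviation, and their standardized scores stay further from zero.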
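      • Minkowski distance
        This is the $L_h$ norm of the difference between the two objects ($h \ge 1$).

        $d(i, j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}$

        Non-negative, symmetric, and satisfies the triangle inequality.
        A distance with these three properties can serve as a metric.
        Special cases

        • Manhattan distance ($h = 1$)
          Sum of the absolute differences: $d(i, j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$
        • Euclidean distance ($h = 2$)
          The ordinary distance of Euclidean geometry: $d(i, j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$
        • "supremum" distance ($h \to \infty$)
          The largest per-attribute difference dominates: $d(i, j) = \max_{f} |x_{if} - x_{jf}|$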
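The three special cases can be checked with one small function. This is a minimal sketch with hypothetical names and points; `h` is the order of the norm, and `math.inf` is used to select the supremum distance.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        # Limit h -> infinity: only the largest difference survives.
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


# Hypothetical two-dimensional points.
x1, x2 = (1, 2), (3, 5)

manhattan = minkowski(x1, x2, 1)
euclidean = minkowski(x1, x2, 2)
supremum = minkowski(x1, x2, math.inf)

print(manhattan, euclidean, supremum)
```

For any pair of points the distance shrinks (or stays equal) as `h` grows, so the supremum distance is never larger than the Euclidean one, which is never larger than the Manhattan one.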
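      • Ordinal Variables
        Order is important.
        Replace each value by its rank $r_{if} \in \{1, \dots, M_f\}$, map the rank onto [0, 1] with $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, then compute the dissimilarity as for numeric attributes.
      • Attributes of Mixed Type
        Objects may mix binary, numeric, and ordinal attributes; convert each type separately, then combine the per-attribute results into one dissimilarity.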
      • Cosine similarity
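For reference, cosine similarity between two vectors is $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, the dot product divided by the product of the vector lengths. A minimal Python sketch follows, with hypothetical term-frequency vectors and a hypothetical function name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical term-frequency vectors for two documents.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

print(round(cosine_similarity(doc1, doc2), 4))
```

Because only the angle matters, scaling a vector (for example, doubling every term count) leaves the similarity unchanged.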
    • Summary

03. Data Preprocessing

06. Mining Frequent patterns, association and correlations

basic concepts

Method

appendix

  1. words record
    • discipline n. discipline; punishment; field of study. vt. to train
    • workshop [computing] topical study group, seminar
    • diversity n. difference, variety
    • Hierarchical adj. layered, tiered
  2. Conferences and Journals
    • Conf.
      • KDD
      • ICDM
    • Journal
      • TKDE