  • Dimensionality and high dimensional data: definition, examples, curse of..

    Dimensionality in statistics refers to how many attributes a dataset has. For example, healthcare data is notorious for having a vast number of variables (e.g. blood pressure, weight, cholesterol level). In an ideal world, this data could be represented in a spreadsheet, with one column representing each dimension. In practice, this is difficult to do, in part because many variables are interrelated (like weight and blood pressure).

    Note: Dimensionality means something slightly different in other areas of mathematics and science. For example, in physics, dimensionality can usually be expressed in terms of fundamental dimensions like mass, time, or length. In matrix algebra, two units of measure have the same dimensionality if both statements are true:

    1. A function exists that maps one variable onto another variable.
    2. The inverse of the function in (1) does the reverse.

    High Dimensional Data

    High dimensional means that the number of dimensions is staggeringly high, so high that calculations become extremely difficult. With high dimensional data, the number of features can exceed the number of observations. For example, microarrays, which measure gene expression, can contain tens to hundreds of samples, and each sample can contain tens of thousands of genes.
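
    One practical consequence of having more features than observations can be sketched with pure noise. The example below is only an illustration under assumed sizes (10 samples, 5000 "genes", both hypothetical and not taken from any real microarray): no feature is actually related to the outcome, yet with this many features some of them tend to correlate strongly with it by chance.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)          # fixed seed, chosen arbitrarily
n_samples, n_genes = 10, 5000   # hypothetical sizes: far more features than observations

# Every "gene" and the outcome are pure noise, so no real relationship exists.
outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
genes = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_genes)]

abs_corr = [abs(pearson(g, outcome)) for g in genes]
best_few = max(abs_corr[:5])    # strongest correlation among the first 5 features
best_all = max(abs_corr)        # strongest correlation among all 5000 features

print(f"features={n_genes}, observations={n_samples}")
print(f"max |r| over 5 features:    {best_few:.2f}")
print(f"max |r| over 5000 features: {best_all:.2f}")
```

    The maximum over all features can never be smaller than the maximum over the first five, and with so few observations it is usually much larger, which is one reason naive feature selection is risky in this setting.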

    1. What is the dimension of a time series?

    Classification of time series is a somewhat tricky matter. Most classification algorithms have an implicit assumption that the data you are classifying are stationary, and they usually work in vector spaces.

    So there are two "things" that can be multidimensional here: your original time series and the result of your preprocessing before feeding data to a classifier.

    To answer your question directly: a time series is multidimensional if it measures more than one variable over time. It is not multidimensional merely because of its length.
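
    The distinction between length and dimension can be made concrete with a small sketch. The storage layout (one tuple per time step), the helper name `series_shape`, and the toy values are all assumptions made for illustration.

```python
def series_shape(series):
    """Return (length, dimension) for a series stored as one tuple per time step."""
    length = len(series)
    dimension = len(series[0]) if length else 0
    return length, dimension

# Univariate: one variable (say, temperature) measured at 1000 time steps.
temperature = [(20.0 + 0.01 * t,) for t in range(1000)]

# Multivariate: three variables measured together at only 10 time steps.
weather = [(20.0 + t, 50.0 - t, 1013.0) for t in range(10)]

print(series_shape(temperature))  # (1000, 1): long, but still one-dimensional
print(series_shape(weather))      # (10, 3): short, but three-dimensional
```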
     
    How would you go about classifying time series? Well, it depends on your intent, on the nature of the process you are measuring, etc. But in general terms, you will split your time series into small fragments and construct a multi-dimensional vector that represents each fragment, or you will fit a model (autoregressive, splines, whatever) and use the fitted parameters of the model as the vector representing that fragment. Additionally, you may synthesize new time series from the first one (derivatives, integrals, filtered versions) and build a truly multi-dimensional time series, which you will still need to preprocess.
     
    The key is that classifiers will, in general, not treat time explicitly. You have to hide the temporal dimension of your time series and find a way to encode it in a single vector.
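
    The fragment-and-encode idea described above might look roughly like the following sketch. The function names, the window size, and the choice of features (mean, variance, mean first difference, and a least-squares AR(1) coefficient without intercept) are assumptions made for illustration, not a prescribed recipe. Windows are assumed to hold at least two points.

```python
def split_into_windows(series, size):
    """Cut a univariate series into consecutive, non-overlapping fragments."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]

def ar1_coefficient(window):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] (no intercept)."""
    num = sum(a * b for a, b in zip(window[1:], window[:-1]))
    den = sum(a * a for a in window[:-1])
    return num / den if den else 0.0

def encode(window):
    """Encode one fragment as a fixed-length vector with no explicit time axis."""
    n = len(window)
    mean = sum(window) / n
    variance = sum((x - mean) ** 2 for x in window) / n
    first_diff = [b - a for a, b in zip(window, window[1:])]  # discrete derivative
    mean_slope = sum(first_diff) / len(first_diff)
    return [mean, variance, mean_slope, ar1_coefficient(window)]

# Toy series that decays geometrically: x[t] = 0.5 * x[t-1].
series = [64.0 * 0.5 ** t for t in range(12)]
vectors = [encode(w) for w in split_into_windows(series, size=4)]
for v in vectors:
    print([round(x, 4) for x in v])
```

    Each fragment becomes a four-element vector that an ordinary vector-space classifier could consume. A trailing fragment shorter than `size` is simply dropped in this sketch.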

    Supplementary knowledge:

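    1. Downsampling.

    2. Curse of dimensionality

    As the number of dimensions increases, the volume of the space grows so quickly that the available data become sparse. Sparsity is a problem for any method that requires statistical significance: to obtain a statistically sound and reliable result, the amount of data needed to support it often grows exponentially with the dimensionality.

    Source: Wikipedia

    3. Abbreviation iid: independent and identically distributed random variables.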
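
    The sparsity described under the curse of dimensionality can be sketched by counting occupied cells. In this illustration (the function name, point count, and grid are assumptions), a fixed number of iid uniform points is dropped into the unit cube, each axis is cut in half, and the fraction of the resulting 2^d cells that contain at least one point is reported.

```python
import random

def occupied_fraction(dim, n_points=1000, seed=0):
    """Fraction of the 2**dim half-width cells of the unit cube holding at least one point."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Each coordinate is an iid Uniform(0, 1) draw; record which half it falls in.
        cell = tuple(rng.random() < 0.5 for _ in range(dim))
        occupied.add(cell)
    return len(occupied) / 2 ** dim

for dim in (2, 5, 10, 20):
    print(dim, occupied_fraction(dim))
```

    With 1000 points the fraction cannot exceed 1000 / 2^d, so at d = 20 fewer than 0.1% of the cells contain any data at all. Keeping the same coverage would require the number of points to grow exponentially with d.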

    Reference:

    1. Time Series Data (2): Dimensions

    2. What is meant by 'high dimensional' time series?

    3. Everything Is an Embedding: From the Classic word2vec to item2vec, a Basic Deep Learning Operation

  • Original article: https://www.cnblogs.com/dulun/p/12232486.html