Title: A distance metric-based uniform subsampling method for nonparametric models
Abstract: Taking subset samples from the original data set is an efficient and popular strategy for handling massive data that is too large to be modeled directly. While in most theoretical advances the subset samples are collected independently with equal probability, for accurate inference and prediction it is beneficial to employ a fast subsampling scheme that selects observations intelligently. The majority of existing subset selection methods are designed for linear models or their extensions. In this paper, we propose a novel subsampling method for efficient nonparametric modeling of high-volume data sets. It is a proportionate sampling method that uses distance metric-based strata to select subsamples. To minimize the maximal distance between pairs of samples located in the same stratum, Voronoi cells of thinnest covering lattices are used to partition the space. With the help of an algorithm that quickly identifies the cell in which an observation lies, the computational cost of our subsampling method is proportional to the number of observations and independent of the number of cells, which makes our method applicable to extremely large data sets. Results from simulation studies and real data analysis show that the new method performs remarkably better than existing approaches when used in conjunction with $k$-nearest neighbor or Gaussian process models.
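The stratify-then-sample idea in the abstract can be illustrated with a minimal sketch. This is not the speaker's method: it swaps in a plain cubic grid for the Voronoi cells of a thinnest covering lattice, because a point's grid cell is found with a single floor operation. The function name `grid_stratified_subsample` and the parameters `cell_width` and `fraction` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

import numpy as np


def grid_stratified_subsample(X, cell_width, fraction, seed=0):
    """Proportionate stratified subsample with cubic-grid cells as strata.

    Assumption: a cubic grid stands in for the thinnest covering lattice
    described in the abstract. Returns sorted row indices into X.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Cell lookup: one floor per coordinate, so the cost grows with the
    # number of observations and not with the number of cells.
    cell_ids = np.floor(X / cell_width).astype(np.int64)

    # Group row indices by cell with a hash map (only occupied cells are stored).
    strata = defaultdict(list)
    for i, key in enumerate(map(tuple, cell_ids)):
        strata[key].append(i)

    chosen = []
    for members in strata.values():
        # Proportionate allocation: draw about fraction * (cell size) points,
        # using randomized rounding so the expected count is exact.
        expected = fraction * len(members)
        k = int(np.floor(expected))
        if rng.random() < expected - k:
            k += 1
        if k > 0:
            chosen.extend(rng.choice(members, size=k, replace=False))
    return np.sort(np.array(chosen, dtype=np.int64))


if __name__ == "__main__":
    X = np.random.default_rng(1).random((2000, 2))
    idx = grid_stratified_subsample(X, cell_width=0.1, fraction=0.1)
    print(len(idx), "of", len(X), "observations selected")
```

Because every occupied cell contributes in proportion to its size, the subsample tracks the spatial distribution of the full data more evenly than independent equal-probability draws would. A lattice whose Voronoi cells have a smaller covering radius than cubes would tighten the within-stratum distances further, which is the refinement the abstract describes.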
Time: November 26, 2024, 15:00-16:00
Venue: Room 109, School of Statistics and Data Science
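Speaker bio: Xu He (何煦) received his bachelor's degree from the School of Mathematical Sciences at Peking University and his Ph.D. from the Department of Statistics at the University of Wisconsin-Madison. After completing his Ph.D. in 2012, he joined the Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, where he is currently an associate professor. His main research interests include experimental design, in particular the design and analysis of computer simulation experiments. He has published more than ten papers in journals including the Annals of Statistics, Journal of the American Statistical Association, Biometrika, and Journal of Machine Learning Research. He has served as principal investigator on grants from the National Natural Science Foundation of China, including the Excellent Young Scientists Fund, the General Program, and the Young Scientists Fund.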