These notes collect CUDA knowledge points I began compiling many years ago (back in 2013). I had not written CUDA code in a long time, but LLM inference now requires me to relearn CUDA programming; it seems old debts always come due.
1. Inclusive scan and exclusive scan
Given a binary operator $\oplus$ and an $n$-element input array $[x_0, x_1, \ldots, x_{n-1}]$: if the output array is $[x_0, (x_0 \oplus x_1), \ldots, (x_0 \oplus x_1 \oplus \ldots \oplus x_{n-1})]$, the operation is an inclusive (closed) scan; if the output array is $[0, x_0, (x_0 \oplus x_1), \ldots, (x_0 \oplus x_1 \oplus \ldots \oplus x_{n-2})]$, it is an exclusive (open) scan. The serial inclusive scan algorithm is:
/*
 * x: input array
 * y: output array
 */
void sequential_scan(float* x, float* y, int Max_i) {
    y[0] = x[0];
    for (int i = 1; i < Max_i; ++i) {
        y[i] = y[i-1] + x[i];
    }
}
Note: converting between the inclusive and the exclusive output is simple; it only takes one shift and filling in one element.
(1) Inclusive to exclusive: shift all elements one position to the right and fill position 0 with the value 0.
(2) Exclusive to inclusive: shift all elements one position to the left; the last element becomes the sum of the original last output element and the last element of the input array.
Note: for the input array [3, 1, 7, 0, 4, 1, 6, 3], the inclusive scan outputs [3, 4, 11, 11, 15, 16, 22, 25] and the exclusive scan outputs [0, 3, 4, 11, 11, 15, 16, 22], which is easy to verify.
2. Simple parallel scan
In practice, parallel scan is often used as a primitive in other parallel algorithms, such as radix sort, quicksort, string comparison, polynomial evaluation, recurrence solving, tree operations, and histograms.
Analysis:
(1) __syncthreads() ensures that all threads have finished the addition of the current reduction-tree iteration before any thread starts the next iteration.
(2) "inclusive scan" refers to the closed-scan part, while "exclusive scan" refers to the open-scan part.
Note: besides the simple parallel scan, there are also the work-efficient parallel scan and the parallel scan for arbitrary input lengths. A minimal single-block kernel is sketched below.
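As an illustration of the points above, here is a minimal Kogge-Stone style inclusive scan kernel for a single thread block. It is a sketch of the simple (not the work-efficient) scan; SECTION_SIZE and the kernel name are my own choices, not from the original notes.

#define SECTION_SIZE 256

__global__ void koggestone_inclusive_scan(float* x, float* y, int n) {
    __shared__ float buf[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();   // all threads finish the previous iteration first
        float v = (threadIdx.x >= stride) ? buf[threadIdx.x - stride] : 0.0f;
        __syncthreads();   // read the old value before anyone overwrites it
        buf[threadIdx.x] += v;
    }
    if (i < n) y[i] = buf[threadIdx.x];
}

After log2(blockDim.x) iterations each buf[i] holds x[0] + ... + x[i] within the block; scanning inputs longer than one block requires the arbitrary-length variant mentioned above.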
3. Interoperability between Thrust and CUDA
Analysis: the interoperability between Thrust and CUDA supports an iterative development strategy: use the Thrust library to quickly prototype a parallel application, identify the program's bottlenecks, and then implement the critical algorithms in CUDA C with the necessary optimizations. When a Thrust function is called, it inspects the type of the iterator to determine whether to use a host or a device implementation. This process is known as static dispatching since the host/device dispatch is resolved at compile time. Note that this implies that there is no runtime overhead to the dispatch process.
(1) Thrust-to-CUDA interoperability:
size_t N = 1024;
device_vector<int> d_vec(N);
int* raw_ptr = raw_pointer_cast(&d_vec[0]);
cudaMemset(raw_ptr, 0, N*sizeof(int));
my_kernel<<<N / 128, 128>>>(N, raw_ptr);
Note: raw_pointer_cast() converts a device address into a raw C pointer, which can then be passed to CUDA C API functions or as an argument to a CUDA C kernel.
(2) CUDA-to-Thrust interoperability:
size_t N = 1024;
int* raw_ptr;
cudaMalloc((void**)&raw_ptr, N*sizeof(int));
device_ptr<int> dev_ptr = device_pointer_cast(raw_ptr);
sort(dev_ptr, dev_ptr + N);
dev_ptr[0] = 1;
cudaFree(raw_ptr);
Note: device_pointer_cast() wraps a raw C pointer as a Thrust device pointer so that the algorithms in the Thrust library can be applied to it.
4. The mapping between GPU/SM/SP and Grid/Block/Thread
Analysis: the GPU's work distribution unit assigns a Grid to the GPU chip. It distributes Blocks to SMs using a round-robin policy; whether a Block can be assigned depends on the amount of shared memory each Block uses, the number of registers each Block uses, and a few other constraints. The thread scheduler inside an SM then subdivides each assigned Block, grouping its threads into warps, and each Thread of a Block is issued on an SP. One SM can process several Blocks at the same time; for example, with 16 SMs, 64 Blocks, and 3 Blocks per SM, the device initially processes 48 Blocks concurrently while the remaining 16 Blocks wait for an SM. An SM executes only one warp of one Block at a time, but when the running warp has to wait (for example, on a global-memory access) the SM switches to another warp and keeps computing.
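To make the mapping concrete, here is a tiny kernel (my own illustration, not from the original notes) that computes a thread's global index from its grid/block coordinates and the warp it belongs to within its block:

__global__ void mapping_demo(int* global_id, int* warp_id) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // Grid -> Block -> Thread
    global_id[tid] = tid;
    warp_id[tid] = threadIdx.x / warpSize;            // which warp of this block
}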
5. Pinned memory
Analysis: malloc() allocates pageable host memory, whereas cudaHostAlloc() allocates page-locked host memory, also called pinned memory. Its key property is that the operating system never pages this memory out to disk, which guarantees that the memory will not be invalidated or relocated.
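A minimal sketch of allocating pinned memory and using it for an asynchronous copy (buffer names and sizes are my own; error checking omitted for brevity):

#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_pinned, *d_buf;
    cudaHostAlloc((void**)&h_pinned, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, N * sizeof(float));
    // ... fill h_pinned ...
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // cudaMemcpyAsync requires page-locked host memory to run asynchronously
    cudaMemcpyAsync(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}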
6. Installing CUDA 7.5 and cuDNN 5.0
Analysis:
(1) Unpacking the cuDNN archive produces three directories: cuda/include, cuda/lib, and cuda/bin.
(2) Copy the contents of cuda/include, cuda/lib, and cuda/bin into the corresponding include, lib, and bin directories under C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5.
Note: the cuDNN 5.0 build for CUDA 8.0 is not the same as the cuDNN 5.0 build for CUDA 7.5.
7. NVIDIA Deep Learning SDK
Analysis:
(1) Deep Learning Primitives (cuDNN): High-performance building blocks for deep neural network applications including convolutions, activation functions, and tensor transformations.
(2) Deep Learning Inference Engine (TensorRT): High-performance deep learning inference runtime for production deployment.
(3) Deep Learning for Video Analytics (DeepStream SDK): High-level C API and runtime for GPU-accelerated transcoding and deep learning inference.
(4) Linear Algebra (cuBLAS): GPU-accelerated BLAS functionality that delivers 6x to 17x faster performance than CPU-only BLAS libraries.
(5) Sparse Matrix Operations (cuSPARSE): GPU-accelerated linear algebra subroutines for sparse matrices that deliver up to 8x faster performance than CPU BLAS (MKL), ideal for applications such as natural language processing.
(6) Multi-GPU Communication (NCCL): Collective communication routines, such as all-gather, reduce, and broadcast that accelerate multi-GPU deep learning training on up to eight GPUs.
(7) NVIDIA DIGITS: Interactively manage data and train deep learning models for image classification, object detection, and image segmentation without the need to write code.
Note: the SDK also ships with Fast Fourier Transforms (cuFFT), Dense and Sparse Direct Solvers (cuSOLVER), Random Number Generation (cuRAND), Image & Video Processing Primitives (NPP), the NVIDIA Graph Analytics Library (nvGRAPH), Templated Parallel Algorithms & Data Structures (Thrust), and the CUDA Math Library.
8. istream_iterator and ostream_iterator
Analysis:
(1) template <class T, class charT = char, class traits = char_traits<charT>> class ostream_iterator;
#include <iostream>     // std::cout
#include <iterator>     // std::ostream_iterator
#include <vector>       // std::vector
#include <algorithm>    // std::copy

int main() {
    std::vector<int> myvector;
    for (int i = 1; i < 10; ++i) myvector.push_back(i*10);
    std::ostream_iterator<int> out_it(std::cout, ", ");
    std::copy(myvector.begin(), myvector.end(), out_it);
    return 0;
}
(2) template <class T, class charT = char, class traits = char_traits<charT>, class Distance = ptrdiff_t> class istream_iterator;
#include <iostream>     // std::cin, std::cout
#include <iterator>     // std::istream_iterator
using namespace std;

int main() {
    double value1, value2;
    std::cout << "Please, insert two values: ";
    std::istream_iterator<double> eos;              // end-of-stream iterator
    std::istream_iterator<double> iit(std::cin);    // stdin iterator
    if (iit != eos) value1 = *iit;
    ++iit;
    if (iit != eos) value2 = *iit;
    std::cout << value1 << " * " << value2 << " = " << (value1*value2) << '\n';
    return 0;
}
9. __host__ __device__ int foo(int a) {}
Analysis: __host__ int foo(int a) {} declares a function called by the CPU, while __device__ int foo(int a) {} declares a function called by the GPU. The __host__ and __device__ keywords can be combined: __host__ __device__ int foo(int a) {} is compiled into two versions, one callable by the CPU and one callable by the GPU.
10. SAXPY
Analysis: SAXPY (Scalar Alpha X Plus Y) is a function in the Basic Linear Algebra Subprograms (BLAS) package and a common operation on vector processors. SAXPY combines scalar multiplication with vector addition: y = a*x + y, where a is a scalar and x and y are vectors.
struct saxpy_functor
{
    const float a;
    saxpy_functor(float _a) : a(_a) {}
    __host__ __device__ float operator()(const float& x, const float& y) const {
        return a * x + y;
    }
};

void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    // Y <- A * X + Y
    thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    thrust::device_vector<float> temp(X.size());
    // temp <- A
    thrust::fill(temp.begin(), temp.end(), A);
    // temp <- A * X
    thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());
    // Y <- A * X + Y
    thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());
}
Note: a functor makes a class usable like a function. The implementation is simply a class that defines operator(); the class then behaves like a function and is called a functor class.
11. Transformations in Thrust
Analysis:
(1) thrust::fill
(2) thrust::sequence
(3) thrust::replace
(4) thrust::transform
(5) thrust::negate
(6) thrust::modulus
(7) thrust::zip_iterator
(8) thrust::for_each
A short example combining several of these follows.
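A minimal sketch (my own, not from the original notes) exercising fill, sequence, replace, transform, negate, and modulus; the vector sizes and values are arbitrary:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    thrust::device_vector<int> X(5);
    thrust::device_vector<int> Y(5);
    thrust::sequence(X.begin(), X.end());                     // X = {0, 1, 2, 3, 4}
    thrust::fill(Y.begin(), Y.end(), 2);                      // Y = {2, 2, 2, 2, 2}
    thrust::transform(X.begin(), X.end(), Y.begin(),
                      X.begin(), thrust::modulus<int>());     // X = {0, 1, 0, 1, 0}
    thrust::replace(X.begin(), X.end(), 1, 9);                // X = {0, 9, 0, 9, 0}
    thrust::transform(X.begin(), X.end(), X.begin(),
                      thrust::negate<int>());                 // X = {0, -9, 0, -9, 0}
    for (int i = 0; i < 5; i++) std::cout << X[i] << std::endl;
    return 0;
}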
12. Reductions in Thrust
(1) thrust::reduce
(2) thrust::count
(3) thrust::count_if
(4) thrust::min_element
(5) thrust::max_element
(6) thrust::is_sorted
(7) thrust::inner_product
(8) thrust::transform_reduce
(9) thrust::transform_inclusive_scan
(10) thrust::transform_exclusive_scan
A short example exercising a few of these follows.
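A minimal sketch (my own illustration) of count, min_element, max_element, and is_sorted on a small device vector:

#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/extrema.h>
#include <thrust/sort.h>
#include <iostream>

int main(void)
{
    int data[6] = { 1, 4, 2, 8, 5, 7 };
    thrust::device_vector<int> d(data, data + 6);
    std::cout << thrust::count(d.begin(), d.end(), 2) << std::endl;      // 1
    std::cout << *thrust::min_element(d.begin(), d.end()) << std::endl;  // 1
    std::cout << *thrust::max_element(d.begin(), d.end()) << std::endl;  // 8
    std::cout << thrust::is_sorted(d.begin(), d.end()) << std::endl;     // 0 (false)
    return 0;
}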
13. Initializing a thrust::device_vector
Analysis:
float x[4] = { 1.0, 2.0, 3.0, 4.0 };
thrust::device_vector<float> d_x(x, x + 4);
for (int i = 0; i < d_x.size(); i++)
    cout << d_x[i] << endl;
14. template <typename T> struct thrust::plus<T>
Analysis: declared in #include <thrust/functional.h>. For example, given a device_vector D:
int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());
float norm = std::sqrt(thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op));
15. cudaDeviceReset
Analysis: resets all resources of the current device that the calling thread has been associated with.
16. CUDART_VERSION
Analysis: for CUDA 7.5 the CUDART_VERSION macro is 7050; it is defined in the header #include <cuda_runtime_api.h>.
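A tiny sketch that prints the compile-time macro and, for comparison, queries the linked runtime at run time via cudaRuntimeGetVersion():

#include <cuda_runtime_api.h>
#include <cstdio>

int main(void)
{
    printf("Compiled against CUDART_VERSION = %d\n", CUDART_VERSION);
    int rt = 0;
    cudaRuntimeGetVersion(&rt);   // e.g. 7050 for CUDA 7.5
    printf("Linked runtime version = %d\n", rt);
    return 0;
}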
17. thrust::count
Analysis: the prototype of thrust::count is:
template <typename InputIterator, typename EqualityComparable>
thrust::iterator_traits<InputIterator>::difference_type thrust::count(
    InputIterator first,
    InputIterator last,
    const EqualityComparable& value
)
Note: count returns the number of iterators i in [first, last) such that *i == value.
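A short usage sketch (my own example values):

#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(void)
{
    thrust::device_vector<int> vec(5, 0);
    vec[1] = 1; vec[3] = 1;
    // counts the elements equal to 1 -> prints 2
    std::cout << thrust::count(vec.begin(), vec.end(), 1) << std::endl;
    return 0;
}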
18. transform_reduce
Analysis: the prototype of transform_reduce is:
template <typename InputIterator, typename UnaryFunction, typename OutputType, typename BinaryFunction>
OutputType thrust::transform_reduce(
    InputIterator first,
    InputIterator last,
    UnaryFunction unary_op,
    OutputType init,
    BinaryFunction binary_op
)
For example:
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cmath>
using namespace std;
using namespace thrust;

template <typename T>
struct square
{
    __host__ __device__ T operator()(const T& x) const { return x * x; }
};

int main(void)
{
    float x[4] = { 1.0, 2.0, 3.0, 4.0 };
    device_vector<float> d_x(x, x + 4);
    square<float> unary_op;
    thrust::plus<float> binary_op;
    float init = 10;
    // norm = 10 + 1 + 4 + 9 + 16 = 40
    float norm = thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op);
    cout << norm << endl;
    return 0;
}
19. Prefix-Sums: inclusive_scan and exclusive_scan
Analysis:
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>
using namespace std;
using namespace thrust;

int main(void)
{
    int data[6] = { 1, 0, 2, 2, 1, 3 };
    // data would become {1, 1, 3, 5, 6, 9}
    // data[2] = data[0] + data[1] + data[2]
    // thrust::inclusive_scan(data, data + 6, data);
    // data is now {0, 1, 1, 3, 5, 6}
    // data[2] = data[0] + data[1]
    thrust::exclusive_scan(data, data + 6, data);
    for (int i = 0; i < 6; i++) {
        cout << data[i] << endl;
    }
    return 0;
}
20. thrust::sort and thrust::stable_sort
Analysis: the prototype of thrust::stable_sort is:
template <typename DerivedPolicy, typename RandomAccessIterator, typename StrictWeakOrdering>
__host__ __device__ void thrust::stable_sort(
    const thrust::detail::execution_policy_base<DerivedPolicy>& exec,
    RandomAccessIterator first,
    RandomAccessIterator last,
    StrictWeakOrdering comp
)
(1) exec: the execution policy to use for parallelization.
(2) first: the beginning of the sequence.
(3) last: the end of the sequence.
(4) comp: the comparison operator.
For example:
#include <thrust/sort.h>
#include <iostream>
using namespace std;
using namespace thrust;

int main(void)
{
    const int N = 6;
    int A[N] = { 1, 4, 2, 8, 5, 7 };
    // A would become {1, 2, 4, 5, 7, 8}
    // thrust::sort(A, A + N);
    // A is now {1, 2, 4, 5, 7, 8}
    thrust::stable_sort(A, A + N);
    for (int i = 0; i < N; i++) {
        cout << A[i] << endl;
    }
    return 0;
}
(1) #include <thrust/functional.h>: function objects and utilities for the operations.
(2) #include <thrust/execution_policy.h>: Thrust execution policies.
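Since the note above mentions thrust/execution_policy.h, here is a tiny sketch (my own) of passing an execution policy explicitly to force a backend:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

int main(void)
{
    int A[6] = { 1, 4, 2, 8, 5, 7 };
    thrust::sort(thrust::host, A, A + 6);   // force the host backend explicitly
    return 0;
}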
21. thrust::sort_by_key and thrust::stable_sort_by_key
Analysis:
#include <thrust/sort.h>
#include <iostream>
using namespace std;
using namespace thrust;

int main(void)
{
    const int N = 6;
    int keys[N] = { 1, 4, 2, 8, 5, 7 };
    char values[N] = { 'a', 'b', 'c', 'd', 'e', 'f' };
    // keys would become   {  1,   2,   4,   5,   7,   8}
    // values would become {'a', 'c', 'b', 'e', 'f', 'd'}
    // thrust::sort_by_key(keys, keys + N, values);
    // keys is now   {  1,   2,   4,   5,   7,   8}
    // values is now {'a', 'c', 'b', 'e', 'f', 'd'}
    thrust::stable_sort_by_key(keys, keys + N, values);
    for (int i = 0; i < N; i++) {
        cout << values[i] << endl;
    }
    return 0;
}
22. Iterators in Thrust
Analysis:
(1) constant_iterator
(2) counting_iterator
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/reduce.h>
#include <iostream>
using namespace std;
using namespace thrust;

int main(void)
{
    thrust::constant_iterator<int> cfirst(10);
    thrust::constant_iterator<int> clast = cfirst + 3;
    // returns 30 (i.e. 3 * 10)
    cout << thrust::reduce(cfirst, clast) << endl;
    thrust::counting_iterator<int> first(10);
    thrust::counting_iterator<int> last = first + 3;
    // returns 33 (i.e. 10 + 11 + 12)
    cout << thrust::reduce(first, last) << endl;
    return 0;
}
(3) transform_iterator
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <iostream>
using namespace std;
using namespace thrust;

int main(void)
{
    thrust::device_vector<int> vec(3);
    vec[0] = 10; vec[1] = 20; vec[2] = 30;
    // returns -60 (i.e. -10 + -20 + -30)
    cout << thrust::reduce(
        thrust::make_transform_iterator(vec.begin(), thrust::negate<int>()),
        thrust::make_transform_iterator(vec.end(), thrust::negate<int>())) << endl;
    return 0;
}
(4) permutation_iterator
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>
using namespace std;
using namespace thrust;

int main(void)
{
    thrust::device_vector<int> map(4);
    map[0] = 3; map[1] = 1; map[2] = 0; map[3] = 5;
    thrust::device_vector<int> source(6);
    source[0] = 10; source[1] = 20; source[2] = 30;
    source[3] = 40; source[4] = 50; source[5] = 60;
    // sum = source[map[0]] + source[map[1]] + ... = 40 + 20 + 10 + 60 = 130
    int sum = thrust::reduce(
        thrust::make_permutation_iterator(source.begin(), map.begin()),
        thrust::make_permutation_iterator(source.begin(), map.end()));
    cout << sum << endl;
    return 0;
}
(5) zip_iterator
#include <thrust/iterator/zip_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <iostream>
using namespace std;
using namespace thrust;

int main(void)
{
    thrust::device_vector<int> A(3);
    thrust::device_vector<char> B(3);
    A[0] = 10; A[1] = 20; A[2] = 30;
    B[0] = 'x'; B[1] = 'y'; B[2] = 'z';
    // maximum of (int, char) tuples, compared lexicographically
    thrust::maximum< thrust::tuple<int, char> > binary_op;
    thrust::tuple<int, char> init =
        thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin()))[0];
    thrust::tuple<int, char> result = thrust::reduce(
        thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(A.end(), B.end())),
        init, binary_op);
    cout << thrust::get<0>(result) << endl;   // 30
    cout << thrust::get<1>(result) << endl;   // z
    return 0;
}
23. #include <stdlib.h>
Analysis:
(1) #define EXIT_SUCCESS 0
(2) #define EXIT_FAILURE 1
24. cuBLAS and CUBLASXT
Analysis: the CUDA 6 toolkit introduced a new API, CUBLASXT, which wraps a matrix-tiling algorithm on top of the cuBLAS API and thereby solves the problem of device memory being too small to hold large inputs.
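A hedged sketch of a CUBLASXT SGEMM call on a single GPU; the matrix size, buffer names, and leading dimensions are my own choices (cuBLAS is column-major), and error checking is omitted:

#include <cublasXt.h>
#include <cstdlib>

int main(void)
{
    const int n = 1024;
    float *A = (float*)malloc(n*n*sizeof(float));
    float *B = (float*)malloc(n*n*sizeof(float));
    float *C = (float*)malloc(n*n*sizeof(float));
    // ... fill A and B ...
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[1] = {0};
    cublasXtDeviceSelect(handle, 1, devices);   // run on device 0
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha*A*B + beta*C; CUBLASXT tiles the host matrices itself, so
    // they need not fit in device memory all at once.
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);
    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}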
25. The cuRAND library
Analysis: the cuRAND library provides an interface for generating random numbers on the GPU; the device API lives in the header #include <curand_kernel.h>.
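A minimal device-side cuRAND sketch (my own): each thread seeds its own state and draws one uniform float. The kernel and buffer names are illustrative.

#include <curand_kernel.h>
#include <cstdio>

__global__ void rand_kernel(float* out, unsigned long long seed, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    curandState state;
    curand_init(seed, id, 0, &state);   // seed, sequence, offset, state
    out[id] = curand_uniform(&state);   // uniform float in (0, 1]
}

int main(void)
{
    const int n = 8;
    float h[n];
    float* d;
    cudaMalloc((void**)&d, n*sizeof(float));
    rand_kernel<<<1, n>>>(d, 1234ULL, n);
    cudaMemcpy(h, d, n*sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%f\n", h[i]);
    cudaFree(d);
    return 0;
}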
26. CUDA synchronization
Analysis: CUDA offers synchronization at two levels:
(1) System-level: wait for all work on the host and the device to complete.
(2) Block-level: wait for all threads of a block on the device to reach a given point.
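A small sketch (my own illustration) contrasting the two levels: __syncthreads() is the block-level barrier used inside kernels, while cudaDeviceSynchronize() blocks the host until all device work has finished.

__global__ void reverse_in_block(float* data)
{
    __shared__ float tile[256];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();              // block-level: wait until every load is done
    data[t] = tile[255 - t];
}

// Host side:
//   reverse_in_block<<<1, 256>>>(d_data);
//   cudaDeviceSynchronize();     // system-level: wait for the kernel to finish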
27. On-chip and off-chip memory in CUDA
Analysis:
(1) Shared memory is on-chip; local memory and global memory are off-chip. Shared memory is shared by all threads of one block.
(2) Local memory is carved out of global memory and is private to each thread, while global memory itself is shared by all threads of a grid.
28. CUDA memory management
(1) cudaError_t cudaMalloc(void** devPtr, size_t count);
(2) cudaError_t cudaMallocPitch(void** devPtr, size_t* pitch, size_t widthInBytes, size_t height);
(3) cudaError_t cudaFree(void* devPtr);
(4) cudaError_t cudaMallocArray(struct cudaArray** array, const struct cudaChannelFormatDesc* desc, size_t width, size_t height);
(5) cudaError_t cudaFreeArray(struct cudaArray* array);
(6) cudaError_t cudaMallocHost(void** hostPtr, size_t size); // allocates page-locked host memory
(7) cudaError_t cudaFreeHost(void* hostPtr);
(8) cudaError_t cudaMemset(void* devPtr, int value, size_t count);
(9) cudaError_t cudaMemset2D(void* dstPtr, size_t pitch, int value, size_t width, size_t height);
(10) cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind);
Note: kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.
(11) cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream);
Note: it may only be used with page-locked host memory.
(12) cudaError_t cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind);
(13) cudaError_t cudaMemcpy2DAsync(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind, cudaStream_t stream);
Note: dpitch is the pitch of the destination memory and spitch the pitch of the source memory.
(14) cudaError_t cudaMemcpyToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t count, enum cudaMemcpyKind kind);
(15) cudaError_t cudaMemcpyToArrayAsync(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream);
Note: copies count bytes from the memory area pointed to by src into the CUDA array dstArray, starting at its upper-left corner (dstX, dstY).
(16) cudaError_t cudaMemcpy2DToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind);
(17) cudaError_t cudaMemcpy2DToArrayAsync(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind, cudaStream_t stream);
Note: copies a matrix from the memory area pointed to by src into the CUDA array dstArray, starting at its upper-left corner (dstX, dstY); spitch is the width in bytes of the 2D array pointed to by src, including any padding at the end of each row.
(18) cudaError_t cudaMemcpyFromArray(void* dst, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t count, enum cudaMemcpyKind kind);
(19) cudaError_t cudaMemcpyFromArrayAsync(void* dst, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream);
Note: copies count bytes from the CUDA array srcArray, starting at its upper-left corner (srcX, srcY), to the memory area pointed to by dst.
(20) cudaError_t cudaMemcpy2DFromArray(void* dst, size_t dpitch, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t width, size_t height, enum cudaMemcpyKind kind);
(21) cudaError_t cudaMemcpy2DFromArrayAsync(void* dst, size_t dpitch, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t width, size_t height, enum cudaMemcpyKind kind, cudaStream_t stream);
Note: copies a matrix from the CUDA array srcArray, starting at its upper-left corner (srcX, srcY), to the memory area pointed to by dst; dpitch is the width in bytes of the 2D array pointed to by dst, including any padding at the end of each row.
(22) cudaError_t cudaMemcpyArrayToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t count, enum cudaMemcpyKind kind);
Note: copies count bytes from the CUDA array srcArray, starting at (srcX, srcY), to the CUDA array dstArray, starting at (dstX, dstY).
(23) cudaError_t cudaMemcpy2DArrayToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t width, size_t height, enum cudaMemcpyKind kind);
Note: copies a matrix from the CUDA array srcArray, starting at (srcX, srcY), to the CUDA array dstArray, starting at (dstX, dstY).
(24) template<class T> cudaError_t cudaMemcpyToSymbol(const T& symbol, const void* src, size_t count, size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyHostToDevice);
Note: copies count bytes from the memory area pointed to by src to the memory area offset bytes from the start of symbol; symbol resides in global or constant memory on the device.
(25) template<class T> cudaError_t cudaMemcpyFromSymbol(void* dst, const T& symbol, size_t count, size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyDeviceToHost);
Note: copies count bytes from the memory area offset bytes from the start of symbol to the memory area pointed to by dst; symbol resides in global or constant memory on the device.
(26) template<class T> cudaError_t cudaGetSymbolAddress(void** devPtr, const T& symbol);
Note: returns in *devPtr the device address of symbol.
(27) template<class T> cudaError_t cudaGetSymbolSize(size_t* size, const T& symbol);
Note: returns in *size the size of symbol.