DataX supports reading from and writing to many kinds of data sources. As open-source software it provides offline data collection out of the box, which makes system development easier, but along the way you run into plenty of configuration details that you have to figure out on your own: free software still has its costs.

Configuration template
{
    "setting": {},
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "txtfilereader",
                    "parameter": {
                        "path": ["/data/test/test.txt"],
                        "encoding": "UTF-8",
                        "column": [
                            { "index": 0, "type": "string" },
                            { "index": 1, "type": "string" }
                        ],
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://****:9000",
                        "fileType": "TEXT",
                        "path": "/user/hive/warehouse/sz_center_devdb.db/cat",
                        "fileName": "catfile",
                        "column": [
                            { "name": "cat_id", "type": "STRING" },
                            { "name": "cat_name", "type": "STRING" }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "NONE"
                    }
                }
            }
        ]
    }
}
Note: the source text file must be uploaded to the server where DataX itself runs.
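To make the template concrete, here is a minimal sketch of preparing the source file and launching the job. The paths /opt/datax and /data/test/job.json are illustrative assumptions; substitute your own installation directory and job file.

# create the tab-delimited source file on the DataX server
# (two columns, matching the two reader columns in the template;
#  printf's \t guarantees real tab characters between the fields)
printf '1\thello\n2\tcat\n' > /data/test/test.txt

# launch the job; DataX ships with a Python launcher script
python /opt/datax/bin/datax.py /data/test/job.json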
Execution error 1
Hadoop permission exception:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=default, access=WRITE, inode="/user/hive/warehouse/sz_center_devdb.db":anonymous:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:336)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:241)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1909)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1893)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1852)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:323)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2635)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2577)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:532)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1020)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:948)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2952)
    at org.apache.hadoop.ipc.Client.call(Client.java:1476)
    at org.apache.hadoop.ipc.Client.call(Client.java:1407)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy9.create(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy10.create(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1623)
    ... 18 more

This fails because the job lacks write permission on the target HDFS directory. The job runs as user default, and the DataX template has no field for configuring the HDFS user, so the first fix is to open up the directory permissions on the Hadoop side.
Hadoop directory permission configuration
hdfs dfs -ls /
hdfs dfs -mkdir /user
hdfs dfs -mkdir /hbase
hdfs dfs -ls /
hadoop fs -chmod 777 /user
hadoop fs -chmod 777 /hbase
# set permissions recursively on all subdirectories
hadoop fs -chmod -R 777 /hbase

After this, the DataX job runs successfully. Opening everything up with chmod 777 is heavy-handed, though; a narrower alternative is sketched below.
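Assuming the cluster uses simple (non-Kerberos) authentication, two narrower options are to hand the target directory to the user DataX connects as, or to make the Hadoop client identify itself as the directory owner through the HADOOP_USER_NAME environment variable. The owner anonymous and the connecting user default below are taken from the stack trace above; treat the exact names as assumptions for your cluster.

# option 1: give the connecting user (default) ownership of the warehouse directory
hadoop fs -chown -R default:supergroup /user/hive/warehouse/sz_center_devdb.db

# option 2: have the DataX JVM connect as the directory owner (anonymous);
# the environment variable propagates from the shell to the launched JVM
export HADOOP_USER_NAME=anonymous
python /opt/datax/bin/datax.py /data/test/job.json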
However, the data read back through the Hive connection was garbled, which comes down to the Hive table's file type and delimiters not matching what DataX wrote. Going back over the execution log also reveals read errors:

[WI-0][TI-0] - [INFO] 2024-11-06 16:16:36.229 +0800 o.a.d.p.t.a.AbstractTask:[169] -
2024-11-06 16:16:35.230 [0-0-0-reader] INFO  TxtFileReader$Task - reading file : [/data/test/test.txt]
2024-11-06 16:16:35.231 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2024-11-06 16:16:35.268 [0-0-0-writer] INFO  HdfsWriter$Task - begin do write...
2024-11-06 16:16:35.268 [0-0-0-writer] INFO  HdfsWriter$Task - write to file : [hdfs://10.80.18.165:9000/user/hive/warehouse/sz_center_devdb.db/cat__f395492b_e42a_47e5_a52b_214ab8bf833a/catfile__d369974c_fdeb_4601_b118_67ae6e97e197]
2024-11-06 16:16:35.341 [0-0-0-reader] INFO  UnstructuredStorageReaderUtil - CsvReader使用默认值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":"\t","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}], csvReaderConfig值为[null]
2024-11-06 16:16:35.351 [0-0-0-reader] WARN  UnstructuredStorageReaderUtil - 您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[1 hello]
2024-11-06 16:16:35.356 [0-0-0-reader] ERROR StdoutPluginCollector - 脏数据: {"message":"您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[1 hello]","record":[{"byteSize":7,"index":0,"rawData":"1 hello","type":"STRING"}],"type":"reader"}
2024-11-06 16:16:35.357 [0-0-0-reader] WARN  UnstructuredStorageReaderUtil - 您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[2 cat]
2024-11-06 16:16:35.357 [0-0-0-reader] ERROR StdoutPluginCollector - 脏数据: {"message":"您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[2 cat]","record":[{"byteSize":5,"index":0,"rawData":"2 cat","type":"STRING"}],"type":"reader"}
2024-11-06 16:16:35.793 [0-0-0-writer] INFO  HdfsWriter$Task - end do write
2024-11-06 16:16:35.841 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[623]ms
2024-11-06 16:16:35.841 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed its tasks.
[WI-0][TI-0] - [INFO] 2024-11-06 16:16:45.231 +0800 o.a.d.p.t.a.AbstractTask:[169] -
2024-11-06 16:16:45.222 [job-0] INFO  StandAloneJobContainerCommunicator - Total 2 records, 12 bytes | Speed 1B/s, 0 records/s | Error 2 records, 12 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-11-06 16:16:45.222 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2024-11-06 16:16:45.223 [job-0] INFO  JobContainer - DataX Writer.Job [hdfswriter] do post work.
2024-11-06 16:16:45.224 [job-0] INFO  HdfsWriter$Job - start rename file [hdfs://10.80.18.165:9000/user/hive/warehouse/sz_center_devdb.db/cat__f395492b_e42a_47e5_a52b_214ab8bf833a/catfile__d369974c_fdeb_4601_b118_67ae6e97e197] to file [hdfs://10.80.18.165:9000/user/hive/warehouse/sz_center_devdb.db/cat/catfile__d369974c_fdeb_4601_b118_67ae6e97e197].
For now it is unclear whether this is a problem with the source file's format, with its encoding, or with the task configuration. I will update this post once there is a conclusion.
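One observation from the warnings above: the reader parsed each source row as a single column (e.g. 1 hello) even though fieldDelimiter is \t, which usually means the separators in the file are not real tab characters. A quick, hedged check on the DataX server (the path is the one from the job template):

# GNU cat -A prints tabs as ^I and line ends as $;
# plain spaces between the columns mean the \t delimiter will never match
cat -A /data/test/test.txt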
Execution anomaly 2

The data is empty, or the columns do not line up. In this case the execution log shows no error at all and the job result is reported as success, but the target Hive table contains no data.
When this happens, check the Hive table's delimiter configuration.

How to view a Hive table's delimiters

Execute the command:
show create table hello

In the output, look for 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'. With this default SerDe, the default row delimiter is "\n" and the default column delimiter is "^A" (Ctrl-A). The ^A character cannot be written into the DataX JSON directly; it has to be expressed as an escape sequence.
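For reference, the relevant portion of the show create table output looks roughly like this for a plain textfile table (a representative, abridged sketch, not output captured from the cluster above):

ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'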
The default storage format is textfile, which corresponds to fileType: TEXT in the DataX JSON.
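If the target table keeps the default LazySimpleSerDe delimiters, the writer has to emit ^A between fields, and the standard JSON escape \u0001 expresses Ctrl-A. A minimal sketch of the template's hdfswriter parameters adjusted accordingly (assumption: the table was created without FIELDS TERMINATED BY, so it uses the ^A default):

"writer": {
    "name": "hdfswriter",
    "parameter": {
        "defaultFS": "hdfs://****:9000",
        "fileType": "TEXT",
        "path": "/user/hive/warehouse/sz_center_devdb.db/cat",
        "fileName": "catfile",
        "column": [
            { "name": "cat_id", "type": "STRING" },
            { "name": "cat_name", "type": "STRING" }
        ],
        "writeMode": "append",
        "fieldDelimiter": "\u0001",
        "compress": "NONE"
    }
}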
References:
https://blog.csdn.net/mn525520/article/details/106876384
https://blog.csdn.net/u010520724/article/details/121999575
https://blog.csdn.net/qq_36039236/article/details/108101345
For the Hive CREATE TABLE statement and the DataX JSON job configuration that ultimately worked together, see the configuration example at www.fancv.com.
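As a minimal sketch (not the full example behind the link above), a Hive table consistent with the job template would declare its field delimiter explicitly, so that the reader, the writer, and the table all agree on \t; the tab delimiter here is an assumption carried over from the template:

CREATE TABLE sz_center_devdb.cat (
    cat_id STRING,
    cat_name STRING
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;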