Download the source code
svn checkout http://code.taobao.org/svn/datax/trunk
Environment
root@datanode158:~# java -version
java version "1.7.0_45"
root@datanode158:~# python -V
Python 2.7.3
root@datanode158:~# ant -version
Apache Ant(TM) version 1.8.2 compiled on December 3 2011
root@datanode158:~# g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
root@datanode158:~# rpm --version
RPM version 4.9.1.1
root@datanode158:~# dos2unix -V
dos2unix 5.3.1 (2011-08-09)
With native language support.
LOCALEDIR: /usr/share/locale
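Before starting the build, it can help to confirm that every tool listed above is actually on the PATH. The following is a minimal sketch, not part of DataX: the helper name missing_tools is made up for illustration, rpmbuild is added to the list because it is used in the steps below, and the sketch assumes a Python 3 interpreter is available for the helper (shutil.which does not exist in the Python 2.7.3 shown above).

```python
import shutil

# Tools whose versions are listed in the environment section, plus rpmbuild,
# which the build steps rely on. Adjust the list to your own setup.
REQUIRED_TOOLS = ["java", "python", "ant", "g++", "rpm", "rpmbuild", "dos2unix"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that the executables exist; it does not compare versions against the ones shown above.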
Steps:
1. Go to DataX's rpm directory: /datax/rpm
As root, run: rpmbuild -ba t_dp_datax_engine.spec
This fails with a pile of "File not found" errors:
RPM build errors:
    File not found: /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64/home/taobao/datax/bin
    File not found: /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64/home/taobao/datax/conf
    File not found: /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64/home/taobao/datax/engine
    File not found: /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64/home/taobao/datax/common
    File not found: /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64/home/taobao/datax/libs
    File not found: /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64/home/taobao/datax/logs
    File not found: /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64/home/taobao/datax/jobs
The source evidently was not cleaned up properly before being open-sourced, which left me thoroughly confused =,=
Modify t_dp_datax_engine.spec as follows (each "//" note marks a change to make and is not part of the file itself):

summary: engine provides core scheduler and data swap storage for DataX
Name: t_dp_datax_engine
Version: 1.0.0
Release: 1
Group: System
License: GPL
AutoReqProv: no
BuildArch: noarch

%define dataxpath /home/taobao/datax    // change to %{buildroot}/home/taobao/datax
%define vdataxpath /home/taobao/datax    // add this line; vdataxpath is used below

%description
DataX Engine provides core scheduler and data swap storage for DataX

%prep
cd ${OLDPWD}/../
export LANG=zh_CN.UTF-8
ant dist

%build

%install
dos2unix ${OLDPWD}/../release/datax.py
mkdir -p %{dataxpath}/bin
mkdir -p %{dataxpath}/conf
mkdir -p %{dataxpath}/engine
mkdir -p %{dataxpath}/common
mkdir -p %{dataxpath}/libs
mkdir -p %{dataxpath}/jobs
mkdir -p %{dataxpath}/logs
cp ${OLDPWD}/../jobs/sample/*.xml %{dataxpath}/jobs
cp ${OLDPWD}/../release/*.py %{dataxpath}/bin/
cp -r ${OLDPWD}/../conf/*.properties %{dataxpath}/conf
cp -r ${OLDPWD}/../conf/*.xml %{dataxpath}/conf
cp -r ${OLDPWD}/../build/engine/*.jar %{dataxpath}/engine
cp -r ${OLDPWD}/../build/common/*.jar %{dataxpath}/common
cp ${OLDPWD}/../c++/build/libcommon.so %{dataxpath}/common
cp -r ${OLDPWD}/../libs/commons-io-2.0.1.jar %{dataxpath}/libs
cp -r ${OLDPWD}/../libs/commons-lang-2.4.jar %{dataxpath}/libs
cp -r ${OLDPWD}/../libs/dom4j-2.0.0-ALPHA-2.jar %{dataxpath}/libs
cp -r ${OLDPWD}/../libs/jaxen-1.1-beta-6.jar %{dataxpath}/libs
cp -r ${OLDPWD}/../libs/junit-4.4.jar %{dataxpath}/libs
cp -r ${OLDPWD}/../libs/log4j-1.2.16.jar %{dataxpath}/libs
cp -r ${OLDPWD}/../libs/slf4j-api-1.4.3.jar %{dataxpath}/libs
cp -r ${OLDPWD}/../libs/slf4j-log4j12-1.4.3.jar %{dataxpath}/libs

%post
chmod -R 0777 %{dataxpath}/jobs    // change to chmod -R 0777 %{vdataxpath}/jobs
chmod -R 0777 %{dataxpath}/logs    // change to chmod -R 0777 %{vdataxpath}/logs

%files
%defattr(0755,root,root)
%{dataxpath}/bin    // change to %{vdataxpath}/bin
%{dataxpath}/conf    // change to %{vdataxpath}/conf
%{dataxpath}/engine    // change to %{vdataxpath}/engine
%{dataxpath}/common    // change to %{vdataxpath}/common
%{dataxpath}/libs    // change to %{vdataxpath}/libs
%attr(0777,root,root) %dir %{dataxpath}/logs    // change to %attr(0777,root,root) %{vdataxpath}/logs
%attr(0777,root,root) %dir %{dataxpath}/jobs    // change to %attr(0777,root,root) %{vdataxpath}/jobs

%changelog
* Fri Aug 20 2010 meining
- Version 1.0.0
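Why two path macros are needed can be illustrated with a small sketch. rpmbuild resolves every entry in %files relative to the build root, so files have to be staged under the build root during %install, while %files and %post name the final installed path. The helper staged_path below is hypothetical and only mimics that path mapping; the build root value is the one that appears in the error messages of this post.

```python
import posixpath

# Build root as it appears in the "File not found" errors of this post.
BUILDROOT = "/root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64"

# Final install prefix (the value of the vdataxpath macro).
INSTALL_PREFIX = "/home/taobao/datax"


def staged_path(buildroot, install_path):
    """Map a final install path to the place %install must stage it under."""
    return posixpath.join(buildroot, install_path.lstrip("/"))


if __name__ == "__main__":
    for sub in ["bin", "conf", "engine", "common", "libs", "logs", "jobs"]:
        final = posixpath.join(INSTALL_PREFIX, sub)
        print(final + "  is staged at  " + staged_path(BUILDROOT, final))
```

The staged paths printed here are exactly the ones the failed build reported as missing, which is consistent with the original spec writing to /home/taobao/datax directly instead of to the build root.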
Build again:
Processing files: t_dp_datax_engine-1.0.0-1.noarch
Checking for unpackaged file(s): /usr/lib/rpm/check-files /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64
Wrote: /root/rpmbuild/SRPMS/t_dp_datax_engine-1.0.0-1.src.rpm
Wrote: /root/rpmbuild/RPMS/noarch/t_dp_datax_engine-1.0.0-1.noarch.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.y3UwSl
+ umask 022
+ cd /root/rpmbuild/BUILD
+ /bin/rm -rf /root/rpmbuild/BUILDROOT/t_dp_datax_engine-1.0.0-1.x86_64
+ exit 0
Go to /root/rpmbuild/RPMS/noarch
Deploy
rpm -ivh t_dp_datax_engine-1.0.0-1.noarch.rpm
The installation is now complete!
Test
root@datanode158:~/rpmbuild/RPMS/noarch# python /home/taobao/datax/bin/datax.py -e true
Taobao DataX V1.0
Data Source List :
0 mysql
1 sqlserver
2 http
3 fake
4 stream
5 oracle
6 hdfs
7 hbase
Please choose [0-7]: 2
Data Destination List :
0 stream
1 mysql
2 hdfs
3 oracle
4 hbase
Please choose [0-4]: 0
Generate /home/taobao/datax/jobs/httpreader_to_streamwriter_1396012010274.xml successfully .
Configure /home/taobao/datax/jobs/httpreader_to_streamwriter_1396012010274.xml
<?xml version="1.0" encoding="UTF-8"?>
<jobs>
  <job id="httpreader_to_streamwriter_job">
    <reader>
      <plugin>httpreader</plugin>
      <!-- default:; description:how to split url mandatory:false name:URLDelimiter -->
      <param key="urldelimiter" value=";"/>
      <!-- default:\t description:separator to split urls mandatory:false name:fieldSplit -->
      <param key="field_split" value="\t"/>
      <!-- default:UTF-8 range:UTF-8|GBK|GB2312 description:encode mandatory:false name:encoding -->
      <param key="encoding" value="UTF-8"/>
      <!-- default:\N description:replace this nullString to null mandatory:false name:nullString -->
      <param key="null_string" value="\N"/>
      <!-- range:legal http url description:url to fetch data mandatory:true name:httpURLs -->
      <param key="httpurls" value="http://www.baidu.com"/>
      <!-- default:1 range:1-100 description:concurrency of the job mandatory:false name:concurrency -->
      <param key="concurrency" value="1"/>
    </reader>
    <writer>
      <plugin>streamwriter</plugin>
      <!-- default:\t description:seperator to seperate field mandatory:false name:fieldSplit -->
      <param key="field_split" value="\t"/>
      <!-- default:UTF-8 range:UTF-8|GBK|GB2312 description:stream encode mandatory:false name:encoding -->
      <param key="encoding" value="UTF-8"/>
      <!-- range: description:print result with prefix mandatory:false name:prefix -->
      <param key="prefix" value="baidu"/>
      <!-- default:true range: description:print the result mandatory:false name:print -->
      <param key="print" value="true"/>
      <!-- default: range: description:replace null with the nullchar mandatory:false name:nullchar -->
      <param key="nullchar" value="hello"/>
      <!-- default:1 range:1 description:concurrency of the job mandatory:false name:concurrency -->
      <param key="concurrency" value="1"/>
    </writer>
  </job>
</jobs>
In the auto-generated XML file, any value marked with "?" must be configured by the user; the default values elsewhere can be changed as needed.
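A quick way to spot values that still need to be filled in is to scan the job file for the "?" placeholder. The sketch below is only an illustration: the function name unconfigured_params and the sample XML are made up, and it assumes the placeholder is written as a value attribute that is exactly "?".

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened job file with one value left unconfigured.
SAMPLE_JOB = """
<jobs>
  <job id="httpreader_to_streamwriter_job">
    <reader>
      <plugin>httpreader</plugin>
      <param key="encoding" value="UTF-8"/>
      <param key="httpurls" value="?"/>
    </reader>
    <writer>
      <plugin>streamwriter</plugin>
      <param key="print" value="true"/>
    </writer>
  </job>
</jobs>
"""


def unconfigured_params(root):
    """Return (section, key) pairs for <param> elements whose value is "?"."""
    pending = []
    for job in root.iter("job"):
        for section in job:  # the <reader> and <writer> elements
            for param in section.iter("param"):
                if param.get("value", "").strip() == "?":
                    pending.append((section.tag, param.get("key")))
    return pending


if __name__ == "__main__":
    for section, key in unconfigured_params(ET.fromstring(SAMPLE_JOB)):
        print("still unconfigured: " + section + "/" + key)
```

For a real job file, pass ET.parse(path).getroot() instead of the sample string.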
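Run
The command to run DataX is:
/home/taobao/datax/bin/datax.py job.xml
Here /home/taobao/datax/bin/datax.py is the Python wrapper around the DataX command line and the entry point of the whole DataX program, and job.xml is the configuration file of the job.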
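If the job needs to be launched from another script, the command can be wrapped with subprocess. This is a rough sketch under stated assumptions: the helper names build_command and run_job are invented here, the interpreter name "python" follows the test invocation earlier in this post, and whether the exit code of datax.py reliably reflects job success is not confirmed by this post.

```python
import subprocess
import sys

# Install location used throughout this post.
DATAX_PY = "/home/taobao/datax/bin/datax.py"


def build_command(job_xml, python="python", datax_py=DATAX_PY):
    """Build the argument list for running one DataX job file."""
    return [python, datax_py, job_xml]


def run_job(job_xml):
    """Run the job and return the exit code of the datax.py process."""
    return subprocess.call(build_command(job_xml))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        sys.exit(run_job(sys.argv[1]))
    print("usage: run_datax_job.py <job.xml>")
```

Passing the command as a list avoids shell quoting problems when the job path contains spaces.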
With the configuration above, the program downloads the Baidu home page:
................ (HTML and JS code of the Baidu page)
2014-03-28 21:13:02,204 [main] INFO schedule.Engine - DataX Reader post work begins .
2014-03-28 21:13:02,204 [main] INFO schedule.Engine - DataX Reader post work ends .
2014-03-28 21:13:02,204 [main] INFO schedule.Engine - DataX Writers post work begins .
2014-03-28 21:13:02,205 [main] INFO schedule.Engine - DataX Writers post work ends .
2014-03-28 21:13:02,205 [main] INFO schedule.Engine - DataX job succeed .
2014-03-28 21:13:02,210 [main] INFO schedule.Engine -
DataX starts work at : 2014-03-28 21:13:00
DataX ends work at : 2014-03-28 21:13:02
Total time costs : 2s
Average byte speed : 26KB/s
Average line speed : 1L/s
Total transferred records : 1
Total discarded records : 0
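The summary at the end of the log is regular enough to be picked apart with a regular expression, which is handy when collecting results from several runs. The sketch below is an illustration only: parse_summary is a made-up helper, and it assumes the field labels appear exactly as in the log shown in this post.

```python
import re

# Field labels as they appear in the job summary of this post.
SUMMARY_FIELDS = (
    "Total time costs",
    "Average byte speed",
    "Average line speed",
    "Total transferred records",
    "Total discarded records",
)


def parse_summary(log_text):
    """Extract the summary fields from DataX log text as a dict of strings."""
    stats = {}
    for name in SUMMARY_FIELDS:
        match = re.search(re.escape(name) + r"\s*:\s*(\S+)", log_text)
        if match:
            stats[name] = match.group(1)
    return stats


if __name__ == "__main__":
    sample = (
        "Total time costs : 2s\n"
        "Average byte speed : 26KB/s\n"
        "Average line speed : 1L/s\n"
        "Total transferred records : 1\n"
        "Total discarded records : 0\n"
    )
    for name, value in parse_summary(sample).items():
        print(name + " = " + value)
```

Values are kept as strings, units included, because the post does not document the full range of formats the summary can take.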