hadoop - Flume taking time to copy data into HDFS when rolling based on file size


I have a use case where I want to copy remote files into HDFS using Flume. The copied files should align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.

I am using an Avro source and sink to copy the remote data into HDFS. On the sink side I am rolling by file size (128 MB, 256 MB). But to copy a file from the remote machine and store it in HDFS (file size 128/256 MB), Flume takes an average of 2 minutes.

Flume configuration, agent on the remote machine (spooling directory source, Avro sink):

    ### agent1 - spooling directory source, file channel, Avro sink ###

    # Name the components on this agent
    agent1.sources = spooldir-source
    agent1.channels = file-channel
    agent1.sinks = avro-sink

    # Describe/configure the source
    agent1.sources.spooldir-source.type = spooldir
    agent1.sources.spooldir-source.spoolDir = /home/benchmarking_simulation/test

    # Describe the sink
    agent1.sinks.avro-sink.type = avro
    # IP address of the destination machine
    agent1.sinks.avro-sink.hostname = xx.xx.xx.xx
    agent1.sinks.avro-sink.port = 50000

    # Use a channel that buffers events in a file
    agent1.channels.file-channel.type = file
    agent1.channels.file-channel.checkpointDir = /home/flume_checkpoint_dir/
    agent1.channels.file-channel.dataDirs = /home/flume_data_dir/
    agent1.channels.file-channel.capacity = 10000000
    agent1.channels.file-channel.transactionCapacity = 50000

    # Bind the source and sink to the channel
    agent1.sources.spooldir-source.channels = file-channel
    agent1.sinks.avro-sink.channel = file-channel

Agent on the machine where HDFS is running (Avro source, HDFS sink):

    ### agent1 - Avro source, file channel, HDFS sink ###

    # Name the components on this agent
    agent1.sources = avro-source1
    agent1.channels = file-channel1
    agent1.sinks = hdfs-sink1

    # Describe/configure the source
    agent1.sources.avro-source1.type = avro
    agent1.sources.avro-source1.bind = xx.xx.xx.xx
    agent1.sources.avro-source1.port = 50000

    # Describe the sink
    agent1.sinks.hdfs-sink1.type = hdfs
    agent1.sinks.hdfs-sink1.hdfs.path = /user/benchmarking_data/multiple_agent_parallel_1
    agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
    agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
    agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
    agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
    agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
    agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
    agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 1000
    agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
    agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000

    # Use a channel that buffers events in a file
    agent1.channels.file-channel1.type = file
    agent1.channels.file-channel1.checkpointDir = /home/flume_check_point_dir
    agent1.channels.file-channel1.dataDirs = /home/flume_data_dir
    agent1.channels.file-channel1.capacity = 100000000
    agent1.channels.file-channel1.transactionCapacity = 100000

    # Bind the source and sink to the channel
    agent1.sources.avro-source1.channels = file-channel1
    agent1.sinks.hdfs-sink1.channel = file-channel1

Network connectivity between the two machines is 686 Mbps.
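As a rough sanity check, the figures above can be turned into rates. This is only a back-of-the-envelope sketch and rests on assumptions not stated in the post: that the 2 minutes is per rolled file, that "Mbps" means megabits per second, and that 1 GB = 1024 MB.

```latex
\begin{align*}
\text{observed rate (128 MB files)} &\approx \frac{128\ \text{MB}}{120\ \text{s}} \approx 1.07\ \text{MB/s} \\
\text{observed rate (256 MB files)} &\approx \frac{256\ \text{MB}}{120\ \text{s}} \approx 2.13\ \text{MB/s} \\
\text{link capacity} &\approx \frac{686\ \text{Mbit/s}}{8} = 85.75\ \text{MB/s} \\
\text{total time for 33 GB in 128 MB files} &\approx \frac{33 \times 1024\ \text{MB}}{128\ \text{MB}} \times 2\ \text{min} = 528\ \text{min} \approx 8.8\ \text{h}
\end{align*}
```

Under those assumptions the observed rate is a small fraction of what the link could carry, which suggests the network itself is unlikely to be the limiting factor.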

Can someone please help me identify whether something is wrong in the configuration, or suggest an alternative configuration so that copying doesn't take so much time?

Both agents use a file channel, so before being written to HDFS the data has already been written to disk twice. You can try using a memory channel for each agent to see if performance improves.
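A minimal sketch of that change for the HDFS-side agent, reusing the component names from the question. The channel name `mem-channel1` is hypothetical, and the capacity values are illustrative assumptions rather than tuned recommendations:

```properties
# Sketch only: swap the file channel for a memory channel on the HDFS-side agent.
# "mem-channel1" is a hypothetical name; the numbers below are assumed values.
agent1.channels = mem-channel1

agent1.channels.mem-channel1.type = memory
# Maximum number of events held in the JVM heap (assumed; size the heap to match)
agent1.channels.mem-channel1.capacity = 1000000
# Should be at least as large as the sink's hdfs.batchSize (50000 in the question)
agent1.channels.mem-channel1.transactionCapacity = 100000

# Rebind the source and sink to the new channel
agent1.sources.avro-source1.channels = mem-channel1
agent1.sinks.hdfs-sink1.channel = mem-channel1
```

The same substitution would apply to the sending agent. The trade-off is durability: a memory channel keeps events only in the agent's heap, so events still in the channel are lost if the agent stops or crashes, whereas the file channel persists them to disk.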

