java - Process unstructured, multi-line CSV in Hadoop


I would like to process the following data in Hadoop MapReduce. It is unstructured, has records that span multiple lines, and contains unescaped quotation marks.

  2/1/2013 5:16, Edward Felton, 2,8 / 1/2012 3:57, weeks for working on some digital elements for our big event in Sydney ... For more travel, visit http://www.xy.com/au/geworks/, 324005862,2,18200695 12/28/2012 19:28, Laura McCullum, 2,7 / 26/2012 18:03, "You  

"http://youtu.be/qfq9LVD2Qr4" & gt; http://youtu.be/qfq9LVD2Qr4 & lt; br & gt; & lt; Br> If you 'like always want to destroy a cube!', 502114904,2,18400313 11/21/2012 13:35, Timothy Wiedson, 4.8 / 17/2012 12:38, " Is a table really a world of laptops With the new Windows tablet on the horizon and Apple / Android devices, I was wondering if it is possible to actually work with just the tablet. My mission: - A whole week I am working with Z Off my iPad hardware: - Apple iPad - Apple keyboard - Apple-HDMI enabled monitor for HDMI connector - IKAS iPad Stand :-) ", 105001439,1,19301609 2013/03/15 13: 43, Mary Romio, 3,8 / 16/2012 22:23, "How long have you posted the link Are you okay? Br> The attached image tells how to shorten a longer URL before posting it. There can be a small link to post 3-4 line URLs in 4 easy steps. ", 21302232 9, 19, 9 01561 11/30/2012 2:17, Lu Yin Zhong, 3,8 / 29/2012 1:29, working on the 2013 Como Plan ... big ideas are needed! , 302014449,2, 20300666 3/5/2013 22:15, Tim Stiegert, 12,8 / 2 9/2012 15:36, "Looking at 1024 email addresses. Manual? May be one day! Do this with SSOget, # [& amp; Quot; Excel & amp; Quot;]? 5 minutes! Saved attempt and # [& amp; Quot; Productivity & amp; Quot;] received? Priceless! Now go and enjoy it for yourself! :) & Lt; Br> Http: //sc.xy.com/*SSOget @@@ data @@@ {& amp; Quot; Image & amp; Quot ;: & amp; Quot; & Amp; Quot ;, & amp; Quot; Title & amp; Quot ;: & amp; Quot; & Amp; Quot;} ", 100011871,1120400713 11/1/2012 20:46, Prana Jain, 2,8 / 30/2012 14:26, people agree with iCloud restrictions that AirWatch has put on personal iOS devices email Will, 212065316,0,20700913 2012/11/09 18: 32, Monika Sharma, 5,9 / 7/2012 11:42, hhghghghgh hg gh gh gh gh ghghghghghghghghgh hg h gh gh gh gh ghghghghghghgh Hg h gh gh gh ghghghghghghgh hg h gh gh ghghghghghghgh hg h gh h gh ghghghghghghgh hg h gh h gh gh ghghghghghghgh hg h gh h gh gh ghghghghghghghghgh hg h gh gh gh gh ghghghghghghghghgh hg J gh gh, 502000192,5,21400516

Please suggest a code snippet to handle this data. Thanks in advance!

Because your data spans multiple lines, you cannot use the plain TextInputFormat to read it: it treats every physical line as a separate record. Instead, you need a custom InputFormat for CSV files.

There is currently no built-in way of processing multi-line CSV files in Hadoop, but luckily you can find code for this on GitHub.
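To give an idea of what such a custom reader has to do, here is a minimal plain-Java sketch of the core step: gluing physical lines back together while a quoted field is still open. This is not the GitHub code mentioned above; the class and method names (`MultiLineCsvRecords`, `assemble`) are made up for illustration, and it assumes quotes are balanced, which is why the unescaped quotes need cleaning first. In a real job this logic would sit inside a custom RecordReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups physical lines into logical CSV records.
final class MultiLineCsvRecords {
    private MultiLineCsvRecords() {}

    // Joins lines until every opened quote has been closed again.
    static List<String> assemble(Iterable<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : physicalLines) {
            if (insideQuotes) {
                current.append('\n'); // the line break belongs to the quoted field
            }
            current.append(line);
            insideQuotes = updateQuoteState(line, insideQuotes);
            if (!insideQuotes) {
                if (current.length() > 0) { // ignore blank lines between records
                    records.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (insideQuotes) {
            records.add(current.toString()); // unterminated quote: emit what is left
        }
        return records;
    }

    // Toggles the state on every quote; a backslash skips the next character.
    static boolean updateQuoteState(String line, boolean insideQuotes) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, e.g. \"
            } else if (c == '"') {
                insideQuotes = !insideQuotes;
            }
        }
        return insideQuotes;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,\"hello", "world\",2", "3,plain,4");
        for (String record : assemble(lines)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

A single stray quote flips the state for the rest of the input, so every following record would be merged into one. That is the reason to clean the quotes before this step.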

As for the unescaped quotes, you may need to pre-process the data and clean it up first. A simple rule would be to escape a quote (") if there is no separator directly before or after it:

  • Escape: a"b => a\"b
  • Leave unchanged: a;"b and a";b

Another option would be to fix the application that produces the invalid CSV, so that the data is correct in the first place.
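The escaping rule above could look roughly like this as a pre-processing step. It is only a sketch: `QuoteCleaner` and `escapeStrayQuotes` are names I made up, the separator is a parameter (`;` in the examples above, `,` in your data), and I am assuming that the start and end of a line count as field boundaries too. It does not handle spaces between the separator and the quote, or input that already uses doubled quotes ("") as an escape.

```java
// Hypothetical pre-processing helper for the escaping rule described above.
final class QuoteCleaner {
    private QuoteCleaner() {}

    // Escapes each quote that has neither a separator nor a line boundary next to it.
    static String escapeStrayQuotes(String line, char separator) {
        StringBuilder out = new StringBuilder(line.length() + 8);
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                boolean boundaryBefore = i == 0 || line.charAt(i - 1) == separator;
                boolean boundaryAfter = i == line.length() - 1 || line.charAt(i + 1) == separator;
                if (!boundaryBefore && !boundaryAfter) {
                    out.append('\\'); // stray quote in the middle of a field
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayQuotes("a\"b", ';'));  // prints a\"b
        System.out.println(escapeStrayQuotes("a;\"b", ';')); // prints a;"b
        System.out.println(escapeStrayQuotes("a\";b", ';')); // prints a";b
    }
}
```

You would run this over each line (in a map-only job or a plain script) before handing the file to the multi-line CSV reader.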

