Every year Hortonworks, together with Yahoo, puts on the DataWorks / Hadoop Summit, a 2-3 day conference dedicated to Big Data and its technologies. This year it was my turn to attend, so I’ve compiled a quick summary.
DWS17 kicked off with an epic (to say the least) laser show.
From the welcome keynote on Day 1, the emphasis was on the data itself. It’s not about Hadoop or the platform anymore, but about how to create value from the data in your organisation or the data you have collected. That data is also the reason why the summit has been renamed from the “Hadoop Summit” to the “DataWorks Summit”. With the ability to process and use data in an entirely different way than before, new businesses will emerge from data.
“In today’s world, data is actually our product.”
Scott Gnau, the Chief Technology Officer at Hortonworks, talked about how the future of the enterprise is centered around four paradigms: Cloud computing, Artificial Intelligence, the Internet of Things and Streaming Data. Many of these are already in use in organisations, Cloud computing especially. Artificial Intelligence, in itself a broader area, is getting a lot of traction as Machine Learning becomes more accessible through services like Microsoft Azure Machine Learning Studio.
As for the other keynotes on both mornings, the sponsor keynotes were a little hit-and-miss.
Delightfully, the last morning keynote on Day 1, by Dr. Barry Devlin, shook things up by outlining the fall of capitalism and how A.I. will inevitably replace the factory worker. This is, of course, if we continue on our present course. It was a very interesting take on the future of Big Data and life beyond it, considering the speed at which current and new technologies are developing. As technological progress increases at an exponential rate, a crash is almost inevitable. A somewhat morbid start to the summit, you could say, but thankfully the presentation had a silver lining at the end: we are now at the turning point, where we can influence how the future turns out and how steep the downward curve becomes. Hopefully we are able to level it out and avoid Dr. Devlin’s Skynet-esque future 🙂
On Day 2, the last keynote, by Dr. Rand Hindi, was a quick look into privacy issues in Cloud computing. With the introduction of personal voice assistants like Amazon Alexa and Google Home, technology companies should be giving more and more thought to where consumers’ data is processed. Voice patterns are, after all, just as unique as fingerprints.
This year, as the focus was on the data itself, you could see that many of the Breakout Sessions were showcases of implementations by different companies. BMW, Société Générale, Lloyds Bank and Klarna all showed how they’d leveraged Hadoop in their Big Data journeys. Data Science also played a big role at DWS17, as many of the customer showcases and Breakout Sessions had a Data Science theme.
Live Long And Process
Looking at the agenda for the two days at DWS17, one thing jumped out: Hive. Specifically, Hive with LLAP. This was evident in the number of Hive (and LLAP) -specific Breakout Sessions. Apache Hive has been part of the HDP stack forever, and has been a staple of many of our POC architectures at Bilot. Back in 2016, the launch of the Hive 2.0 LLAP Tech Preview made a lot of people happy, as Hive 1.x query speeds lacked the required punch and full ACID support was missing. Now, with the newest version of the platform, LLAP is a reality (GA), and the many sessions at DWS17 indicated it’s a big deal. Query times are reduced by an order of magnitude, which is definitely something to be excited about.
LLAP also adds value to other newer technologies coming into the HDP stack. Druid, a new time-series optimised data store, can leverage LLAP’s parallel processing capabilities to speed up query times. I’m especially excited to test out Druid, as it will come bundled with HDP 2.6 and thus be deployable via Ambari blueprints. It’s currently in beta, but will hopefully mature quickly.
Hortonworks DataFlow, powered by Apache NiFi, looked to be Hortonworks’ next big thing. Teradata, for example, has open sourced its new “data lake management software platform”, Kylo, which leverages NiFi for pipeline orchestration. Hortonworks DataFlow still requires a fair amount of infrastructure to run, but as its little brother MiNiFi (a lightweight edge version of NiFi) matures, I think the whole edge-node processing paradigm will take off in a completely different way, especially once you can run NiFi on very resource-scarce systems.
But we’ll have to stay tuned.
HDP 2.6 and beyond
Funnily enough, the launch of the new major releases of HDP and Ambari wasn’t hyped at DWS17 as much as I would have expected. Granted, there was a fair amount of buzz around the new features, but the focus was definitely elsewhere. That being said, the announcement was still important: many of the new, cool features are only available with HDP 2.6 and Ambari 2.5, so users will need to upgrade their existing systems to leverage LLAP and Druid, for example. I for one will definitely be doing some upgrading 🙂
Beyond the newest version of HDP lies Hadoop 3.0. It could be released as early as Q4/2017 and will bring improvements to resource management as well as container support (yay!). This will make Hadoop itself more resource-aware and mean better performance. The usage of Docker has exploded since its initial release four years ago, and some of the newer Hortonworks apps, such as Cloudbreak, already take advantage of the technology. So with the addition of container support to Hadoop, YARN could potentially control non-Hadoop services and applications deployed in containers.
The DataWorks Summit is definitely something you need in your life if Big Data is on your roadmap or you’re already knee-deep in it. I’m glad I went, since getting to talk to the developers and community members directly is invaluable.
Stay tuned for some blog posts on specific technologies related to what was showcased and discussed at DWS17. There are several key parts of the new HDP release that can be discussed at greater length.
If you’re interested in hearing about Bilot’s Big Data offering and how the Hortonworks Data Platform can help your organisation, get in touch and let’s talk!
Gartner predicts that by 2018, more than half of large global organisations will be competing using advanced analytics and their own algorithms, causing creative destruction across entire industries.
To succeed, companies must be able to offer their customers not merely a good but an excellent customer experience.
The best personalised encounter needs the support of automation driven by analytics. Why? Because the machine beats the human in many areas:
Automatically collected, relevant background information vs. the questions a customer service agent rattles off during the encounter
Fast conclusions drawn from the entire data warehouse vs. decisions based on one person’s experience
In practice, automated, analytics-driven encounters delight us, for example, when the Uber app tells you which car is coming and when. The app also shows the route and estimates the price. In the same situation, a taxi dispatch centre offers only a booking number, from which I can merely assume the ride will arrive at some point.
The most advanced users of analytics are companies operating in the B2C sector. I claim that the most glaring pain point in advanced analytics for B2B companies is the lack of high-quality data.
But why do you need data if the product is in order?
Let’s illustrate the point with a moment in the space industry. Predictive analytics can be compared to a space rocket. You need a powerful engine, i.e. the analytics software, and rocket fuel, i.e. high-quality data. You can get the pilot and the engine with money and a little training. The most demanding area of development, especially in B2B companies, is in my view the collection of rocket fuel. In our daily encounters, almost every CIO emphasises how much data they have. In our space-rocket context, this data is comparable to heavy fuel oil, on which the rocket will not fly. To turn heavy fuel oil into rocket fuel, it has to be refined, and above all countless additives have to be mixed in. Basic refining is in reasonably good shape even in B2B companies, but of the data comparable to those valuable additives, information about end customers from a sufficient number of channels, they have next to nothing.
What should you do today to be able to fly to Mars in the next financial year? Below is a list of ideas worth acting on today:
Launch a product whose ultimate purpose is to collect valuable data about the customer. Even the world’s best are already doing so. Create a good service portal, or finally kick off that IoT project. Engage in conversations online. Analyse in detail when and how the end user uses your product or service.
Collect and store all the raw data: web servers, social media analytics, digital campaign data. With these you can build real target groups and make genuinely relevant next-best-offer suggestions on your website and during customer visits.
Gather all the data in one place, including customer satisfaction results and brand perception surveys, absolutely everything. Many of the world’s leading companies use Hadoop for this.
Invite everyone to pitch in. Tune your rocket engine and develop your people, and they will tell you which additives you are still missing.
Data and skills are refined through use. Start collecting today and you can fly right away; soon you will find yourself among those causing creative destruction.
During this summer I am already compiling, for a second client, a shortlist of IT vendors that implement data warehouses, master Microsoft’s SSIS as an ETL tool, and can preferably also model with Data Vault.
So the need is for very specific, high-level expertise, preferably backed by certifications.
I know the players in this field well, so it should be an easy job, right?
Nobody wants to be a blue-collar worker
These days, at least according to company websites, nobody builds data warehouses. That’s passé. Everyone seems to be doing performance management, machine intelligence, big data, integration solutions, data-driven management, IoT and digitalisation. Whatever those may mean.
But few customers buy one digitalisation. Or one integration. Or two data-driven-management consultations and one performance management solution.
Even companies that I know extremely well, and that I know to be among the best in the country in data warehousing and, for example, in the use of SSIS, stay completely silent online about these skills.
This makes it very hard for customers to buy expertise. And deals are left unmade.
In these couple of cases I have had to ask Microsoft directly which Finnish IT houses implement, for example, data warehouses on Azure, who has SSIS experts, and who masters technology X.
And it is with these concrete search terms that customers often look for partners. Whatever the doomsday trumpets may claim, data warehouses are still being built at full steam. Big ones, too. And far better, more versatile and smarter ones than a decade ago.
Companies probably think that data warehouse implementation is too much of a commodity and would brand them as blue-collar workers.
IT companies should take a cue from Timo Soini
Everyone seems to want to appear as a higher-level herald, guide and consultant of digitalisation. I confess: so do we.
This high-level mush, however, is exactly that. Mush. Too vague and too lofty to be of any real use. Communication that goes completely to waste.
On the other hand, when a well-known, grease-under-the-fingernails IT firm that churns out data warehouses tries to look like a higher-refinement management consultancy, an Accenture, a McKinsey or whatever, at some point it will get caught. At the latest when the propeller-head “consultant” goes in front of the customer’s executive team to talk about bits and the cloud. They will be thrown out by the scruff of the neck in no time.
IT companies would do well to take a cue from Timo Soini’s use of plain language. Speak. In simple. Sentences. Perhaps companies should popularise the messaging on their websites. Make it plain-spoken.
It would be delightfully refreshing to see an IT company state on the front page of its website: we build data warehouses with Microsoft technology.
I am dead certain that this would bring in more deals than the mumbo jumbo. Perhaps not a jytky, but deals all the same.
P.S. For these couple of cases the shortlists have been created, and the vendors have been or will soon be contacted. But we are coming across more and more of these assignments.
To make our own work easier and to serve our customers better, we are starting to maintain a list of Finnish companies specialising in the various areas of data-driven management. We act as an impartial consultant and find the best implementation partner for the customer.
So if your company specialises in data warehouses, data science, business intelligence, reporting or another area of data-driven management, and you are interested in receiving requests for proposals for large DW/BI/Data Science/IoT projects, drop us a line (email@example.com).
There are technologies galore, and we will not start listing every piece of software and its experts, but we will begin at least with the ones below, which we have received the most enquiries about:
The 2016 Hadoop Summit is almost here! Last year the summit was jam-packed with great sessions and was extremely inspiring. Just as last year, I’ve put some effort into choosing which sessions I’ll be attending in order to get the best experience.
I have a serious problem though – even more than last year. There are so many interesting sessions overlapping each other that I apparently need to be able to clone myself!
Last year, my strategy was to prioritize the sessions and start with the most relevant one. If it wasn’t exactly what I had envisioned, I rushed to the next one. The strategy worked out well most of the time.
The full agenda of the summit, together with all the details, can be found here.
Here is my current (in progress!) shortlist – albeit without prioritization yet:
Are you coming to San Jose this year? What are your hot picks from the agenda?
Follow me on Twitter, as I’ll be tweeting during the Summit week at the end of June.
Hadoop and other open source Big Data projects provide a huge range of IT software for the areas of data management and system integration
In the first part of my blog series I claimed that Hadoop offers a superior range of tools to do amazing things. The summary blog emphasized the marriage of analytics with real-time system integration and messaging. The purpose of this part is to depict what those tools really are and what each is for.
In terms of traditional data warehousing and analytics tools, the process of data loading and transformation in Hadoop is closer to ELT (database-pushdown-centric) than ETL (ETL-server-centric transformations). For data extraction from relational databases we have Sqoop. For sourcing continuously generated log files we commonly use Flume. If the source systems provide transfer files, we drop them into the Hadoop file system (HDFS) and use them directly or indirectly as the “data files” of the Hive “database”. The Hive component provides a database query (SQL + ODBC/JDBC) interface to Hadoop. In practice this means that schema descriptions are defined on top of your data files (for example CSV files) and you can use SQL to query and insert data. The transformation part of data management is done with the Pig Latin language.
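As a sketch of this schema-on-read idea, here is roughly what a Hive table definition over raw CSV files could look like. The table name, columns and HDFS path are invented for illustration, not taken from any real system:

```sql
-- Define a schema on top of CSV files already sitting in an HDFS folder.
-- Nothing is copied or converted; Hive reads the files in place at query time.
CREATE EXTERNAL TABLE web_orders (
  order_id   INT,
  customer   STRING,
  amount     DECIMAL(10,2),
  order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/web_orders';

-- After that, plain SQL works over the raw files:
SELECT customer, SUM(amount) FROM web_orders GROUP BY customer;
```

Dropping new files into the same HDFS folder makes them visible to the table automatically, which is what makes the file-drop loading pattern above so convenient.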
Once you are done with the things you have been doing for years with ETL and data warehouses, Hadoop lets you move into the area of real data science. The Pig Latin language is actually a simplified way to write powerful MapReduce programs. MapReduce is Hadoop’s programming model for deep analysis of huge amounts of data. The drawback of MapReduce is the need to write code, even a bit too much of it for simple things. If you don’t want to bother learning the MapReduce syntax, check out Spark. Spark allows you to do the same things and more with one of the languages you probably already know: SQL, Java, Scala, Python or R. You can even mix them. As the languages mentioned here may indicate, Spark includes statistical libraries for even deeper data science analysis, and graph functionalities for understanding how social media connections, products in market baskets and telecom subscribers form networks. Understanding these networks is important for identifying the real influencers so you can focus your actions on them. Spark also loads the data being processed into memory and does all processing up to 100 times faster than MapReduce or Hive.
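To make the contrast concrete, here is a small pure-Python sketch of the classic word-count job, first in the explicit map/reduce style and then in the condensed style Spark encourages. No cluster is involved; the sample lines are invented and plain Python stands in for the frameworks:

```python
from collections import Counter
from functools import reduce

lines = ["big data is big", "data flows in streams"]

# MapReduce style: an explicit map phase emitting (word, 1) pairs,
# followed by a reduce phase summing the counts per key.
mapped = [(word, 1) for line in lines for word in line.split()]

def reduce_step(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc

counts_mr = reduce(reduce_step, mapped, {})

# The same job the "Spark way" collapses to a couple of chained calls;
# with a real SparkContext it would read roughly:
#   sc.textFile(path).flatMap(str.split).countByValue()
counts_spark_style = Counter(word for line in lines for word in line.split())

assert counts_mr == dict(counts_spark_style)
print(counts_mr["big"])  # 2
```

The point is not the tiny dataset but the shape of the code: the map and reduce phases are spelled out by hand in the first version, while the second expresses the same computation as one pipeline, which is the style Spark’s API makes natural at cluster scale.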
Spark has even more advantages. It can run in streaming mode, meaning that it processes, merges and otherwise enhances and transforms your data as it flows in, and pushes it out to a defined storage or the next real-time process. In traditional terms, it is a component for complex event processing. So we have arrived at the area of traditional Enterprise Application Integration (EAI) and messaging. Another, slightly more traditional Hadoop component for complex event processing is Storm. In system integrations, before complex event processing, you need a way to manage message flows, i.e. a message queue system. For that purpose the Hadoop umbrella has Kafka. Kafka is fed by different streaming sources such as the Flume component mentioned earlier, more traditional EAI tools, or social media sources like Twitter directly.
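The division of labour around a message queue can be sketched in plain Python, with queue.Queue standing in for a Kafka topic. The event strings are invented, and real Kafka client libraries would of course be used in practice:

```python
import queue
import threading

topic = queue.Queue()  # stands in for a Kafka topic

def producer():
    # A source such as Flume or a social media feed pushing events in.
    for event in ["click:home", "click:cart", "purchase:42"]:
        topic.put(event)
    topic.put(None)  # sentinel: no more events

def consumer(out):
    # A stream processor (Spark Streaming or Storm) pulling events out;
    # here it just filters the stream for purchase events.
    while True:
        event = topic.get()
        if event is None:
            break
        if event.startswith("purchase:"):
            out.append(event)

results = []
t = threading.Thread(target=producer)
t.start()
consumer(results)
t.join()
print(results)  # ['purchase:42']
```

The queue decouples the two sides: the producer does not need to know who consumes the events or how fast, which is exactly the role Kafka plays between sources and stream processors.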
How do we then integrate the results of batch analysis or stream processing into other systems, such as online web pages? The Hive and Spark components described earlier offer ODBC/JDBC interfaces for running SQL queries against Hadoop data. The truth, however, is that these are not capable of sub-second response times. The response times are good enough for analytical visualization clients like Tableau, MS Power BI or SAP Lumira, but not for online web shops or many mobile applications. For fast queries serving a massive audience, Hadoop offers the HBase NoSQL database component. Queries against HBase can be issued either through built-in APIs or via SQL/JDBC using the Phoenix extension. Development continues actively; as an example, there is a project underway to enable full Spark API access from outside Hadoop. More about these in coming blogs.
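The reason HBase can serve a massive audience quickly is that a read goes straight to a row key instead of scanning data files. A toy Python sketch of that access pattern, with invented row keys and column names:

```python
# A toy stand-in for an HBase table: a map from row key to
# column-family data. Real HBase keeps sorted regions of such a map
# distributed across servers, but the access pattern is the same.
table = {
    "customer#1001": {"info:name": "Alice", "orders:last": "2016-05-01"},
    "customer#1002": {"info:name": "Bob",   "orders:last": "2016-05-03"},
}

# An online web shop page does a direct get by row key: no scan,
# which is why response times stay well under a second.
row = table["customer#1002"]
print(row["info:name"])  # Bob
```

Contrast this with the Hive/Spark SQL path, which typically reads through large portions of the underlying files per query; that is fine for an analyst’s dashboard but not for a page load.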
Although the Hadoop stack does not have a traditional Enterprise Service Bus (ESB), the critical parts of system integration and advanced analytics have married together, and your imagination is the only limit to the kind of services you can provide. I would recommend spending a few minutes imagining how these tools run behind LinkedIn, Facebook, Uber, the biggest web shops and so on.
And all of this you can have on your laptop, for free and forever, even for production use, as long as you have eight gigabytes of memory to run it. It may all sound overwhelming, but you do not need to use and master every component, just as you would never think of buying every item in an IKEA store.
Hadoop is a software package at a price so low that almost every company can already afford it
In the first part of my blog series I claimed that everyone can afford Big Data tools. The summary blog emphasized the license cost aspect and showed how it simply does not exist. As mentioned then, this is not the whole truth. At a high level, the Total Cost of Ownership (TCO) includes the total cost of acquisition, the operating costs, and the costs related to replacement or upgrades at the end of the life cycle.
A speciality of the Hadoop system is that the acquisition costs are significantly lower than for traditional commercial software, as there is no license cost. For developing an actual application to run on the Hadoop platform, you will most probably need external help, even though companies using Hadoop tend to have quite a “do it yourself” attitude. And naturally the cost of internal resources is a cost too. In general, the hourly rate of an external project resource for developing a Hadoop application is comparable to that of any other enterprise-level technology.
So, what kind of operating costs should you consider? In practice, you just pay for server hosting, plus a maintenance/support subscription if you want to guarantee the continuity of the service you provide. I do not have any statistics, but hosting costs are often lower than with traditional software, especially if the commercial software requires vendor-specific appliances. A very natural place for a Hadoop cluster is a public cloud like Azure or AWS, which are very reasonably priced. On-premise hosting by a local provider costs, for a relatively small cluster (~10 servers), a few thousand euros per month, which is very reasonable. Costs naturally increase roughly linearly as you add servers, but then you can also expect a significant payback on the investment.
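The back-of-the-envelope arithmetic above, a roughly linear hosting cost as the cluster grows, can be written down as a tiny model. The per-server figure below is an assumed placeholder for illustration, not an actual provider price:

```python
def monthly_hosting_cost(servers, eur_per_server=300):
    """Roughly linear hosting cost model; eur_per_server is an
    assumed placeholder figure, not a quote from any provider."""
    return servers * eur_per_server

# A relatively small 10-server cluster lands in the "few thousand
# euros per month" range mentioned above.
print(monthly_hosting_cost(10))  # 3000

# Tripling the cluster roughly triples the cost; the bet is that the
# payback from the larger cluster grows at least as fast.
print(monthly_hosting_cost(30))  # 9000
```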
As you move into production use, depending on the business criticality of your application, you will most probably want to purchase a maintenance/support subscription. For example, Hortonworks sells different SLA levels at a very reasonable price compared to traditional commercial software. The company gets its revenue only from these services, mainly from the subscriptions. Even so, “Hortonworks is expected to become the fastest growing software company — reaching $100 million in annual revenue in just four years from inception”. This describes the value they add and the global success of Hadoop.
In summary, implementing Hadoop and other open source software is not free, but the TCO is very competitive. So competitive, in fact, that almost every company can already afford it. Just download the software from the Hortonworks homepage, or launch your first cluster with an Azure trial account for free.
Next week, in part three, I will talk about how Hadoop, together with other open source Big Data projects, provides a huge range of IT software for the areas of data management and system integration.
IKEA is a huge success story and has been a clear game changer in the furnishing industry. I am convinced that Hadoop and its related open source data management tools will do the same for the IT industry. What we have seen so far are just the early stages of a huge disruption we will witness in the coming years.
My faith is based on the fact that this ecosystem has the same advantages as IKEA has in its business idea, which is: “to offer a wide range of home furnishings with good design and function at prices so low that as many people as possible will be able to afford them.”
This business idea has three key components which can be translated to the IT world in the following ways.
“As many people as possible will be able to afford them” ➜ “Hadoop is a software package at such a low price that almost every company is able to afford it already”
IKEA’s idea says “…products that are affordable to the many people, not just the few”. The average company in either the furnishing or the IT industry tries to push its sales prices as high as market competition allows. This leads to a situation where the number of features is maximized and absolute prices are quite high. IKEA’s approach is just the opposite. Features are optimized for most people and the whole value chain is geared towards lowering costs. Finally, the sales price is pushed as low as possible, regardless of the competitive situation.
In the case of Hadoop and other open source products, the situation is even better. There is no party trying to maximize the sales margin on the initial purchase; these products do not even have sales prices. Companies like Hortonworks, which actively develop their own distribution of Hadoop, do not ask a price for the product itself: you can download the software from their website at any time, and there are no license restrictions forbidding its use in production environments. Their business model is to charge only for the services they provide. In the same sense, it would be like IKEA giving away its furniture for free and charging only for home delivery, assembly and warranty. As with IKEA furniture, Hadoop tools expect a more “Do-It-Yourself” attitude, but at the same time the cost savings can be huge.
Naturally license costs are not the whole truth. In part two of this blog series, I will go through how you should consider things like hosting, maintenance and skill costs.
“A wide range of home furnishings” ➜ “Hadoop and other open source Big Data projects provide a huge range of IT software for areas of data management and system integration”
When you step into an IKEA store, you immediately understand that the offering is wider than in any other chain – it is huge. Similarly, the products under Hadoop’s umbrella cover all major aspects of traditional data warehousing and system integration. So you have a tool for almost every purpose, and blueprints for how to make them communicate with each other.
Even more important than the mere availability of the tools is the way it changes your thinking. As people start to master both the real-time system integration aspects and analytics, they start to realize that building automated decision-making processes is a realistic goal. Automated decision making and real-time information are changing all industries at the moment. For example, at Uber there is no one matching cars and passengers or informing both parties in real time where the car is at each moment. Additionally, their system even makes predictions about costs and arrival times. And this process runs simultaneously for millions of people.
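As a flavour of what such an automated matching decision looks like in code, here is a minimal Python sketch. The coordinates and driver names are invented, and a real dispatch system would of course use road-network travel times, pricing models and massive scale:

```python
import math

# Invented positions (latitude, longitude) for a toy dispatch decision.
drivers = {"car_a": (60.17, 24.94), "car_b": (60.21, 24.99)}
passenger = (60.18, 24.95)

def nearest_driver(passenger_pos, drivers):
    # Pick the driver with the smallest straight-line distance;
    # a production system would predict actual travel times instead.
    return min(drivers, key=lambda d: math.dist(drivers[d], passenger_pos))

print(nearest_driver(passenger, drivers))  # car_a
```

The decision itself is trivial; the hard part, and the reason the Big Data stack exists, is running millions of such decisions per minute against continuously updating position streams.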
In part three of this blog series, I will talk about some of the Hadoop and Big Data software components available and how they can do similar things for you as they are doing for Uber.
“Home furnishings with good design and function” ➜ “Hadoop tools are designed to solve issues impossible for traditional commercial tools”
For decades, IT was too expensive for everyone. Roughly ten years ago, internet startups started using low-cost and open source software to build their services. These open source tools pretty much copied the functionalities of the commercial tools; at a high level, the main difference was the economics behind them. Suddenly, you didn’t need significant funding to set up an IT company and become a millionaire.
In the shadows, these new internet companies grew faster than anyone realized. They quickly ran into overwhelming technological challenges that traditional enterprises and software vendors had never needed to tackle. The newcomers had to invent a way to provide an almost 100 % automated service, 24/7/365, for millions of users. If you have ever operated IT systems, you know that 99 % of CIOs would see these kinds of requirements as totally impossible for their software landscape and infrastructure.
Some of the startups understood that, in order to scale and survive, they needed to build their own software starting from the file system, database and system integration level. As a result, we have companies like Facebook, Yahoo, LinkedIn and so on. As this software development is not their core business, they soon started to share their work with others for free. In many ways this era of software is superior to the traditional one, but of course it also has some limitations.
In part four of this blog series, I will talk about the more detailed requirements Big Data tools were developed for, how they solve these issues, and what kinds of compromises you need to accept.
A holistic view of your customer behavior is something that all companies strive for. The 360-degree customer view, although sometimes considered unattainable, is something that can be achieved with the right platform.
Whether you run a webshop or an old-fashioned brick-and-mortar business, it’s paramount that you’re able to easily use all the data you collect from your customers effectively.
This includes a plethora of different sources, ranging from generic row-and-column databases to more complex, unstructured social media data — basically, everything we have come to call ‘Big Data’.
One data source that sits in the middle of this spectrum in terms of complexity is customer satisfaction and survey data. The challenge with this type of data is that in many cases there’s a serious problem with data quality. Survey data is produced by third-party companies, and in many cases the result files are created manually. This produces a large volume of data files, which can be a problem for traditional batch-process BI systems or databases.
With the power of Hadoop, we can simplify this a whole lot!
In order to make sense of the survey data, we need a place to store and process it. In a traditional BI system, we would have to set up a job to pull these files from a file server. We also want to give our partners (who collect the data) the ability to log into an interface and upload the survey results instead of emailing them.
Here’s a high level picture of the architecture. In this example, we’re concentrating on the text and survey data, but in a real-life example we would also be interested in social media and clickstream data in order to fully understand how our customers are behaving.
Here’s where Hadoop’s file system (HDFS) comes into its own. As HDFS is a schemaless distributed file system, we don’t have to worry about definitions when loading data. It’s similar to a normal computer’s file system, where files are stored in a folder structure.
Introducing the Hadoop User Experience!
Leveraging the Hadoop User Experience, or HUE for short, we are able to give our partners access to our Hadoop cluster through an intuitive user interface. Partners can log into HUE and upload their data without any scripting. This eliminates the need to involve IT, and allows end users to upload multiple files (even compressed ones) into the system, speeding up data acquisition.
After a partner has loaded the data into HUE, we can map the data to form a view, which we can then use later. Mapping the data essentially means defining what the files contain and building external views that we can connect to from a third-party application, such as Tableau.
As new data is introduced into the system, the sources (views) are updated on the fly, as we are reading folders instead of individual files. Very clever indeed.
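The folder-as-a-source idea is easy to demonstrate locally: the “view” is simply whatever files are currently in the folder, so a new upload shows up in the next query with no reload step. A small Python sketch, with invented file names and columns:

```python
import csv
import glob
import os
import tempfile

def read_folder(folder):
    """Read every CSV in the folder as one dataset, the way a Hive
    external view reads every file under its HDFS location."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

folder = tempfile.mkdtemp()

# Partner 1 uploads a survey batch...
with open(os.path.join(folder, "survey_batch1.csv"), "w") as f:
    f.write("respondent,score\nr1,9\nr2,7\n")

print(len(read_folder(folder)))  # 2

# ...and when partner 2 drops in another file, the "view" is
# immediately up to date with no separate load job.
with open(os.path.join(folder, "survey_batch2.csv"), "w") as f:
    f.write("respondent,score\nr3,8\n")

print(len(read_folder(folder)))  # 3
```

On HDFS the same pattern holds at scale: the uploaded files land in a folder, and the view defined over that folder picks them up on the next query.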
Setting up a Big Data ecosystem like Hadoop with HUE doesn’t have to be difficult. And with cloud platforms like Microsoft Azure it’s even easier than before!
Hadoop is the new black, no doubt about it. The Hadoop ecosystem is taking its place as a standard component in enterprise architectures, side by side with ERPs and other standard components. Hadoop enables a modern data architecture and the ability to create analytical applications on top of enterprise data.
It is no secret: Hadoop is not easy! The good thing is that as it matures, we are seeing more and more automated, out-of-the-box solutions for data transfer, analytics, administration, operations – and deployment.
Deploying and configuring a Hadoop cluster can be a very complex task when done and optimized correctly. Thankfully, there are today several tools and platforms that can make this process significantly easier, reduce risks, and cut down the number of components you have to maintain yourself. I will introduce just a few options that I have tried out myself.
Azure Marketplace and Hortonworks Data Platform (HDP)
I was already very surprised by Cloudbreak’s capabilities when trying it out earlier last year, but Azure surprised me even more, even though the idea and the functionality are not exactly the same. Launching a standardized Hadoop environment for test or pilot purposes has never been this easy – at least for me! In theory it is just as easy for production use, but you do need to plan your architecture a bit more to match your use cases, even though this is a highly automated IaaS/PaaS type of cloud environment.
“The Hortonworks and Microsoft relationship has enabled a seamless implementation of Apache Hadoop on Azure.”
What I really needed was just an Azure account and an ssh-rsa key, which I got in 5 minutes. New Azure users even get a 30-day free trial with some funny-money budget, which is enough to deploy a 5-node HDP cluster in Azure with 8 TB of disk per node. The deployment process itself is fully automated. Obviously you will want to select the subscription, the number and size of nodes and so on when doing anything more than evaluating the product and the process.
Psst. There are also other options than Hortonworks HDP available in the Azure Marketplace – but my choice is HDP, as it is THE Hadoop platform and the only one committed to being 100 % open.
Hortonworks acquired SequenceIQ in 2015, and with it Cloudbreak: technology that can automate the Hadoop cluster deployment process to public or private cloud environments. It has nice features such as policy-based auto-scaling on the major cloud platforms, including Microsoft Azure, Amazon Web Services, Google Cloud Platform and OpenStack, as well as platforms that support Docker containers.
I had the chance to see Cloudbreak live in action at the Hadoop Summit in June 2015. To be honest, it looked too good to be true. I got to try Cloudbreak hands-on at a Hortonworks Masterclass in Stockholm later in 2015, and it is even easier than it sounds! The main requirement is really just a Cloudbreak installation: you pick a blueprint, choose a cloud and deploy! Now this baby is a permanent part of our own on-premises solutions. I warmly recommend everyone to try it out!
If you want to hear more about HDP Hadoop, Modern Data Architecture and the Azure Marketplace, you may like these blog posts: