Thursday, June 20, 2019

Quantum Computing and HPC

Another scintillating and insightful episode of RFHPC is about Quantum Computing and HPC and how the two spaces are evolving and cooperating.

We welcome a distinguished guest with a most suitable background to talk to us about HPC and Quantum Computing: Mike Booth, who has been in supercomputing since 1979, including stints at Cray through 2000, where he ran the Software and Applications division, and a later role as GM of the network storage division at StorageTek. He got into Quantum Computing when he joined D-Wave, and he had just accepted the role of CTO at Quantum Computing, Inc. when we recorded this show.

We discuss and touch on how Quantum Computing and HPC interface: analog vs. digital, qubits, magnets, resistors, connectors, cryogenics, algorithms, languages, huge search spaces, NP-complete problems, quadratic unconstrained binary optimization (QUBO), Tabu search, and more. They are two different games right now, but they touch two sides of the big problems that represent grand challenges. Because QC is an accelerator, it fits nicely with how a lot of HPC is being done today.
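
For those who haven't run into QUBO before, here is a minimal sketch, in plain Python with made-up coefficients and no particular solver library assumed, of the kind of objective a quantum annealer, or a classical heuristic like Tabu search, is asked to minimize: choose binary values for the variables so the quadratic cost is as low as possible.

```python
# Toy QUBO: minimize x^T Q x over binary vectors x.
# The coefficients below are invented purely for illustration.
import itertools

Q = {
    (0, 0): -1.0, (1, 1): -1.0, (2, 2): -2.0,   # linear terms (diagonal)
    (0, 1): 2.0, (1, 2): 0.5,                   # pairwise couplings
}

def qubo_energy(x, Q):
    """Energy of a binary assignment x under the QUBO matrix Q."""
    return sum(coeff * x[i] * x[j] for (i, j), coeff in Q.items())

# Brute force is fine for 3 variables; real problems have thousands,
# which is where annealers and heuristics like Tabu search come in.
best = min(itertools.product([0, 1], repeat=3), key=lambda x: qubo_energy(x, Q))
print(best, qubo_energy(best, Q))   # -> (1, 0, 1) -3.0
```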

We’re going to have to bring Mike back and we look forward to that.

Exascale at Oak Ridge

Mike happens to be in Tennessee, and the episode was recorded the week the new exascale system at Oak Ridge was announced, so the team discusses that too. That was quite a significant day for US science, and a second big win for Cray, this time with AMD. It's one of the few large systems that is not based on Intel or Nvidia technologies, and it was described as:
  • 100 Cray Shasta cabinets
  • 40 MW power
  • More than 1 million lbs weight
  • 7,300 square feet
  • 90 miles of cabling
  • 5,900 gallons of water per minute for cooling
We don't remember who exactly had a hard stop, but no time for Catch of the Week this week, which some of you would be pleased to hear!

Give it a listen (and take good notes!)

Download the MP3 * Subscribe on iTunes * RSS Feed

Sign up for our insideHPC Newsletter

Monday, June 17, 2019

TOP500 June 2019, Facebook Coin

The new TOP500 list of most powerful supercomputers is out and we do our usual quick analysis. Not much changed in the TOP10 but a lot is changing further down the list. Here is a quick take:
  • There are 65 new entries in 2019.
  • US science is receiving support via DOE sites and academic sites like TACC.
  • 26 countries are represented. China continues to widen its lead, now with 219 entries, followed by the US with 116, Japan with 29, France with 19, the UK with 18, Germany with 14, Ireland and the Netherlands with 13 each, and Singapore with 10.
  • Vendors substantially reflect the country standings. Lenovo has 175 entries, Inspur 71, and Sugon 63, all Chinese vendors. Cray has 42 and HPE 40 (which will combine when their deal closes), followed by Bull at 21, Dell at 17, and IBM at 16.
  • There are a lot of "accidental supercomputers" on the list. These are systems that probably are not doing much science or AI work, but they could, and the vendors counted them; it seems to be within the rules to list them. It's controversial but not a new practice.
  • There are several systems listed as belonging to "Internet" companies. It's hard to tell what that means, but it points to the existence of very large clusters in the cloud for whatever purpose. Last year, there was one system listed as Amazon EC2, which remains on the list. This time, there is also one at Facebook. Usually the big social/cloud players don't care to participate, though they obviously could summon the resources to run the benchmarks.
  • Just over half of the systems use Ethernet as the fabric. A quarter use InfiniBand, nearly 50 use Intel's OmniPath, and the rest, 55, use custom interconnects like the ones Cray provides. The team talks about whether Cray+HPE will enter the interconnect business for real; if so, they will be formidable.
  • The majority of entries, 367, do not have any accelerators. 125 use Nvidia GPUs.
  • The overwhelming majority of the systems, 478 of them, are based on Intel CPUs. 13 are IBM, and there is 1 system based on Arm provided by Cavium, now part of Marvell.
  • So when it comes to chips, it's an Intel game, with a respectable showing by Nvidia when GPUs are used. Alternatives are bound to appear as the tens and tens of AI chips in the works become available and as Arm, AMD, and IBM build on their positions. The recently announced system at Oak Ridge will be all AMD, and that will point to an alternative as well.
  • Notably, Intel is listed as the vendor for 2 entries and Nvidia is listed for 4. While Intel has stayed largely away from looking like a system vendor, Nvidia is going for it with its usual alacrity. That, and the pending acquisition of Mellanox by Nvidia, should serve as a warning to all system vendors who might feel stuck between treating Nvidia as an important supplier and as an up-and-coming competitor.

CryptoSuper500

Shahin mentions the 2nd edition of the CryptoSuper500 list (really 50 entries for now), developed by his colleague Dr. Stephen Perrenod, which was launched last November and is being released at the same time as the TOP500. The TOP500 has spawned variations that look at different workloads and attributes, for example the Green500, Graph500, and IO500 lists; CryptoSuper500 was inspired by those. The material for the inaugural edition of the CryptoSuper500 list is here.
Cryptocurrency mining operations are often pooled and are very much supercomputing class, typically using accelerator technologies such as custom ASICs, FPGAs, or GPUs. Bitcoin is the most notable of such currencies. Scroll down for the top-10 list and see the slides for the full list and the methodology.

Catch of the Week


Henry:

Henry talks about checkout lanes at Target all being down for unknown reasons, though he hesitates to call it a cybersecurity breach. It turns out he was right, and the company blamed an "internal technology issue".

Target down (then back up) as cash registers fail and leave long lines

Target's payment systems appeared to be missing the mark the day before Father's Day, as terminals went AWOL for a couple of hours in a number of the company's US retail outlets. The outage caused long lines but prompted an encouraging show of sympathy for Target employees from people on Twitter. And there were some jokes too, of course.

Shahin:

Facebook is expected to release a new cryptocurrency that is already impacting the crypto market.

Here’s what we know so far about the secretive Facebook coin

Facebook is likely to release information about its secretive cryptocurrency project, codenamed Libra, as soon as June 18, TechCrunch reports.
As is traditional with new cryptocurrencies, the social networking giant is expected to release a so-called “white paper” outlining how the currency works and the company’s plans for it.

Dan:

Dan reminds us all of the inimitable Erich Anton Paul von Däniken and his ancient astronauts hypotheses!

Listen in to hear the full conversation.

Download the MP3 * Subscribe on iTunes * RSS Feed

Sign up for our insideHPC Newsletter

Sunday, June 9, 2019

Forty+ different AI chips

What are we going to do with 40+ different AI chips?

This week, the team looks at AI chips again, this time motivated by an article in EE Times about one such chip, from Graphcore, which it touts as "the most complex processor" ever at some 20 billion transistors. The VC-backed company out of Bristol, UK is also valued on paper at $1.7b, gaining it the coveted "unicorn" status, apparently the "only western semiconductor unicorn".

This being one of 40+ such AI chips (and that may be conservative), the odds of success are tough and the task formidable. But even if only 2 or 3 of such chips become successful, that's already a significant disruption to the market.

The Graphcore chip is 16nm, 1.6GHz, and comes in a PCIe card at 300W. You can stack 8 of these in a 4U chassis, so 2.4 kW just for those.

After a mini-rant about respected publications succumbing to clickbait, the team talks about how cooling will be an issue and calls again for more clarity in performance metrics, since the chip is rated at 125 TFlops but we don't know at what precision. Shahin reminds the team of his suggestion to clarify things by including precision in the metric: DFlops for double precision, then S for single, H for half, and Q for quarter precision.
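
As a small illustration of that suggested convention (the naming is just Shahin's proposal as described above, not an established standard), here is what tagging a rate with its precision might look like:

```python
# Hypothetical helper illustrating the proposed precision-suffixed metric.
# D = double (64-bit), S = single (32-bit), H = half (16-bit), Q = quarter (8-bit).
PRECISION_LETTER = {64: "D", 32: "S", 16: "H", 8: "Q"}

def label_rate(tflops, bits):
    """Format a rate so the precision travels with the number, e.g. '125 THFlops'."""
    return f"{tflops} T{PRECISION_LETTER[bits]}Flops"

# Graphcore's quoted 125 TFlops, spelled out for each precision it might refer to:
for bits in (64, 32, 16, 8):
    print(label_rate(125, bits))
```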

Henry talks about how hard it is to build and test complex software like this, despite Shahin's view that the modern software stack is now so tall that the chip need only be concerned with a couple of layers, that codes are new and open to being recompiled, that the ecosystem is increasingly open source, that cloud providers and large customers have the wherewithal to do the job, and that traditional HPC customers have the willingness to do the work if the performance enhancements are there.

No "Catch of the Week" this time since Henry had a hard stop. We're used to it!

Listen in to hear the full conversation.

Download the MP3 * Subscribe on iTunes * RSS Feed

Sign up for our insideHPC Newsletter

Sunday, June 2, 2019

Amdahl's Law and GPUs, Asian Student Cluster Competition

Results of the Asian Student Cluster Competition

In this episode, Dan has just come back from China and reviews the results of the Asian Student Cluster Competition and HPC workshop.
For the first time, a non-mainland-Chinese team wins the top spot. Taiwan takes the gold, in part thanks to a stellar performance on the HPCG benchmark, where they achieved 2 TFlops, some 25% better than the 2nd-best team. The system was a 5-node cluster with an InfiniBand FDR interconnect. Other interesting info is shared on various codes and configurations.

GPUs and Amdahl's Law

Dan also mentions that reports from some of the TOP500 sites suggest that GPUs are doing 93-97% of the computation. This sounds very impressive, but Shahin points out that since GPUs have hundreds of cores, they should be doing much better: 93-97% is in fact not as good as it should be at that scale of system and problem size. He is still waiting for some actual utilization data on GPUs too.
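
A back-of-the-envelope Amdahl's Law reading of that point, assuming the 93-97% figure can be treated as the fraction of work the GPUs handle (which may not be exactly how the sites measured it), shows why Shahin isn't impressed: the remaining 3-7% caps the overall speedup at roughly 14x to 33x no matter how fast the GPUs are.

```python
# Amdahl's Law: overall speedup when a fraction p of the work is
# accelerated by a factor s, and the remaining (1 - p) is not.
def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

for p in (0.93, 0.97):
    cap = 1.0 / (1.0 - p)   # limit as the accelerator gets infinitely fast
    print(f"GPU fraction {p:.0%}: cap {cap:.1f}x, with a 100x accelerator {amdahl_speedup(p, 100):.1f}x")
```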

Catch of the Week


Henry:

Henry points out that many security cameras, offered under several brands but all manufactured by the same vendor in China, have big-time vulnerabilities, so he's staying away from all of them until further notice. Shahin wonders why they are called "security" cameras!

P2P Weakness Exposes Millions of IoT Devices

A peer-to-peer (P2P) communications technology built into millions of security cameras and other consumer electronics includes several critical security flaws that expose the devices to eavesdropping, credential theft and remote compromise, new research has found.

Shahin:

Shahin talks about Jaguar Land Rover planning to offer a cryptocurrency wallet to reward drivers who participate in providing traffic and other types of data. He likes their catchphrase: zero emissions, zero accidents, zero congestion.
Drivers will be able to earn cryptocurrency and make payments on the move using innovative connected car services being tested by Jaguar Land Rover.

Dan:

Dan laments the confiscation of his external camera battery at an airport in China because the spec label was a little worn off and the authorities could not read it to ascertain its safety, despite his willingness to get a note from the airline, etc. It was a nice, expensive battery, but at the size of a medium-sized paperback book, maybe following the rules strictly is not a bad idea.
Listen in to hear the full conversation.

Download the MP3 * Subscribe on iTunes * RSS Feed

Sign up for our insideHPC Newsletter