Requesting suggestions for languages, libraries, and architectures for parallel (and sometimes non-parallel) numerical and scientific computations

  • #1
manueljenkin
TL;DR Summary
Looking for guidance on parallel programming languages for numerical computing with fewer distractions and better access to performance-improvement customizations. Speed of execution and energy efficiency are high priorities, and it would be a bonus if it could handle asynchronous tasks, like real-time sampling from sensors through buffers. Additionally interested in learning more about compilers and architectures for high-performance computing.
A lot of the work I am interested in will be built mostly from scratch by myself, provided there is fair support for numerical types (like complex numbers) and high-precision numerical operations (if not, I'll be happy to write those routines as well). Many of my areas of interest are computationally demanding (Python code chokes for large enough datasets) but are often parallelizable, and I am looking for guidance on implementing them. I love math and physics, especially domains that involve rigorous analysis, ranging from physical/mathematical concepts like turbulence, topology, wave optics, electromagnetism and quantum physics to computational concepts like cryptography and information theory. I also love signal processing, especially relating to random and sparse signals. These require a decent amount of precision while simultaneously being fast enough.

I wish to be able to run the code on low-power and high-power manycore or SIMD processors, with the sequential parts run on a general-purpose processor or a highly pipelined FPGA. Energy efficiency is one of my key targets along with speed (many scenarios are energy constrained), even if that requires longer and more customized code. Another area of interest, while not my primary goal, is to implement redundancy using parallelism (including different compression/storage methods, e.g. RAID). I would like to have some control over the different memory allocations (hierarchies of caches and scratchpad memories) and, if possible, some of the caching schemes, while still being usable across multiple architectures. If possible, I would also like options to optimize for bursts and broadcasting, and for the presence or absence of hardware lockstep, depending upon hardware support (use of switches to select different routines for different hardware when it comes to memory copy and allocation, basically caching).

Sorry for the long and open-ended question; I realized it would be hard to come to a decision without getting a holistic picture of the whole domain, at least to a fair level of depth. I am looking for suggestions in both hardware and software.

My primary concern is software - including but not limited to languages, compilers and directives. Not being from a software programming background, I find it hard to search for the proper areas (and keywords). I would like to share my current understanding of the scenario in terms of software. I have so far explored CUDA, OpenCL, Fortran, OpenMP, Open MPI, OpenACC, Julia, Scala, Vulkan, Chapel and X10, and additionally languages like Parallel Haskell, SequenceL, OpenHMPP and SHMEM, and have also explored forums for similar topics. While not of relevance to my use case, I also looked into the Wikipedia pages for languages like Simula and Smalltalk. I would prefer an open-source language, but I find the Wolfram Language very interesting as I also have an interest in symbolic computation in general, like being able to evaluate integrals with the points being variable (for complex or multi-dimensional integrals, even the path being variable). I am familiar with Matlab, and I like its libraries for numerical precision optimization (NumPy can accumulate errors and show a nonzero value where it should be zero), but I am not into it much these days due to limited performance and platform support. Sharing some relevant links:

https://bisqwit.iki.fi/story/howto/openmp/

https://www.openmp.org/wp-content/uploads/omp-hands-on-SC08.pdf

https://www.lei.chat/posts/what-is-vulkan-compute/

http://jhenriques.net/vulkan_shaders.html



My understanding of anything relating to software is somewhat naive, since my personal work profile and courses were not in this area, and most of what I write below comes from personal exploration. Please feel free to correct any errors in the post. Vulkan is one framework I find very nice in that it gives me access to different memory hierarchies, but I am worried about support and fragmentation (OpenCL has multiple variants), especially for low-power hardware and for architectures outside of GPUs. It is promoted as an API for heterogeneous use cases, but I am not feeling very confident in it. While not a deciding factor, I also happen to prefer indexing from 1 instead of 0, and I find Fortran (with parallel programming support) and Matlab quite nice for that reason.

My requirements for the language are somewhat niche and specific: having some control over the flow and execution of the data, while also being abstract enough to support a decent number of architectures and devices. By a decent number of devices, I do not mean every device under the sun. I am not looking to target very old processors, single-core processors, or low-power microcontrollers, but rather SIMD/SIMT-based devices to a decent extent (say 10 cores or more), examples of which I mention below. The language should be easy to manage and use by someone who is not a computer scientist, while also giving me access to parallel processing optimizations, including some control over memory hierarchies and maybe even caching procedures.

For hardware, one of the best matches I was able to find is the Adapteva Parallella, which matches my requirements well (parallel processors for fairly quick computation, connected to an FPGA through a network for quick decision-making logic). Unfortunately it suffers from availability issues and the limitations of being a 2015 design (especially the FP32-only ALU, which to be honest is also a problem with most current GPUs), and the project was discontinued in 2017. I wish to target low-power devices that can still do decent computation, like the GAPuino GAP8, with enough cores or SIMT support. The hardware that currently interests me:

1. Single-board manycore processors like the Adapteva Parallella, the GAPuino GAP8 (PULP cores), and projects like OpenPiton. I could also include the NXP Layerscape LX2162A.

2. Server manycore processors like Cavium ThunderX, AMD EPYC, Kalray MPPA, and AMD and NVIDIA GPUs.

3. Unsure about other accelerators like Untether AI (in-memory computing), Myriad X, and GPU architectures for embedded devices like Mali, etc.

Relevant Links:

http://mkaczanowski.com/parallella-part-1-case-study/

https://github.com/adapteva

https://github.com/parallella

http://parallel.princeton.edu/papers/openpiton-asplos16.pdf

http://parallel.princeton.edu/openpiton/

https://pulp-platform.org/

https://pulp-platform.org/hero.html

https://www.anandtech.com/show/1647/3

https://github.com/PrincetonUniversity/openpiton

https://streamhpc.com/blog/2016-06-...rid-processors-parallella-kalray-and-knupath/

https://fuse.wikichip.org/news/4911...s-massively-multi-core-risc-v-approach-to-ai/

https://en.wikipedia.org/wiki/Cell_(microprocessor)

https://fuse.wikichip.org/news/3217/a-look-at-celeritys-second-gen-496-core-risc-v-mesh-noc/

https://www.phoronix.com/scan.php?page=article&item=cavium-thunderx-96core&num=2

The code will be scaled to fit the smaller processors by reducing precision (FP64 to FP32, or int64 to int32, and also by scaling the overall resolution, with the optimal truncation functions obtained from the high-precision computation). I am eager to learn more about other options in the above categories. I am interested in learning more about the different memory architectures (NUMA, PGAS, memory coherence, distributed memory, etc.) and the different overall ALU and system architectures (pipelined, SIMD, manycore, SIMT, architectures sharing stack pointers across multiple execution units if any, CAPP, MPSoC, grid computing, etc.), and I would also love suggestions for other categories of processors for high-precision numerical computation. More relevant links for memory hierarchy optimization:

https://cnx.org/contents/gtg1AzdI@7...d-Global-Address-Space-PGAS-Programming-Model

https://www.cct.lsu.edu/~korobkin/tmp/SC10/tutorials/docs/M11/M11_Part1.pdf

https://cug.org/5-publications/proceedings_attendee_lists/1997CD/S95PROC/303_308.PDF ( Shared Memory Access (SHMEM) Routines )

To my current knowledge, OpenMP + C is the most widely supported parallel programming model, but I don't think it gives me enough optimization possibilities, especially in terms of cache hierarchies and heterogeneous computing. Both C and Rust seem interesting but feel too general purpose for me. I find the combination of Fortran with directives like OpenMP and OpenACC (the latter more preferred due to the ability to target SIMD types specifically) to be interesting. My primary concern with Fortran is support for different hardware, especially low-power devices (support for compiling Fortran to embedded-system code). I seldom find any low-power hardware explicitly mentioning support for Fortran or OpenACC. Would it be possible to compile it to a supported format for these processors? Directives are interesting to me since it is possible that, at a later point in time, the functions could be natively supported by the language. Another area I would be interested to know about is support and optimization for bitwise operations, since there are scenarios where I might look to pass data or instruction comparisons by means of a single variable that can be decomposed later, for efficient usage of cache memory (a small packing sketch follows the link below).

https://www.nextplatform.com/2019/01/16/burying-the-openmp-versus-openacc-hatchet/
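On the bit-packing idea, here is a minimal plain-C sketch (the same code works as CUDA host code, or as device functions with __device__ added); the field widths and the names pack/unpack are arbitrary choices for illustration, not from any particular library:

/* Hypothetical layout: bits 0-11 = index a, bits 12-23 = index b, bit 24 = flag. */
unsigned int pack(unsigned int a, unsigned int b, unsigned int flag)
{
    return (a & 0xFFFu) | ((b & 0xFFFu) << 12) | ((flag & 1u) << 24);
}

void unpack(unsigned int word, unsigned int *a, unsigned int *b, unsigned int *flag)
{
    *a    =  word        & 0xFFFu;   /* low 12 bits            */
    *b    = (word >> 12) & 0xFFFu;   /* next 12 bits           */
    *flag = (word >> 24) & 1u;       /* single flag bit        */
}

Most compilers reduce these shifts and masks to a handful of register operations, so the decoding cost is usually negligible compared with the memory traffic saved.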









It helps to mention what I do not want in a language, since that is something I am clearer on. Specifically, I find the non-scientific-computing design choices in certain languages uncomfortable to navigate. I do not want a Python-like language - it is too implicit, with a thousand ways to do the same thing, thousands of libraries to do the same thing, and it is hard to find specific solutions (while there is a lot of help online, both in manuals and forums, I find much of it unspecific and quite convoluted, with examples always dealing with more than one problem at once, especially in the documentation for libraries like PyOpenCL). I am not a fan of too many dependencies and patchworks (referring to C libraries inside the Python program), being someone who finds it hard to use package managers and version control. I did not enjoy using Anaconda package management; I couldn't figure out how to download and install packages locally, and even then it tries to override things. On one occasion it even managed to completely break my Windows installation and recovery images (I don't quite know how it did that), and I couldn't even use the recovery image; I had to reinstall from scratch. And there are a thousand ways to do the same thing here as well - pip, conda, etc. - again super confusing.

I find the NumPy syntax and indexing confusing (it doesn't feel intuitive, and I have made many mistakes that took a long time to correct), the structure of for loops confusing (stemming from the fact that they are meant to traverse a list instead of indexes, so I have to reverse this every time in my mind), the usage of curly and square braces confusing (I have encountered many errors due to this), and the indentation-based loop structures confusing as well. Even after nearly half a year of using Python, I haven't been able to get accustomed to thinking "the Python way", likely because my problem and use case don't fit the paradigm. I find it annoying that it is structured to make sense of things regardless of how faulty the code is, often producing incorrect computations instead of directly throwing an error, so I have to debug from a different perspective. I also tried experimenting with threading for general purposes, and I couldn't figure out a way to make it elegant (I had to type it out manually for every core). I also found its plotting functions cumbersome, having to do random patchwork for things like scaling in a particular way, and the methods for specifying output size don't make sense to me (first having dpi, then mentioning inches - why can't we use the SI system here, please?). I found LaTeX plotting far easier to learn and use. I just don't find it to be the choice for my use case.

I do not find C to be as cumbersome, ambiguous and, most importantly, unpredictable as Python (even for PyCUDA, I found the kernel code easier to write than the host code), but I would be interested to move to another language that's less alienating for non-software-engineers who still want to get deep into optimizations and explicit memory management (I still remember having issues with understanding C). I don't know if there are workarounds in C, but my use cases often need variable array lengths that are fixed before compiling. I am not sure if this can be done with some form of #define and #ifdef (I believe these are preprocessor directives rather than compiler arguments). I remember working on C code where, depending on such a condition, different parts of the code were compiled (a small sketch follows the links below). Relevant links:

https://stackoverflow.com/questions/22296810/compile-time-information-in-cuda

https://diveintosystems.org/book/C14-SharedMemory/openmp.html ( Implicit threading with OpenMP)
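On the compile-time array size question: the usual route in C (and in CUDA, which uses the same preprocessor) is a macro with a default value that can be overridden on the compiler command line, so the size is adjustable right up to compilation but constant afterwards. A minimal sketch, with the macro name TILE picked arbitrarily for illustration:

#include <stdio.h>

/* Default size; override at build time, e.g.  cc -DTILE=128 file.c  or  nvcc -DTILE=128 file.cu */
#ifndef TILE
#define TILE 256
#endif

static double buffer[TILE];   /* length fixed before compilation, not at runtime */

int main(void)
{
    for (int i = 0; i < TILE; ++i)
        buffer[i] = (double)i;
    printf("TILE = %d, last element = %f\n", TILE, buffer[TILE - 1]);
    return 0;
}

#ifdef / #if blocks work the same way for selecting between whole alternative code paths at compile time.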

I don't dislike Python as a programming language in general, and I don't intend to disrespect its choices (except maybe the package management, which I find unreliable). In fact I have been, and will be, using Python for my other professional and academic tasks. It was much easier to learn OOP in Python (I was never able to understand it in C, even though Python OOP is not necessarily the same concept), and I am pretty sure the boilerplate code for CUDA and OpenCL would have been even more verbose in C (but likely more understandable and customizable owing to proper documentation, which their Python bindings lack). Its choices are well suited for areas like web development (fast iteration on modular components), simple animations (Grant Sanderson has made an amazing math visualization library in Python), and even machine learning and AI. It's just that my hobby use case is more or less orthogonal to those, and I prefer to avoid it, as I find it very distracting from the core of my use case.

I have some prior background in computer architecture, C, Matlab, and Python, and I already have some parallel code written in CUDA and OpenCL. I love the fact that the CUDA framework gives me access to different memory hierarchies to optimize for fast code - explicit scratchpad memory allocation and some access to the read-only cache. I wish for even lower-level access to use large masks; the restrict keyword gives some access, but not predictably (a short sketch of these memory spaces follows the links below).

Why not stick to CUDA? Well, it's not truly heterogeneous, and I'm out of choices for embedded systems running CUDA (unless I could transpile the code - and the only other manufacturer with a reliable library is AMD, with its HIP platform), since Nvidia doesn't have embedded real-time processors with a low power footprint for my application, and double-precision compute cards are too expensive for the other area. In addition, while I really like the level of memory access I get, it is still a little abstracted in some areas: there is no reliable way to communicate with the CPU in the middle of a GPU process (like sending an interrupt-style message, which might be useful in scenarios where I wish to offload that thread back to the CPU asynchronously), and no way to truly optimize the caching scheme for the read-only cache backed by its uniform buffer, known as constant memory. Another area I am not sure of is the compile-time define I mentioned for C (where I wish to set array sizes with a define just before compilation).

I wish to move to a language that is more portable, open and scalable across different types of parallelizable architectures (manycore, multithread, SIMD, etc.). For now, I am more interested in the computation part than the interfacing part, so optimized buffer structures and interfacing with other components (especially asynchronous, but sometimes synchronous) are not a big issue. Most of my use cases would be on systems without a display anyway. Eventually this may matter when implementing real-time or very heterogeneous systems (if they become event-driven architectures), but it would be fine if I have to resort to combining with device-specific code or libraries in those cases, since fine tuning is specific to the hardware (things like hard drives have individual firmware for each batch, finely tuned to control the rotation speed, the motor controlling the head, etc.). Some relevant links:

https://stackoverflow.com/questions...tween-constant-memory-and-const-global-memory

https://stackoverflow.com/questions/43235899/cuda-restrict-tag-usage

https://forums.developer.nvidia.com...const-kernel-arguments-hurt-performance/61607
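For reference, a stripped-down CUDA sketch of the memory spaces mentioned above - constant memory, a shared-memory scratchpad, and const __restrict__ pointers (which allow loads to be routed through the read-only cache). The kernel just scales a signal by a small set of coefficients; the names and sizes are arbitrary, and it is meant to show the syntax rather than a tuned implementation:

#define NCOEF 16

__constant__ float coef[NCOEF];        // filled from the host with cudaMemcpyToSymbol(coef, ...)

__global__ void scale(const float * __restrict__ in, float * __restrict__ out, int n)
{
    __shared__ float tile[256];        // explicit per-block scratchpad; launch with blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = in[i];   // stage the input through shared memory
    __syncthreads();                        // every thread of the block reaches this barrier
    if (i < n) out[i] = tile[threadIdx.x] * coef[i % NCOEF];
}

The staging through shared memory buys nothing for a kernel this trivial; it only becomes worthwhile when threads of the same block reuse each other's data (stencils, tiled matrix products, and so on).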

These are my primary requirements for the language. There are other functionalities which I would be happy to have but which are not of primary concern right now. I am not familiar with the differences between static and dynamic languages. The tasks I wish to explore are not mainstream AI or machine learning, or even general linear algebra, so prior library support isn't a big concern, but it would be useful if available. Regarding language structure, I find some of the structures used for calculators, like concatenative and functional programming, nice. I am also curious about tacit programming. Auxiliary features like dynamic parallelism (probably for Mandelbrot simulation) are cool and useful but not a major necessity for my use case; I believe it is internally just some memory management that I would not mind handling myself, and in general I'd prefer the CPU to handle such complex conditions rather than the GPU. Relevant links:

https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles

I would like the language to have a fine balance between hardware-specific optimization and open-ended general-purpose coding, with the assumption that the compiler will do a good job from that point on (neither explicit POSIX threads nor super-implicit parallelization). Which leads to another rabbit hole - understanding compilers, especially performance optimization. Exploring uniform memory caching into the read-only cache in Vulkan (or constant memory in CUDA, using either the constant or restrict keyword), I am trying to see if I can improve performance further. I am unsure whether it gives me access to the low-level caching; the documentation only mentions at a high level that CUDA hardware will cache 8 KB of constant memory in the read-only cache at a given instant, without saying how or where it will be organized. I am also interested in variable workgroup sizes (I am not sure if this is possible in anything other than Vulkan) that can be specified before compilation. There are a couple of complexities in this, I believe: one is being able to pass proper indexes, and another is checking whether a thread should execute or not. I would most likely take the standard approach of the size being a power of two, with the internals checked by some other tracking variable. Say I want 30 threads to run: I would invoke 32 threads and use a comparison inside to execute only 30 of them and let the remaining two skip the computation (a sketch of this pattern follows the links below). I think there may be ways to state this implicitly, calling only 30 threads (I believe OpenCL allows this) and letting the compiler handle the task of optimizing it into internal hierarchies and inserting the comparisons as necessary; in Vulkan and SPIR-V this kind of compile-time parameter is called a specialization constant. I feel this could also be used for other comparisons to decide whether a thread should run, and whether it should be handled by a just-in-time compiler or fully compiled ahead of the task. Relevant links:

https://www.intel.com/content/www/u...top/compilation/specialization-constants.html

https://blogs.igalia.com/itoral/201...rmance-with-vulkans-specialization-constants/
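CUDA has no direct equivalent of specialization constants (the closest analogues are -D macros or C++ template parameters fixed when the kernel is compiled), but the round-up-and-guard pattern described above looks roughly like this; the names are arbitrary:

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;          // e.g. n = 30 with 32 threads: the last two exit immediately
    data[i] *= 2.0f;
}

// Host side: launch enough whole blocks to cover n and let the guard absorb the excess.
// int threads = 32;
// int blocks  = (n + threads - 1) / threads;   // ceil(n / threads)
// work<<<blocks, threads>>>(d_data, n);

In OpenCL you can enqueue a global work size of 30 directly (leaving the work-group size to the runtime), which corresponds to the implicit option mentioned above.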

I would like to know more about compilers in general (especially the machine-language translation part), more specifically optimization and branching, so that I can write my code in a more optimal way (and maybe someday also contribute to compiler design). I am unsure whether compilation time is significant compared with the time spent executing the threads in parallel, but I am curious about that as well. I believe the compile-time define mentioned earlier is handled at this stage too. I would like to know whether a modular compilation setup is possible: say most of my code is common, except that there is a choice between two functions depending on a compile-time define, and I may have to use the two functions at different times in the same program depending on other constraints. Would it be possible to have the main code compiled separately and stored (like a module), and then have one of the two functions compiled on demand and branched to? I am sure doing that has trade-offs (branching, and whether the entire code fits into cache), but I am curious whether there is a switch that controls how the compilation is done; I believe this is related to inlining, but I am not sure. I am also curious how compilers handle things like dynamic parallelism, how to parallelize while/do-while loops (where a comparison guards each iteration), and optimization of bitwise operations (efficient usage of registers and caches, depending on how the result is used further in the code).

Some relevant links:

https://stackoverflow.com/questions...ators-sometimes-faster-than-logical-operators

https://en.wikipedia.org/wiki/Inline_expansion

https://stackoverflow.com/questions/1546694/what-is-inlining



https://en.wikipedia.org/wiki/Computer_for_operations_with_functions

https://www.youtube.com/watch?v=eIMpgez61r4

https://www.youtube.com/watch?v=q8p1voEiu8Q

https://www.youtube.com/watch?v=Kr3U2Nz-UIc

Last up, just out of curiosity, I am also interested in learning more about operating systems for the same. I know to some extent that they block explicit access to cache management (both for stability and security, especially on consumer systems like Windows, macOS and Ubuntu), on top of things like input polling for devices like USB mice and keyboards and other latency hiding, as well as scheduling and out-of-order execution, often coupled with hardware-level programming. I am looking for a combination of code + hardware (and if possible + OS) that could give me more access while also being compatible with a fair number of devices.

I happen to like FreeBSD a lot, especially the ability to load and unload kernel modules (I know it has its own trade-offs, but it feels nicer to me). It could, I believe, be used to adapt to systems where the hardware comes on and off depending upon the scenario, for the energy-efficiency use case. I am not very comfortable with Linux (I am familiar with the bash shell); I find it quite fragmented in organization while also being bundled and having dependencies. It could be the distros I tried, but the package management has been quite messy (many ways to do the same thing, often requiring tweaks). It is also quite confusing to have to use su and sudo, etc., which superficially do very similar things (I know it is for admin purposes, but it still confuses me a lot). I tried to look into the ALSA audio stack to experiment with certain things internally and eventually gave up, as it was again a complicated structure to navigate (Microsoft Kernel Streaming and also WASAPI felt far easier to navigate, though fully low-level access is not available). I happen to like modularity and simple, clear, efficient code, and I don't find that in Linux. FreeBSD in comparison was so nice (in the short time I have used it), and the directory organisation made a lot of sense.

I am also curious about exploring OSes in general for these heterogeneous and parallel computing platforms. I find OSes like Barrelfish (a multikernel OS) very interesting as well, being able to support heterogeneous platforms, including noticing whether the hardware is awake or not. Of course it is currently a research project, but interesting nevertheless. I will probably stick to Linux for the most part due to the support and pre-built customizations available (like OpenWrt, etc.), but I am looking to explore whenever I have the chance, both inside and outside of Linux, especially kernels (multikernel, separation kernel, etc.). I am also curious about operating systems natively designed for manycore and heterogeneous processors, and about utilizing them to the full.

https://en.wikipedia.org/wiki/Self-modifying_code

https://www.youtube.com/watch?v=gnd9LPWv1U8

https://www.youtube.com/watch?v=Rl_NtL3vYZM

TLDR:

Overall, the three things I would like to learn, in decreasing order of priority:

1. Parallel programming languages for numerical computing with fewer distractions and better access to performance-improvement customizations. Speed of execution and energy efficiency are high priorities, and it would be a bonus if it could handle asynchronous tasks, like real-time sampling from sensors through buffers.

2. Learning more about compilers for such parallel programming architectures and languages, and about operating systems and kernels for parallel processing and heterogeneous architectures.

3. Different parallel processing architectures, for ALUs and for memory, the perks and pitfalls of each, and currently available implementations and devices.
 
Likes aaroman and Jarvis323
  • #2
Not intending to diverge much, but I am also curious about general pre-built high-performance frameworks like PETSc for physics simulations (primarily optics and fluids, a little bit of quantum/EM). I am currently familiar with Zemax, but looking to get into more foundational analysis.
 
  • #3
manueljenkin said:
Summary:: Looking for guidance on parallel programming languages for numerical computing with fewer distractions and better access to performance-improvement customizations. Speed of execution and energy efficiency are high priorities, and it would be a bonus if it could handle asynchronous tasks, like real-time sampling from sensors through buffers. Additionally interested in learning more about compilers and architectures for high-performance computing.

A lot of the work I am interested in will be built mostly from scratch by myself, provided there is fair support for numerical types (like complex numbers) and high-precision numerical operations (if not, I'll be happy to write those routines as well).
One came to PF prepared!

manueljenkin said:
1. Parallel programming languages for numerical computing with fewer distractions and better access to performance-improvement customizations. Speed of execution and energy efficiency are high priorities, and it would be a bonus if it could handle asynchronous tasks, like real-time sampling from sensors through buffers.

2. Learning more about compilers for such parallel programming architectures and languages, and about operating systems and kernels for parallel processing and heterogeneous architectures.

3. Different parallel processing architectures, for ALUs and for memory, the perks and pitfalls of each, and currently available implementations and devices.
That's asking for a lot.

Parallel programming has been around for a while - at least a couple of decades, and perhaps longer. In parallel, improvements have been made in hardware in terms of speed and storage capacity.

I've seen parallel code in Fortran and C++, and one can run simulations on HP supercomputers or clusters.

From item 1, it seems one would like to have a measurement system in parallel with a computational system, perhaps coupled with feedback. I've seen such systems for nuclear reactor control and fuel performance, as well as nuclear reactor simulations for different performance issues, but one could also develop such a system for an aircraft (especially a hypersonic craft) or spacecraft in its operational environment, or for a processing system such as alloy melting and refinement, or a given star, a group of stars, a galactic cloud, . . .

A lot depends on the type of system and how complex it is: for example, the set of differential equations (often partial and nonlinear) that describe the physics, how one chooses to solve that set of equations, how complicated the physics is, and how accurate one wants the solution to be. For example, one might be simulating a system under 'steady-state' conditions (over hours, days, months, years), but then introduce a short-term transient (that might occur over milliseconds, seconds, hours, or days, or even a combination of transient phenomena), which imposes hugely disparate time scales. Convergence of a solution can get pretty interesting.
 
Likes Jarvis323 and manueljenkin
  • #4
I don't think that the OP passes the Turing test.
 
  • #5
Astronuc said:
From item 1, it seems one would like to have a measurement system in parallel with a computational system, perhaps coupled with feedback. I've seen such systems for nuclear reactor control and fuel performance, as well as nuclear reactor simulations for different performance issues, but one could also develop such a system for an aircraft (especially a hypersonic craft) or spacecraft in its operational environment, or for a processing system such as alloy melting and refinement, or a given star, a group of stars, a galactic cloud,
Learning such adaptive/real-time systems is part of a long-term goal, but not an immediate requirement for me. Right now, I am more concerned about writing code (currently for signal processing tasks) that is both performant and energy efficient, assuming the underlying hardware has parallelism (of different kinds). And I am looking to learn the entire system (code -> language -> compiler -> hardware) to varying extents, both out of curiosity and out of interest in optimizing better (and contributing to different domains if there are active projects).

pbuk said:
I don't think that the OP passes the Turing test.
I wish my computer could have come up with this massive write-up by itself, without me having to frame every specification over several days (and taking the effort to learn at least one of the languages to a fair extent). I made it long to accommodate every relevant domain I could think of that could hold back improvements in other areas by being the weak link, and added reference links so that people reading this thread could benefit even if they couldn't contribute.
 
Likes pbuk
  • #6
OK, I am going to take this at face value. You need to focus, and here are some suggestions.

You need a proper academic grounding in numerical methods; try https://ocw.mit.edu/courses/mathema...on-to-numerical-methods-spring-2019/index.htm (your comment "NumPy can accumulate errors and show nonzero value where it should be zero" makes it clear that you have some gaps here: those errors are a necessary consequence of finite-precision arithmetic, and the fact that Matlab decides to hide them from you is not always desirable).
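A one-line illustration of what finite precision does, independent of any particular library (plain C here, but the same holds in NumPy, Matlab or Julia):

#include <stdio.h>

int main(void)
{
    double x = 0.1 + 0.2 - 0.3;   /* exactly zero in real arithmetic */
    printf("%.17g\n", x);         /* prints about 5.5511151231257827e-17 in IEEE 754 double */
    return 0;
}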

Languages to focus on first are Julia (which is used by the course above) and C(++) with CUDA.

You need to ensure that you do not work in isolation; if you can afford it then subscription to the ACM with the Digital Library, and in particular access to Transactions on Parallel Computing could be one way to broaden and deepen your resources.
 
Last edited:
Likes manueljenkin, Jarvis323, Tom.G and 1 other person
  • #7
pbuk said:
OK, I am going to take this at face value. You need to focus, and here are some suggestions.

You need a proper academic grounding in numerical methods; try https://ocw.mit.edu/courses/mathema...on-to-numerical-methods-spring-2019/index.htm (your comment "NumPy can accumulate errors and show nonzero value where it should be zero" makes it clear that you have some gaps here: those errors are a necessary consequence of finite-precision arithmetic, and the fact that Matlab decides to hide them from you is not always desirable).

Languages to focus on first are Julia (which is used by the course above) and C(++) with CUDA.

You need to ensure that you do not work in isolation; if you can afford it then subscription to the ACM with the Digital Library, and in particular access to Transactions on Parallel Computing could be one way to broaden and deepen your resources.
A few months later, while learning convex optimisation, I have started loving Julia (also functional programming - I have been learning a bit of Haskell as well - and the ability to generate optimal code in one line). I also realized that there's this powerful thing called LLVM!

To anyone with a similar dilemma to mine: I highly recommend trying out Julia. It is very fast.
 
Last edited:
  • #8
How do beginners get started in any field?

Most sign up for a course. Most do not use independent study and asking questions as their primary strategy.
 
  • #9
anorlunda said:
How do beginners get started in any field?

Most sign up for a course. Most do not use independent study and asking questions as their primary strategy.
Agreed. I made this post after finishing a course in CUDA and OpenCL.
 

